NETDATA Stop Responding/ System Stuck - heartbeat clock: woke up 3863941 microseconds later than expected

Hi,

I am having an issue with Time/Clock ?
I am getting a lot of heartbeat clock: woke up 3863941 microseconds later than expected and the systems is crashing

Timedatectl and chronyd is ok and synced, what should I check ?

Feb 09 15:56:36 NetData ebpf.plugin[1966]: heartbeat clock: woke up 3886715 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 09 15:56:36 NetData apps.plugin[1967]: heartbeat clock: woke up 3863941 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).

Due to that i have disable netdata because causing a lot of CPU HIGH and systems is stuck

Hello @pgro ,

We are reworking with eBPF these days ([Feat]/[BUG]: eBPF memory improvement · Issue #14477 · netdata/netdata · GitHub), so let me do a question about your environment to understand the motive you are having these erros, because before 1.38 was released, I modified one of these plugins. It is possible they have a common issue, because they use the same librariies.

Which version of netdata you are running right now?
How many “PIDs” you have in your system when you run the command ?

root@hades:/home/thiago/Netdata/go.d.plugin# ls -l /proc/ | grep ^d | wc -l
318

best regards!

Thank you for your response,

I was updated to 1.38 but the issue was discovered back to 23/Nov with 1.37 version but still then I didn’t had CPU HOGs and slowdown. At the moment I reverted back with apt command to 1.37 latest night

rsion: netdata v1.37.0-260-nightly
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${pre
fix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--l
ibexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g
 -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_S
OURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

Regarding Procs

root@NetData:/etc/netdata# ls -l /proc/ | grep ^d | wc -l
136

This is a Ubuntu

root@NetData:/etc/netdata# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal

Even with 1.37 I am still having issues and i am forced to stop the service

EDIT : This is also happening with other module/plugin

Feb 09 17:16:35 NetData nfacct.plugin[4639]: heartbeat clock: woke up 2571663 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).

I got an additional Kernel Panic to all this as below

eb 10 14:05:59 NetData apps.plugin[16039]: Using now_boottime_usec() for uptime (dt is 5 ms)
Feb 10 14:06:23 NetData nfacct.plugin[16043]: heartbeat clock: woke up 894116 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:06:23 NetData ebpf.plugin[16044]: heartbeat clock: woke up 831291 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:06:42 NetData nfacct.plugin[16043]: heartbeat clock: woke up 506099 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:06:42 NetData ebpf.plugin[16044]: heartbeat clock: woke up 1359574 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:06:59 NetData sudo[15785]: pam_unix(sudo:session): session closed for user root
Feb 10 14:07:23 NetData kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:0:15107]
Feb 10 14:07:23 NetData kernel: Modules linked in: ufs msdos xfs wireguard ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink_acct nfnetlink isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua binfmt_misc joydev input_leds serio_raw sch_fq_codel msr efi_pstore ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid bochs_drm drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm psmouse virtio_net net_failover failover virtio_blk floppy
Feb 10 14:07:23 NetData kernel: CPU: 1 PID: 15107 Comm: kworker/1:0 Not tainted 5.4.0-137-generic #154-Ubuntu
Feb 10 14:07:23 NetData kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Feb 10 14:07:23 NetData kernel: Workqueue: events drm_fb_helper_dirty_work [drm_kms_helper]
Feb 10 14:07:23 NetData kernel: RIP: 0010:smp_call_function_many+0x205/0x270
Feb 10 14:07:23 NetData kernel: Code: e8 a0 8d 93 00 3b 05 4e 82 b0 01 89 c7 0f 83 9b fe ff ff 48 63 c7 48 8b 0b 48 03 0c c5 80 e9 64 b9 8b 41 18 a8 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c8 89 cf 48 c7 c2 60 9f e4 b9 4c 89 fe
Feb 10 14:07:23 NetData kernel: RSP: 0018:ffffa9e680a77c70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Feb 10 14:07:23 NetData kernel: RAX: 0000000000000003 RBX: ffff961a3eab0e00 RCX: ffff961a3ea37100
Feb 10 14:07:23 NetData kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Feb 10 14:07:23 NetData kernel: RBP: ffffa9e680a77cb0 R08: ffff961a3a434460 R09: 0000000000000000
Feb 10 14:07:23 NetData kernel: R10: ffff961a3a434888 R11: 0000000000000000 R12: ffffffffb82829b0
Feb 10 14:07:23 NetData kernel: R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000004
Feb 10 14:07:23 NetData kernel: FS:  0000000000000000(0000) GS:ffff961a3ea80000(0000) knlGS:0000000000000000
Feb 10 14:07:23 NetData kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 10 14:07:23 NetData kernel: CR2: 000000c000c48000 CR3: 000000017eb70000 CR4: 00000000000006e0
Feb 10 14:07:23 NetData kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 10 14:07:23 NetData kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 10 14:07:23 NetData kernel: Call Trace:
Feb 10 14:07:23 NetData kernel:  ? load_new_mm_cr3+0xf0/0xf0
Feb 10 14:07:23 NetData kernel:  on_each_cpu+0x2d/0x60
Feb 10 14:07:23 NetData kernel:  flush_tlb_kernel_range+0x38/0x90
Feb 10 14:07:23 NetData kernel:  __purge_vmap_area_lazy+0x70/0x6d0
Feb 10 14:07:23 NetData kernel:  free_vmap_area_noflush+0xe1/0xf0
Feb 10 14:07:23 NetData kernel:  remove_vm_area+0x9a/0xb0
Feb 10 14:07:23 NetData kernel:  iounmap+0x86/0xc0
Feb 10 14:07:23 NetData kernel:  ttm_bo_kunmap+0xac/0xd0 [ttm]
Feb 10 14:07:23 NetData kernel:  drm_gem_vram_object_vunmap+0x24/0x40 [drm_vram_helper]
Feb 10 14:07:23 NetData kernel:  drm_gem_vunmap+0x28/0x50 [drm]
Feb 10 14:07:23 NetData kernel:  drm_client_buffer_vunmap+0x1a/0x30 [drm]
Feb 10 14:07:23 NetData kernel:  drm_fb_helper_dirty_work+0x18f/0x1b0 [drm_kms_helper]
Feb 10 14:07:23 NetData kernel:  process_one_work+0x1eb/0x3b0
Feb 10 14:07:23 NetData kernel:  worker_thread+0x4d/0x400
Feb 10 14:07:23 NetData kernel:  kthread+0x104/0x140
Feb 10 14:07:23 NetData kernel:  ? process_one_work+0x3b0/0x3b0
Feb 10 14:07:23 NetData kernel:  ? kthread_park+0x90/0x90
Feb 10 14:07:23 NetData kernel:  ret_from_fork+0x35/0x40
Feb 10 14:07:23 NetData ebpf.plugin[16044]: heartbeat clock: woke up 30679669 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:07:23 NetData nfacct.plugin[16043]: heartbeat clock: woke up 31708706 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:07:33 NetData nfacct.plugin[16043]: heartbeat clock: woke up 7131562 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:07:33 NetData ebpf.plugin[16044]: heartbeat clock: woke up 9105841 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:07:46 NetData ebpf.plugin[16044]: heartbeat clock: woke up 11209102 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
Feb 10 14:07:46 NetData apps.plugin[16039]: heartbeat clock: woke up 10356680 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).

Hello @pgro,

About this kernel panic, do you have any netdata core dump on your system? If you have, please, can you share more details about it?

Best regards!

@pgro hi!

Also, besides the info from Thiago, can you please export a snapshot (Import, export, and print a snapshot | Learn Netdata) while you’re observing the issue and send it to manolis@netdata.cloud?

Thanks

Hi Manoli,

This is the latest log from journalctl. At the moment I am not facing any stuck or slowdown. But I had to configure also chrony daemon , I am not sure if this might be issue running under Proxmox Hypervisor.

Feb 14 02:47:10 NetData nfacct.plugin[38314]: NFACCT process exiting
Feb 14 06:23:55 NetData ebpf.plugin[34168]: heartbeat clock: woke up 881445 microseconds later than expected (can be due to s
ystem load or the CLOCK_REALTIME set to the future).
Feb 14 06:23:55 NetData apps.plugin[34166]: heartbeat clock: woke up 762508 microseconds later than expected (can be due to s
ystem load or the CLOCK_REALTIME set to the future).
Feb 14 06:43:05 NetData systemd[1]: Stopping Real time performance monitoring...
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34211 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34214 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34209 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34217 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34215 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34213 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34208 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34210 finished
Feb 14 06:43:05 NetData ebpf.plugin[34168]: thread with task id 34212 finished
Feb 14 06:43:06 NetData apps.plugin[41906]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Feb 14 06:43:06 NetData apps.plugin[41906]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Wi
ll try '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 14 06:43:06 NetData apps.plugin[41906]: Loaded config file '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 14 06:43:06 NetData apps.plugin[41906]: started on pid 41906
Feb 14 06:43:06 NetData apps.plugin[41906]: set name of thread 41911 to APPS_READER
Feb 14 06:43:07 NetData apps.plugin[41906]: Using now_boottime_usec() for uptime (dt is 2 ms)
Feb 14 06:43:11 NetData systemd[1]: netdata.service: Succeeded.
Feb 14 06:43:11 NetData systemd[1]: Stopped Real time performance monitoring.
Feb 14 06:43:11 NetData systemd[1]: Started Real time performance monitoring.
Feb 14 06:43:11 NetData netdata[41939]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with
internal defaults.
Feb 14 06:43:11 NetData netdata[41939]: 2023-02-14 06:43:11: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/li
b/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 14 06:43:14 NetData ebpf.plugin[42226]: Does not have a configuration file inside `/etc/netdata/ebpf.d.conf. It will try
to load stock file.
Feb 14 06:43:14 NetData perf.plugin[42224]: no charts enabled - nothing to do.
Feb 14 06:43:14 NetData apps.plugin[42223]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Feb 14 06:43:14 NetData apps.plugin[42223]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Wi
ll try '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 14 06:43:14 NetData apps.plugin[42223]: Loaded config file '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 14 06:43:14 NetData apps.plugin[42223]: started on pid 42223
Feb 14 06:43:14 NetData apps.plugin[42223]: set name of thread 42244 to APPS_READER
Feb 14 06:43:14 NetData ebpf.plugin[42226]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Wi
ll try '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42264
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42264 to EBPF HARDIRQ
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42267
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42267 to EBPF SHM
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42262
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42262 to EBPF FD
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42265
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42265 to EBPF SOFTIRQ
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42266
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42266 to EBPF OOMKILL
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread with task id 42266 finished
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42259
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42259 to EBPF SYNC
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42260
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42260 to EBPF SWAP
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42261
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42261 to EBPF MOUNT
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42258
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42258 to EBPF CACHESTAT
Feb 14 06:43:14 NetData ebpf.plugin[42226]: thread created with task id 42257
Feb 14 06:43:14 NetData ebpf.plugin[42226]: set name of thread 42257 to EBPF PROCESS
Feb 14 06:43:15 NetData apps.plugin[42223]: Using now_boottime_usec() for uptime (dt is 6 ms)
Feb 14 06:44:15 NetData ebpf.plugin[42226]: heartbeat clock: woke up 18534517 microseconds later than expected (can be due to
 system load or the CLOCK_REALTIME set to the future).
Feb 14 10:43:15 NetData nfacct.plugin[42230]: NFACCT process exiting

Snapshot sent by email.

Thank you so far for your support.

Hi @Thiago_Marques_0

Kernel panic exist also as core dump but not happened all the time. So far the below files have been created, would you like to send them?

Unfortunately with gdb $(which netdata) …file… didn’t work

NetData:/var/crash$ ls -lah
total 38M
drwxrwxrwt  2 root    root    4.0K Feb  9 14:52 .
drwxr-xr-x 13 root    root    4.0K Oct 26 00:40 ..
-rw-r-----  1 root    root    4.0M Feb  9 14:52 _usr_libexec_netdata_plugins.d_ebpf.plugin.0.crash
-rw-r-----  1 root    root     17M Feb  9 14:25 _usr_sbin_netdata.0.crash
-rw-r-----  1 netdata netdata  18M Feb  9 09:12 _usr_sbin_netdata.113.crash

Gday Everyone,

Do we have any clue why this happening? Does this happening due to Clock? or any misconfiguration with VM ?


 1 eb 17 16:42:22 NetData ebpf.plugin[74534]: set name of thread 74578 to EBPF MOUNT                                                                                                                             
  2 Feb 17 16:42:22 NetData ebpf.plugin[74534]: thread created with task id 74577                                                                                                                                 
  3 Feb 17 16:42:22 NetData ebpf.plugin[74534]: set name of thread 74577 to EBPF SWAP                                                                                                                             
  4 Feb 17 16:42:23 NetData apps.plugin[74524]: Using now_boottime_usec() for uptime (dt is 4 ms)                                                                                                                 
  5 Feb 17 16:43:07 NetData ebpf.plugin[74534]: heartbeat clock: woke up 1428778 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future

Hi everyone,

Seeing my logs flooded with these recently as well. Otherwise, things appear to be working normally

Hello @itsthejb ,

Yesterday we merged a PR fixing some issues like this you described, so I have some questions for you:

1 - Which Linux distribution are you running?
2 - What is your netdata version netdata -V?
3 - How did you install netdata?
4 - eBPF monitors inside kernel, so we also need to know your kernel version uname -a?

Thank you to contact us!

Hi there,

Happy to help!

1 - Ubuntu 22.10
2 - v1.38.1
3 - Through Ansible (GitHub - netdata/community: Netdata-powered applications and examples. For the community, by the community.), and cron daily updater
4 - Linux <****> 5.19.0-40-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 23 21:39:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Please let me know if you need anything else

Thank you for your answers.

Considering that you are running our current stable version, you still do not have the bug fix on your hosts.
We are preparing a new release for next days and as soon your computer updates, you will have the bug fixed.

Best regards!

Glad to hear it! Thanks

Hi there. It’s been some time, and still experiencing this issue. Newest version (current=00104000000070 >= latest=00104000000070) is already installed

:man_shrugging:

Months later and no movement on this issue. In my case, I’ve just disabled the ebpf plugin…

Hello @itsthejb ,

We are working to fix these issues with the PR Improve eBPF performance by thiagoftsm · Pull Request #16030 · netdata/netdata · GitHub.
We are converting threads to use the function functionality and these won’t happen anymore. We are also reducing memory usage with changes.

Best regards!