While trying to figure out why some of our physical servers were seeing unusually high load for an unusually long time, I noticed that netdata consumes at least 100% of one core on the VMs running on those physical servers.
See for yourself:
Those are just 4 of the VMs affected, but I saw the same phenomenon on every single VM that runs netdata. I’ve since updated the VMs and rebooted them, all to no avail: netdata still eats at least one core.
Any ideas what’s wrong there?
The affected VMs run an up-to-date Debian bullseye, and the same goes for netdata:
$ dpkg -l netdata
ii netdata 1.44.0-205-nightly amd64 real-time charts for system monitoring
Hey, consider using htop to find out which thread that is.
How long has it been at 100%?
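If htop is not installed on the VM, a rough equivalent can be read straight from /proc. The following is only a sketch: the function name `thread_cpu` is made up for this example, it assumes a Linux /proc filesystem, and it ranks threads by cumulative CPU ticks since the process started rather than by current usage. For a thread that has been pegged for a long time, that should still put it at the top.

```shell
#!/bin/sh
# thread_cpu is a hypothetical helper, not part of Netdata.
# It lists up to 10 threads of one process, busiest first,
# as: <cumulative CPU ticks> <thread id> <thread name>
thread_cpu() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/stat" ] || continue
        tid="${task##*/}"
        name="$(cat "$task/comm" 2>/dev/null)"
        # Drop everything up to the closing ")" of the name field, so a
        # thread name containing spaces cannot shift the field positions.
        rest="$(sed 's/^.*) //' "$task/stat" 2>/dev/null)"
        # After that cut, utime is field 12 and stime is field 13.
        ticks="$(printf '%s\n' "$rest" | awk '{ print $12 + $13 }')"
        printf '%s\t%s\t%s\n' "${ticks:-0}" "$tid" "$name"
    done | sort -rn | head -n 10
}

# Example call (assumes the daemon process is named "netdata"):
#   thread_cpu "$(pidof -s netdata)"
```

The thread name in the third column is what to report back when asking for help.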
Difficult to say, but judging from the limited historical metrics that Proxmox offers, this seems to have been going on for a couple of weeks, if not longer.
In the netdata console, “Systemd Services CPU utilization” is where I see that netdata is eating 100% CPU, but I didn’t find a way to view historical data there.
Hey, it seems I’m experiencing the same problem here.
One of my agents spends 100% of a single core in an ACLKSYNC thread.
Netdata 1.44.1, RockyLinux 9.
rm -rf /var/cache/netdata/* solved the problem.
@udotirol hey, do you see 100% right after Netdata restart or after some time?
It happens practically instantly, whether I reboot the entire VM or just restart Netdata.
@udotirol are you on Discord? If so, can you please join the server? It is the same issue, and it will be easier to debug it there.
yes, I’ve just joined the server right now. But as @avh mentioned, a workaround seems to be to remove /var/cache/netdata
I haven’t done so yet, in case you want to debug the issue. If so, I am happy to proceed on Discord
in case you want to debug the issue
Nice, because Jacob (Discord OP) removed
@udotirol can you please send your
ok, I’ve just sent the email.
Just registered to follow this. We have netdata on a few dozen servers in an LSF environment: CentOS 7-based 64-core Epyc servers with 1TB+ memory, generally running 100% loaded. I see netdata at 100% to 300% CPU usage, with memory sometimes reaching around 50GB.
Tried clearing out /var/cache/netdata and initial results are promising; any thoughts on what we’re losing by clearing that out?
Edit – I see that what I’m losing is my history of usage :). That’s a bit of a loss.
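Since clearing the directory throws away the stored history, and the maintainers above wanted the files for debugging, one option is to move the contents aside instead of deleting them. This is a sketch only: `set_aside_cache` is a made-up helper name, and the thread does not confirm whether restoring the old files later would bring the CPU problem back. It moves the entries rather than the directory itself, so the ownership and permissions of the cache directory stay as they are.

```shell
#!/bin/sh
# set_aside_cache is a hypothetical helper, not part of Netdata.
# It moves every entry of a directory into a timestamped sibling
# directory and prints the path of that backup directory.
set_aside_cache() {
    dir="$1"
    backup="${dir%/}.bak-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup" || return 1
    # -mindepth 1 skips the directory itself; -maxdepth 1 moves only the
    # top-level entries (dotfiles included), keeping subtrees intact.
    find "$dir" -mindepth 1 -maxdepth 1 -exec mv -- {} "$backup"/ \; || return 1
    printf '%s\n' "$backup"
}

# Intended use, assuming a systemd unit named "netdata" (run as root):
#   systemctl stop netdata
#   set_aside_cache /var/cache/netdata
#   systemctl start netdata
```

Because the backup is created next to the original directory, the move should normally stay on the same filesystem and not need extra disk space.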