Recently Netdata has been going rogue on us and we had to turn it off completely on hundreds of nodes. It often climbs to using more than half of the available CPU resources, with CPU PSI going from 1% to over 10%.
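(For context, the PSI figures above are the kernel's CPU pressure-stall numbers; they can be read like this, with the output line below being just a sample:)

```sh
# CPU pressure stall information; avg10/avg60/avg300 are the percentage of
# time (10s/60s/300s averages) in which some tasks were stalled on the CPU.
cat /proc/pressure/cpu
# some avg10=10.42 avg60=8.91 avg300=6.30 total=123456789
```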
hm… A couple of days ago we merged a big set of changes. We reworked everything about streaming to lower CPU consumption at scale and increase Netdata’s ability to handle a lot more children. We have already found and fixed a couple of problems, and we are currently testing these changes.
If the tests finish without issues, we will merge a new nightly with all the fixes today.
Are you able to build from source? Do you want to try the changes somewhere before we merge, or do you prefer to wait for the next nightly (most likely tomorrow)?
btw, the new version has more replication threads. Previously there was only one replication thread (and you had to increase it manually in netdata.conf), but now the count scales with the number of cores you have.
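For reference, raising it manually looked something like this; treat it as a sketch, since the exact section and key name may differ between versions (check your own netdata.conf):

```ini
[db]
    # illustrative value; the default used to be a single thread
    replication threads = 4
```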
If you see increased CPU after restarting a parent or some children, this is normal; it will finish and settle back down.
btw, tests are going well. Most likely the new fixes will be merged tonight.
What doesn’t seem to work? Why did you decide to show ebpf.plugin threads?
I apologize for the confusion. I was only using that htop link as an example of how to identify CPU-intensive threads; I didn’t intend to ask for a breakdown of the ebpf.plugin threads. Can you find out which threads use a lot of CPU?
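Something along these lines should give a per-thread breakdown, assuming a Linux host with GNU ps (Netdata names its threads, so the command column identifies each one):

```sh
# Per-thread CPU usage of the netdata daemon, busiest threads first
ps -L -p "$(pidof netdata)" -o tid,pcpu,comm --sort=-pcpu | head -n 20
```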
We already capped it, and I can’t easily get a reading from before the kernel priority adjustments and constraints. Here is the current snapshot, where it is limited to idle resources and one core. With this I see it constantly spending over 90% CPU on health, while the other threads fluctuate a lot and average out low.
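For reference, the cap is a systemd drop-in roughly along these lines (path and values illustrative, not our exact unit), applied with a daemon-reload and a restart of netdata:

```ini
# e.g. /etc/systemd/system/netdata.service.d/limits.conf
[Service]
# only schedule netdata when the CPU is otherwise idle
CPUSchedulingPolicy=idle
# pin to a single core
CPUAffinity=0
```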
These are children without parents, so single-instance setups. We have decided to uninstall Netdata for now (we primarily use node_exporter on these 250 instances anyway); unfortunately we could not constrain it reliably, we can’t put in the time for it right now, and it poses too much risk.