Critical - Netdata CPU Leak - had to shut it off on hundreds of nodes

Recently Netdata has been going rogue and we had to completely turn it off on hundreds of nodes. It often climbs to more than half of the available CPU resources, and CPU PSI goes from 1% to over 10%.
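For anyone who wants to watch the same CPU pressure figure on their own nodes, here is a minimal sketch that reads the kernel's PSI file directly. It assumes a Linux kernel with PSI enabled, so that `/proc/pressure/cpu` exists, and it assumes the percentages quoted above correspond to the `some` line's `avg10` value, which the post does not state. The function names are made up for illustration.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_cpu_psi(path="/proc/pressure/cpu"):
    """Read and parse the CPU pressure file (assumes PSI is enabled)."""
    return parse_psi(Path(path).read_text())
```

Calling `read_cpu_psi()["some"]["avg10"]` once with Netdata running and once with it stopped should give a rough before/after comparison.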

Previous posts did not provide any usable information on how to identify the root cause in this case.

Which Netdata version are you using?

v2.0.0-182-nightly mostly

hm… A couple of days ago we merged a big set of changes. We reworked everything about streaming to lower CPU consumption at scale and increase Netdata’s ability to handle many more children. We have already found and fixed a couple of problems, and we are currently testing these changes.

If the tests finish without issues, today we will merge a new nightly with all the fixes.

Are you able to build from source? Do you want to try the fixes somewhere before we merge them, or do you prefer to wait for the next nightly (most likely tomorrow)?

btw, the new version has more replication threads. Previously there was only one replication thread (and you had to increase it manually in netdata.conf), but now the number scales with the number of cores you have.

If you experience increased CPU because you restarted a parent or some children, this is normal. It will finish and settle back to normal.

btw, tests are going well. Most likely tonight the new fixes will be merged.

We will be changing to stable and setting strict resource limitations for Netdata Agent. Is there a reason why it doesn’t restrict itself by default?
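As a rough illustration of what a hand-applied limit could look like, the sketch below confines an already running process to one core and optionally moves it to the Linux `SCHED_IDLE` policy, so it only runs when nothing else wants the CPU. This is an assumption about one possible approach, not a description of how Netdata or the poster does it. The helper names are hypothetical. Both calls act on a single thread, so the sketch walks every thread ID under `/proc/<pid>/task`; changing another user's process typically needs root.

```python
import os
from pathlib import Path


def thread_ids(pid):
    """List the thread IDs of a process from /proc/<pid>/task (Linux only)."""
    return [int(entry.name) for entry in Path(f"/proc/{pid}/task").iterdir()]


def confine(pid, core, idle=True):
    """Pin every existing thread of `pid` to `core`; optionally use SCHED_IDLE.

    Threads created later inherit the settings of the thread that spawns them.
    """
    for tid in thread_ids(pid):
        try:
            os.sched_setaffinity(tid, {core})
            if idle:
                os.sched_setscheduler(tid, os.SCHED_IDLE, os.sched_param(0))
        except ProcessLookupError:
            continue  # the thread exited while we were iterating
```

For example, `confine(netdata_pid, 0)` would restrict the process with the (hypothetical) ID `netdata_pid` to core 0 at idle priority. The settings do not survive a restart of the process.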

@Slind14 Can you check which threads are using CPU? See how to do it with htop.

That doesn’t seem to work:

[image]

What doesn’t seem to work? Why did you decide to show ebpf.plugin threads?

I apologize for the confusion. I was simply using that htop link as an example to demonstrate how to identify CPU-intensive threads. I didn’t intend to ask for a breakdown of ebpf.plugin threads. Can you find which threads use a lot of CPU?
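If htop is not convenient, the same per-thread information can be read from `/proc`. The following is a minimal sketch under a few assumptions: Linux only, the function names are made up for illustration, and the Netdata PID has to be supplied by the caller. It samples the CPU ticks of each thread twice and reports the busiest ones over the interval.

```python
import os
import time
from pathlib import Path


def parse_task_stat(stat_line):
    """Return (thread_name, cpu_ticks) from a /proc/<pid>/task/<tid>/stat line.

    The name is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first "(" and the last ")".
    """
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    name = stat_line[start + 1:end]
    rest = stat_line[end + 1:].split()
    # Fields 14 (utime) and 15 (stime) of the full line; rest[0] is field 3.
    return name, int(rest[11]) + int(rest[12])


def snapshot(pid):
    """Map thread ID -> (name, cpu_ticks) for every thread of `pid`."""
    ticks = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            ticks[int(task.name)] = parse_task_stat((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listing and reading
    return ticks


def busiest_threads(pid, interval=5.0, top=10):
    """Return [(cpu_percent, tid, name), ...] sorted by CPU use, busiest first."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = snapshot(pid)
    time.sleep(interval)
    after = snapshot(pid)
    usage = []
    for tid, (name, cpu) in after.items():
        # Threads that appeared during the interval are counted as 0%.
        previous = before.get(tid, (name, cpu))[1]
        usage.append((100.0 * (cpu - previous) / hz / interval, tid, name))
    return sorted(usage, reverse=True)[:top]
```

Printing the result of `busiest_threads(pid)` for the Netdata process gives a list that can be pasted as text instead of a screenshot.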

We already capped it, and I can’t easily get a reading from before the kernel priority adjustments and constraints. Here is the current snapshot, where it is limited to idle resources and one core. With this I see it constantly spending over 90% CPU on health, while the other threads fluctuate a lot and average out low.

After some time:

[image: limited to idle resources and one core]

Are those screenshots from the Parent or Child instance?

These are on children without parents, so single-instance setups. We have decided to uninstall Netdata for now (we primarily use node_exporter for these 250 instances anyway). Unfortunately we could not constrain it reliably and can’t put in the time for it now, and it causes too much risk.

Just wanted to let you know that on the test node where we kept it, v2.0.0-205-nightly is still consistently consuming 5-6 cores.

All pending issues have been fixed. Can you please install the latest nightly? Is it working as it should after this?