Critical - Netdata CPU Leak - had to shut it off on hundreds of nodes

Slind14 · December 8, 2024, 4:35pm

Recently, Netdata has been going rogue, and we had to turn it off completely on hundreds of nodes. It often climbs to more than half of the available CPU resources, with CPU PSI going from 1% to over 10%.

Previous posts did not provide any usable information on how to identify the root cause in this instance.
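For anyone who wants to reproduce the PSI reading on their own nodes, here is a minimal sketch. It assumes a Linux kernel with PSI enabled, which exposes `/proc/pressure/cpu`; the function names `parse_psi` and `read_cpu_psi` are made up for this example and are not part of Netdata.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into a dict such as {"some": {"avg10": 1.23, ...}}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_cpu_psi(path="/proc/pressure/cpu"):
    """Read the current CPU pressure values (assumes PSI is enabled)."""
    return parse_psi(Path(path).read_text())


if __name__ == "__main__":
    some = read_cpu_psi()["some"]
    # avg10 is the share of the last 10 seconds in which tasks waited for CPU
    print(f"CPU pressure: avg10={some['avg10']}% avg60={some['avg60']}%")
```

Comparing `avg10` with the agent running and with it stopped should show whether the pressure follows the agent.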

Costa_Tsaousis · December 8, 2024, 4:37pm

Which Netdata version are you using?

Slind14 · December 8, 2024, 4:39pm

v2.0.0-182-nightly mostly

Costa_Tsaousis · December 8, 2024, 5:07pm

hm… A couple of days ago we merged a big set of changes. We reworked everything about streaming to lower CPU consumption at scale and increase Netdata’s ability to handle a lot more children. We have already found and fixed a couple of problems, and we are currently testing these changes.

If the tests finish without issues, today we will merge a new nightly with all the fixes.

Are you able to build from source? Do you want to try them somewhere before we merge, or would you prefer to wait for the next nightly (most likely tomorrow)?

Costa_Tsaousis · December 8, 2024, 6:23pm

btw, the new version has more replication threads. Previously there was only 1 replication thread (and you had to increase the number manually in netdata.conf), but now there are more, depending on the number of cores you have.

If you experience increased CPU usage because you restarted a parent or some children, this is normal. Replication will finish and CPU usage will settle back to normal.

btw, tests are going well. Most likely tonight the new fixes will be merged.

Slind14 · December 9, 2024, 1:04am

We will be switching to stable and setting strict resource limits for the Netdata Agent. Is there a reason why it doesn’t restrict itself by default?

ilyam8 · December 9, 2024, 9:38am

@Slind14 Can you check which threads are using CPU? See how to do it with htop.
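If htop is not available on a node, the same per-thread view can be approximated from `/proc`. The sketch below is only an illustration, not a Netdata tool: it assumes Linux, takes the agent's PID as an argument, and samples `/proc/<pid>/task/<tid>/stat` twice to estimate CPU per thread. The helper names are hypothetical.

```python
import os
import sys
import time
from pathlib import Path


def parse_stat(line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces or ")".
    name = line[line.index("(") + 1 : line.rindex(")")]
    rest = line[line.rindex(")") + 1 :].split()
    # rest[0] is field 3 (state), so utime (14) and stime (15) are rest[11], rest[12].
    return name, int(rest[11]) + int(rest[12])


def snapshot(pid):
    """Map thread id -> (name, cpu_ticks) for every thread of the process."""
    ticks = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            ticks[int(task.name)] = parse_stat((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were reading
    return ticks


def busiest_threads(before, after, interval, hz):
    """Return (cpu_percent, tid, name) tuples, busiest first."""
    rows = []
    for tid, (name, used) in after.items():
        # Threads that appeared after the first sample are reported as 0%.
        previous = before.get(tid, (name, used))[1]
        rows.append((100.0 * (used - previous) / hz / interval, tid, name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    pid, interval = int(sys.argv[1]), 5.0
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = snapshot(pid)
    time.sleep(interval)
    for cpu, tid, name in busiest_threads(before, snapshot(pid), interval, hz)[:10]:
        print(f"{cpu:6.1f}%  tid={tid:<8} {name}")
```

Run it as root (or as the user that owns the process) with the PID of the main netdata process; 100% corresponds to one fully used core.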

Slind14 · December 9, 2024, 3:40pm

That doesn’t seem to work:

ilyam8 · December 9, 2024, 5:51pm

~~What doesn’t seem to work? Why did you decide to show ebpf.plugin threads?~~

I apologize for the confusion. I was simply using that htop link as an example to demonstrate how to identify CPU-intensive threads. I didn’t intend to ask for a breakdown of ebpf.plugin threads. Can you find which threads use a lot of CPU?

Slind14 · December 9, 2024, 8:53pm

We already capped it, and I can’t easily get a reading from before the kernel priority adjustments and constraints. Here is the current snapshot, where it is limited to idle resources and one core. With this, I see it constantly spending over 90% CPU on health, while the other threads fluctuate a lot and average out low.

Slind14 · December 9, 2024, 10:48pm

After some time:

ilyam8 · December 10, 2024, 9:28am

> limited to idle resources and one core

Are those screenshots from the Parent or Child instance?

Slind14 · December 10, 2024, 9:54pm

These are on children without parents, so single-instance setups. We have decided to uninstall Netdata for now (we primarily use node_exporter for these 250 instances anyway). Unfortunately, we could not reliably constrain it and can’t put in the time for it now, and it poses too much risk.

Slind14 · December 12, 2024, 12:59am

Just wanted to let you know that on the test node where we kept it, v2.0.0-205-nightly is still consistently consuming 5-6 cores.

Costa_Tsaousis · December 14, 2024, 3:43am

All pending issues have been fixed. Can you please install the latest nightly? Is it working as it should after this?
