Current issues under Rocky 9 and docker

Hello,

I’ve been using netdata for a few years under CentOS 7 with docker. It was fine. The cloud stuff is super nice, congrats.

When moving to Rocky Linux 9, I noticed several issues, on both versions 1.36 and 1.37.1:

1/ when launching the container, it spends about 10 minutes at max CPU, and I have no idea what it is doing for so long
2/ the host is rarely reachable in the dashboard: it may appear offline, or appear live but with charts that won’t load. Notifications and such still work fine, but the charts cannot be loaded from the cloud app.
3/ after that initial CPU surge, it sits at around 10% CPU usage all the time, which sounds like a lot to me. I know it’s nice’d, but it seems to be an increase over previous versions, and I think it’s important to stay true to your commitment of not using too many resources.

The hosts are servers with 8 GB RAM and 4 CPUs, each running a dozen containers behind a load balancer. The netdata service is launched with docker, with env vars for the token, URL and rooms, cap_add SYS_PTRACE, security_opt apparmor:unconfined, and a few read-only volumes for os-release, proc, sys, and the docker socket. Nothing fancy, default netdata config.
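For reference, the compose service definition is roughly the following (a sketch from memory; the claim token and room ID are placeholders, and the port and image follow the standard netdata Docker example):

services:
  netdata:
    image: netdata/netdata
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_CLAIM_TOKEN=<token>
      - NETDATA_CLAIM_URL=https://app.netdata.cloud
      - NETDATA_CLAIM_ROOMS=<room-id>
    volumes:
      - /etc/os-release:/host/etc/os-release:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro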

Please help me figure out what is wrong so I can browse the collected data from the cloud!

Loading the dashboard directly is super slow, with the CPU hitting 100% for a few seconds, and then it doesn’t load fully. This was not my experience before, so I’m guessing there is something wrong somewhere!

This is me loading netdata (screenshot of CPU usage while the dashboard loads):

It is definitely not normal that the CPU gets to 100% for many minutes and that the interface isn’t responsive. What could be wrong?

Hey, @Smap. It looks similar to “[Bug]: Netdata in docker hangs/high CPU load on Fedora · Issue #14062 · netdata/netdata · GitHub” (the same symptoms). Check if the “Prerequisite steps for Fedora users” will fix the issue for you.


Hey @ilyam8. Thank you very much for pointing me in the right direction. I added the ulimits in the docker-compose.yml file (with 1024 as the value), and after deleting the container and launching it again, it works: no more hogging the CPU, and the charts display fine.
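For anyone else hitting this, the change is just an ulimits block on the netdata service, something like this (the rest of the service definition is unchanged):

services:
  netdata:
    # ... image, env vars, caps and volumes as before ...
    ulimits:
      nofile:
        soft: 1024
        hard: 1024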

I’m glad this issue could be solved that fast; I will now happily continue to use netdata and recommend it to others! Will probably start paying for it too at some point :wink:

Best,
~Smap


Note that it seems there are still some issues. The charts are not always available (the node appears OFF), and when I do manage to load the charts, they crash quickly. Only the nodes on Rocky have this issue.

So I believe the CPU hogging is solved, but the charts are still unstable.

I found this in the logs:

2022-12-23 12:28:12: apps.plugin LOG FLOOD PROTECTION resuming logging from process 'apps.plugin' (prevented 7608 logs in the last 3600 seconds).
2022-12-23 12:28:12: apps.plugin ERROR : MAIN : Cannot process /host/proc/659873/io (command 'runc') (errno 13, Permission denied)
2022-12-23 12:28:35: apps.plugin ERROR : MAIN : Cannot process /host/proc/661286/io (command 'containerd-shim') (errno 13, Permission denied)

I’m guessing this is yet another Docker quirk. Do you have an idea of what is wrong?

After a while, it works again (appears on the cloud dashboard).

No, those logs are not relevant. I have no problem on Fedora 37 if the open fd limit is set. Will check on Rocky next week.

Thanks Ilya. Happy holidays!


Just started a docker container on Rocky 9, no issues so far.

> The charts are not always available (node appears OFF).

I guess you are referring to the Cloud UI? Have you checked the local dashboard?

When accessing the dashboard directly, it takes a long time to load, with high CPU load, and when I scroll down I reach a point where the graphs no longer load. This is my CPU (4 cores, 8 GB RAM) when loading the dashboard (screenshot):

disk (screenshot):

Screenshots were taken from the Cloud UI. Shortly after, the message “Something went wrong” appears on all graphs as the node goes offline and disappears from the list of active nodes.

@ilyam8 still no issues on your side?

No. Can you send me a snapshot to ilya@netdata.cloud?

Sent! Thank you very much.

@Smap I checked the snapshot, and it is HAProxy that uses almost all the CPU time in your screenshot. You can check it yourself in the “Applications” section, or in the “haproxy” container.

I have no idea why HAProxy uses so much CPU on your VM; I suggest you debug it.

Not sure why I don’t receive notifications from this thread :confused:

Thanks for looking into it, I’ll report here if I find something with haproxy.

So I have progressed slightly. The issue is that an HAProxy thread gets stuck and HAProxy’s watchdog kills the process, which exits with code 134 (SIGABRT). My guess is that it is using too many resources when the netdata page is loaded.

The problem is I don’t see where this is coming from exactly.

We cannot rule out a bug in HAProxy, as this kind of stuck-thread issue has happened in the past, but I’m on a recent version (2.6.7), so I’m going to assume it is not a bug but the result of some configuration or behavior.

Looking at the limits of the HAProxy processes in the container:

/ $ cat /proc/1/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1073741816           1073741816           files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       30426                30426                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us  

We can see that some of them are quite high, notably Max open files at 1073741816.

This is what the HAProxy crash looks like:

Thread 3 is about to kill the process.
 >Thread 1 : id=0x7f4f8a398860 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
      1/1    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=869618788 now=2047053461 diff=1177434673
             curr_task=0x7f4f893693e0 (task) calls=1 last=0
               fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0f50
             strm=0x7f4f894f0f50,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
             txn=0x7f4f893694b0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
             scf=0x7f4f89506aa0,EST,20 scb=0x7f4f89506b10,INI,21
             af=0,0 sab=0,0
             cof=0x7f4f89368e40,80000300:H1(0x7f4f89369110)/RAW(0)/tcpv4(23)
             cob=0,0:NONE(0)/NONE(0)/NONE(-1)

 >Thread 2 : id=0x7f4f894e5b30 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/2    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=389881 now=2236720753 diff=2236330872
             curr_task=0x7f4f893673c0 (task) calls=1 last=0
               fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f1340
             strm=0x7f4f894f1340,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
             txn=0x7f4f893674b0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
             scf=0x7f4f89508190,EST,20 scb=0x7f4f89508200,INI,21
             af=0,0 sab=0,0
             cof=0x7f4f89368f30,80000300:H1(0x7f4f893670f0)/RAW(0)/tcpv4(24)
             cob=0,0:NONE(0)/NONE(0)/NONE(-1)

*>Thread 3 : id=0x7f4f894c0b30 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/3    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=431501 now=2002350879 diff=2001919378
             curr_task=0x7f4f893686c0 (task) calls=1 last=0
               fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0770
             strm=0x7f4f894f0770,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
             txn=0x7f4f89368790,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
             scf=0x7f4f89505ba0,EST,20 scb=0x7f4f89505c10,INI,21
             af=0,0 sab=0,0
             cof=0x7f4f893aa2e0,80000300:H1(0x7f4f89368120)/RAW(0)/tcpv4(20)
             cob=0,0:NONE(0)/NONE(0)/NONE(-1)

 >Thread 4 : id=0x7f4f8949bb30 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/4    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=2597462 now=2133429953 diff=2130832491
             curr_task=0x7f4f893aac50 (task) calls=1 last=0
               fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0380
             strm=0x7f4f894f0380,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
             txn=0x7f4f893aad30,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
             scf=0x7f4f8990d490,EST,20 scb=0x7f4f89505890,INI,21
             af=0,0 sab=0,0
             cof=0x7f4f893aa3d0,80000300:H1(0x7f4f893aa010)/RAW(0)/tcpv4(21)
             cob=0,0:NONE(0)/NONE(0)/NONE(-1)

[NOTICE]   (1) : haproxy version is 2.6.7-c55bfdb
[ALERT]    (1) : Current worker (8) exited with code 134 (Aborted)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM
[WARNING]  (1) : All workers exited. Exiting... (134)

It happens almost immediately after loading netdata, and happens 100% of the time, so it is clear that netdata is doing something that causes HAProxy to fail. Thankfully, HAProxy restarts itself after crashing.

I tried adding an nproc limit to the netdata container, and changing nofile from 1024 to 2048 or 512, with no success.

I tried adding nbthread 1 to the HAProxy config, and this seems to work: HAProxy doesn’t crash anymore, probably because you can’t have thread-locking issues with only one thread (taps head).
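For the record, that is a one-line change in the global section of haproxy.cfg (a sketch; the rest of the global section is omitted):

global
    # force a single worker thread as a workaround for the stuck-thread crash
    nbthread 1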

But the graphs still use a lot of CPU during load, take a long time to load, and become unavailable pretty quickly.

So IMHO this is really a netdata issue, with way too much CPU used for some reason when the graphs are loaded, either through the cloud or directly.

I have reverted the HAProxy config to 4 threads, as otherwise users would likely see a performance impact.

I will probably bring this to the attention of the HAProxy maintainers at some point, as the crash is quite reliably reproducible and probably shouldn’t happen. But at the moment there is still uncertainty as to what exactly is causing the threads to lock and HAProxy to crash.

Just wanted to share here in case someone has an idea of what to look for, or of why this only happens when simply loading the netdata page.

Finally found the cause of the high CPU usage in HAProxy: it was linked to the basic auth, which used a password hashed with 650,000 rounds. Removing the auth, or using a hash with 1,000 rounds, solves the issue.

For more information see: Thread stuck and crash with basic auth backend · Issue #1298 · haproxy/haproxy · GitHub
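If it helps anyone else, the workaround on my side is to generate the basic auth hash with a lower rounds count and use that in the userlist. A sketch, assuming mkpasswd from the whois package; the userlist name, user name and server line are examples, not my exact config:

# generate a SHA-512 crypt hash with 1000 rounds (prompts for the password)
mkpasswd -m sha-512 -R 1000

and in haproxy.cfg:

userlist admins
    # paste the $6$rounds=1000$... hash generated above
    user admin password $6$rounds=1000$<salt>$<hash>

backend netdata
    # ask for credentials unless the request authenticates against the userlist
    http-request auth realm netdata unless { http_auth(admins) }
    server netdata1 127.0.0.1:19999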

Glad you found it! And thanks for sharing your findings.
