So I have progressed slightly. The issue is that an HAProxy thread gets stuck and HAProxy gets killed, exiting with code 134 (128 + 6, i.e. killed by SIGABRT). My guess (I think) is that it is using too much resource when loading netdata.
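Side note on the exit code: shell exit codes above 128 conventionally mean the process died from signal (code - 128), which is easy to verify:

$ kill -l $((134 - 128))
ABRT

Combined with the stuck=1 flags on every thread in the dump below, this looks like HAProxy's own watchdog deliberately aborting the process after detecting stuck threads, rather than the kernel killing it directly.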
The problem is I don’t see where this is coming from exactly.
We cannot rule out a bug in HAProxy, as this kind of stuck-thread issue has happened in the past, but I'm on a recent version (2.6.7) and am going to assume it is not a bug but the result of some configuration/behavior.
Looking at the limits of the HAProxy processes in the container:
/ $ cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1073741816           1073741816           files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30426                30426                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
We can see that they are quite high, so the limits on the HAProxy side are unlikely to be what's killing the process.
This is what the HAProxy crash looks like:
Thread 3 is about to kill the process.
>Thread 1 : id=0x7f4f8a398860 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/1 stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=869618788 now=2047053461 diff=1177434673
curr_task=0x7f4f893693e0 (task) calls=1 last=0
fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0f50
strm=0x7f4f894f0f50,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
txn=0x7f4f893694b0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
scf=0x7f4f89506aa0,EST,20 scb=0x7f4f89506b10,INI,21
af=0,0 sab=0,0
cof=0x7f4f89368e40,80000300:H1(0x7f4f89369110)/RAW(0)/tcpv4(23)
cob=0,0:NONE(0)/NONE(0)/NONE(-1)
>Thread 2 : id=0x7f4f894e5b30 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/2 stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=389881 now=2236720753 diff=2236330872
curr_task=0x7f4f893673c0 (task) calls=1 last=0
fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f1340
strm=0x7f4f894f1340,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
txn=0x7f4f893674b0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
scf=0x7f4f89508190,EST,20 scb=0x7f4f89508200,INI,21
af=0,0 sab=0,0
cof=0x7f4f89368f30,80000300:H1(0x7f4f893670f0)/RAW(0)/tcpv4(24)
cob=0,0:NONE(0)/NONE(0)/NONE(-1)
*>Thread 3 : id=0x7f4f894c0b30 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/3 stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=431501 now=2002350879 diff=2001919378
curr_task=0x7f4f893686c0 (task) calls=1 last=0
fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0770
strm=0x7f4f894f0770,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
txn=0x7f4f89368790,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
scf=0x7f4f89505ba0,EST,20 scb=0x7f4f89505c10,INI,21
af=0,0 sab=0,0
cof=0x7f4f893aa2e0,80000300:H1(0x7f4f89368120)/RAW(0)/tcpv4(20)
cob=0,0:NONE(0)/NONE(0)/NONE(-1)
>Thread 4 : id=0x7f4f8949bb30 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/4 stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=2597462 now=2133429953 diff=2130832491
curr_task=0x7f4f893aac50 (task) calls=1 last=0
fct=0x55b33bacaa00(process_stream) ctx=0x7f4f894f0380
strm=0x7f4f894f0380,c08 src=192.168.1.76 fe=http-in be=netdata dst=unknown
txn=0x7f4f893aad30,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
rqf=64d08002 rqa=60900 rpf=a0000000 rpa=20000000
scf=0x7f4f8990d490,EST,20 scb=0x7f4f89505890,INI,21
af=0,0 sab=0,0
cof=0x7f4f893aa3d0,80000300:H1(0x7f4f893aa010)/RAW(0)/tcpv4(21)
cob=0,0:NONE(0)/NONE(0)/NONE(-1)
[NOTICE] (1) : haproxy version is 2.6.7-c55bfdb
[ALERT] (1) : Current worker (8) exited with code 134 (Aborted)
[ALERT] (1) : exit-on-failure: killing every processes with SIGTERM
[WARNING] (1) : All workers exited. Exiting... (134)
It happens almost immediately after loading netdata, and it happens 100% of the time, so it is clear that netdata is doing something that causes HAProxy to fail. Thankfully, HAProxy restarts itself after crashing.
I tried adding an nproc limit to the netdata container, and changing nofile from 1024 to 2048 or 512, with no success.
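For reference, this is the kind of tweak I mean; a sketch using the plain Docker CLI with illustrative values (adapt to however your container is actually started):

$ docker run -d --name netdata \
      --ulimit nproc=1024 \
      --ulimit nofile=2048:2048 \
      netdata/netdata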
I tried adding nbthread 1 to the HAProxy config, and this seems to work, meaning HAProxy doesn't crash anymore, probably because you can't have thread lock issues with only one thread (taps head).
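For anyone wanting to try the same workaround, it's just this in the global section of haproxy.cfg (nbthread is a standard HAProxy global keyword):

global
    # limit HAProxy to a single worker thread instead of one per CPU
    # (with this in place the netdata-triggered crash no longer occurs here)
    nbthread 1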
But the graphs still use a lot of CPU during load, take a long time to load, and become unavailable pretty quickly.
So IMHO this is really a netdata issue: way too much CPU is used for some reason when the graphs are loaded, whether through Netdata Cloud or directly.
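If anyone wants to see the CPU spike for themselves, watching the container while the dashboard loads shows it clearly; assuming Docker, something like:

$ # stream per-container CPU/memory while the netdata page is loading
$ docker stats netdata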
I have reverted the HAProxy config to 4 threads, as otherwise users would likely see a performance impact.
I will probably bring this to the attention of the HAProxy maintainers at some point, as it is a very reliably reproducible crash and probably shouldn't happen. But at the moment there is still uncertainty as to what exactly is causing the threads to lock and HAProxy to crash.
Just wanted to share here in case someone has an idea of what to look for, or why this happens only with netdata, simply from loading the netdata dashboard page.