Frequent crashes on netdata parent, high throughput / high child count

Problem/Question

I’ve got a netdata parent servicing 600+ children. The netdata process on the parent is frequently killed with status=6/ABRT:

Jun  3 01:09:57 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun  3 02:35:28 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun  3 07:06:20 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun  3 16:30:05 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT

The process restarts, but it takes quite a while for all the children to reconnect because I have “accept a streaming request every seconds = 1” set (without this setting the parent cannot handle the load as the clients attempt to reconnect; network tuning could probably help here, but that’s a different topic).
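For reference, the kind of network tuning I mean would be raising the listen/SYN backlogs so a reconnect storm doesn’t overflow the accept queue. A sketch only; the values are untested guesses, not something I’ve validated on this parent:

```
# /etc/sysctl.d/99-netdata-parent.conf  (sketch; values are guesses)
net.core.somaxconn = 4096             # larger accept backlog for the listener
net.ipv4.tcp_max_syn_backlog = 8192   # absorb a SYN burst when children reconnect
net.core.netdev_max_backlog = 16384   # packet backlog before the stack drops
```

Applied with `sysctl --system` after editing.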
Looking through the graphs, the parent doesn’t appear to use additional resources prior to failure: CPU, memory, and IRQ are all typically steady. The OOM killer is not being invoked, and no other kernel or related messages are logged in /var/log/messages at the time of failure.

error.log from the time of failure has various messages:

2022-06-02 23:32:18: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-02 23:42:49: go.d INFO: main[main] received terminated signal (15). Terminating...
...
2022-06-03 01:03:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 2 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1198 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-03 01:09:57: go.d INFO: main[main] received terminated signal (15). Terminating...
2022-06-03 01:10:27: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
...
2022-06-03 06:50:12: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
corrupted double-linked list
2022-06-03 07:06:50: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
...
2022-06-03 16:27:49: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-03 16:30:05: go.d INFO: main[main] received terminated signal (15). Terminating...
2022-06-03 16:30:35: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'

Relevant docs you followed/actions you took to solve the issue

I’ve followed the official docs to try to set up the parent to handle the load of this many children and to export to Graphite, though I wouldn’t be surprised if I’ve missed something along the way. I’m not sure dbengine is the best memory mode for a parent of this size; map might be a better option, as discussed here: Database | Learn Netdata. This server is dedicated to netdata, so I’m willing to make further optimizations if it lets me support this many children (or potentially twice as many).
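As a sanity check on the dbengine sizing, here is a back-of-the-envelope retention estimate. This is a sketch: the ~1 byte per sample on-disk figure is my reading of the dbengine docs’ compression rule of thumb, and the per-child metric count is assumed, not measured:

```shell
# Rough dbengine retention estimate for "dbengine multihost disk space = 90000".
# Assumes ~1 byte per sample on disk after compression (rule of thumb),
# and an assumed average metric count per child.
disk_mb=90000
children=600
metrics_per_child=2000   # assumption, not measured
update_every=1           # seconds per sample

samples_per_sec=$((children * metrics_per_child / update_every))
retention_sec=$((disk_mb * 1024 * 1024 / samples_per_sec))
echo "~$((retention_sec / 3600)) hours of retention"
```

With those assumed numbers this works out to roughly a day of retention, which is why I’m wondering whether dbengine is the right fit at this scale.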

Environment/Browser/Agent’s version etc

Amazon Linux 2 (RHEL like, systemd), r5a.2xlarge 8cpu 64gb memory, 100gb EBS gp3 volume (3000 IOPS).
parent: v1.34.1 rpm from packagecloud netdata el/7 repo
children run a number of different versions, from 1.22 to 1.34.

netdata.conf of parent:

[global]
        memory mode = dbengine
        memory deduplication (ksm) = yes
        process scheduling policy = keep
        cleanup obsolete charts after seconds = 3600
        cleanup orphan hosts after seconds = 3600
        delete obsolete charts files = yes
        delete orphan hosts files = yes
        dbengine multihost disk space = 90000

[web]
        bind to = 0.0.0.0:19999 unix:/var/run/netdata/netdata.sock
        accept a streaming request every seconds = 1
        allow dashboard from = localhost
        allow streaming from = redacted
        allow netdata.conf from = localhost

[statsd]
        enabled = no

[registry]
        enabled = no

[health]
        enabled = no

[ml]
        enabled = no

stream.conf of parent:

[stream]
  enabled = no

[redacted key]
  enabled = yes
  default history = 3600
  memory mode = dbengine
  health enabled by default = no
  health enabled = no
  allow from =

exporting.conf of parent:

[exporting:global]
    enabled = yes
    send configured labels = no
    send automatic labels = no
    update every = 10


[graphite:netdata-graphite]
    enabled = yes
    destination = graphite.example.com
    data source = average
    prefix = netdata
    hostname = netdata
    update every = 10
    buffer on failures = 10
    timeout ms = 20000
    send charts matching = apps.cpu apps.mem apps.swap !apps !apps.* cgroup_* !cpu.* !disk_avgsz.* !disk_backlog.* !disk_iotime.* disk_ops.* !disk_space.reserved_for_root disk_space.* disk_util.* !ip.* !ipv4.* !ipv6.* !net.veth* net.* !netdata.* !services.* !users !users* !users.* disk.* !disk_await.* !disk_inodes.* !disk_mops.* !disk_qops.* !disk_svctm.* !groups !groups* !groups.* mem.* !net_packets.* !netfilter system.cpu system.load system.io system.ram system.swap system.net system.uptime !system.* !*
    send hosts matching = localhost *
    send names instead of ids = yes
    send configured labels = no
    send automatic labels = no

systemd override file:

[Service]
LimitNOFILE=200000
CPUSchedulingPolicy=other
Nice=-1

netdata -W buildinfo:

netdata -W buildinfo
Version: netdata v1.34.1
Configure options:  '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-libJudy' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
    Binary architecture: x86_64
    Packaging distro:
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (bundled)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

What I expected to happen

Parent netdata continues to run and the process doesn’t exit abnormally.

Hello @n0coast,

First of all thank you for taking the time to explain in depth what you have done.

Is it an option to update the parent node to the latest nightly build?

Tasos.

Hello, I think I have the same problem, but my setup is a load balancer in front of 6 central netdata parents (around 200 children on each).

More information here:

I could update to the nightly build, though of course I would prefer not to run that in production long term. If I switch to the nightly, are there any other changes I should make?

Not at all; you will use the same configuration (/etc/netdata). Just make sure to back it up in advance.

I managed to get two different coredumps (on 1.34.1):

PS: I have installed the 1.34.0-279 nightly on one server, and so far I have not seen any crashes on it.

Hi guys, we’re checking the dumps from @delaf (many many thanks). We’re trying to reproduce the problem.

I’ve updated to netdata v1.34.0-285-g5902130c6 and the problem is now much worse: netdata will not stay online for more than 15 minutes.

/var/log/messages from time of crash:

Jun  7 16:20:43 localhost kernel: STREAM_RECEIVER[25859]: segfault at ffffffffffffffff ip 0000000000636602 sp 00007fbdb0242930 error 7 in netdata[401000+722000]
Jun  7 16:26:58 localhost NET: dhclient: Locked /run/dhclient/resolv.lock
Jun  7 16:28:52 localhost kernel: STREAM_RECEIVER[27174]: segfault at 20 ip 000000000078320d sp 00007fc21b251840 error 4 in netdata[401000+722000]
Jun  7 16:28:52 localhost kernel: STREAM_RECEIVER[27179]: segfault at 20 ip 000000000078320d sp 00007fc2197af810 error 4
Jun  7 16:28:52 localhost kernel: STREAM_RECEIVER[27171]: segfault at 20 ip 000000000078320d sp 00007fc21cbcc810 error 4 in netdata[401000+722000]
Jun  7 16:28:52 localhost kernel: STREAM_RECEIVER[27170]: segfault at 20 ip 000000000078320d sp 00007fc21d27f840 error 4 in netdata[401000+722000]
Jun  7 16:28:52 localhost kernel: in netdata[401000+722000]
Jun  7 16:28:56 localhost kernel: EBPF SOCKET REA[26680]: segfault at 10 ip 000000000071be5d sp 00007f36709d59d8 error 4 in ebpf.plugin[401000+333000]

/opt/netdata/var/log/netdata from same time:

EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-07 16:28:53: apps.plugin FATAL : apps.plugin : Cannot write to a pipe # : Invalid argument

2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF PROCESS : heartbeat randomness of 477000 is too big for a tick of 200000 - setting it to 78000
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF SOCKET : heartbeat randomness of 384000 is too big for a tick of 2000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF SYNC : heartbeat randomness of 291000 is too big for a tick of 2000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF OOMKILL : thread with task id 26646 finished
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF SHM : heartbeat randomness of 411000 is too big for a tick of 2000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF FD : heartbeat randomness of 339000 is too big for a tick of 2000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF VFS : heartbeat randomness of 321000 is too big for a tick of 50000 - setting it to 21000
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF SOFTIRQ : heartbeat randomness of 261000 is too big for a tick of 1000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF HARDIRQ : heartbeat randomness of 262000 is too big for a tick of 1000 - setting it to 0
2022-06-07 16:28:54: ebpf.plugin INFO  : EBPF MOUNT : thread with task id 26642 finished
2022-06-07 16:28:56: ebpf.plugin INFO  : SYNC KERNEL : thread with task id 26685 finished
2022-06-07 16:28:56: ebpf.plugin INFO  : EBPF SOCKET READ : thread with task id 26680 finished
2022-06-07 16:28:56: ebpf.plugin ERROR : EBPF FD : heartbeat clock: woke up 9051 microseconds later than expected (can be due to system load or the CLOCK_REALTIME set to the future).
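As an aside, the kernel segfault lines above carry enough information to locate the faulting code: `ip` is the instruction pointer, and `netdata[401000+722000]` is the mapping base and size of the binary. A sketch of decoding one (the addr2line path reflects my /opt/netdata install and is an assumption; adjust to yours):

```shell
# Kernel line: "segfault at 20 ip 000000000078320d ... in netdata[401000+722000]"
# ip = faulting instruction pointer, 401000 = mapping base, 722000 = mapping size.
ip=0x78320d
base=0x401000
printf 'offset into the netdata image: 0x%x\n' $((ip - base))

# With debug symbols present, the raw ip (non-PIE build) or the offset (PIE build)
# resolves to a source line; the binary path is an assumption for this install:
#   addr2line -f -e /opt/netdata/usr/sbin/netdata 0x78320d
```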

I compiled netdata v1.34.0-285-nightly with debugging enabled, then ran it under valgrind until it crashed: 0bin - encrypted pastebin

Thanks @n0coast, got the trace.

@n0coast, can you please share some of the error.log from before the EOF found in spawn pipe. part? Thanks!

Fixing this is top priority for the entire team. It’s difficult to reproduce, so prompt replies to what @Manolis_Vasilakis is requesting would be extremely useful (e.g. @n0coast or anyone else who has the same issue).

Here is the most recent batch of messages surrounding the EOF found in spawn pipe. log:

2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host31,[xx.x.216.78]:34102] : Cleaning up database of host 'ecs-ip-xx-xxx-3-45'...
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host31,[xx.x.216.78]:34102] : Closing SQLite database
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : SYSTEM_INFO: free 0xb5bad470
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : STREAM host-ip-xx-xxx-193-40 [receive from [xx.xxx.193.40]:54146]: client willing to stream metrics for host 'host-ip-xx-xxx-193-40' with machine_guid '8a5edb72-e581-11ec-9846-0adbeae3bf0f': update every = 1, history = 3996, memory mode = dbengine, health disabled, tags ''
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : STREAM host-ip-xx-xxx-193-40 [receive from [xx.xxx.193.40]:54146]: initializing communication...
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : STREAM host-ip-xx-xxx-193-40 [receive from [xx.xxx.193.40]:54146]: Netdata is using the stream version 4.
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : STREAM host-ip-xx-xxx-193-40 [receive from [xx.xxx.193.40]:54146]: receiving metrics...
2022-06-09 12:48:41: netdata ERROR : STREAM_RECEIVER[host-ip-xx-xxx-193-40,[xx.xxx.193.40]:54146] : Ignoring malformed or empty LABEL command.
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host31,[xx.x.216.78]:34102] : EXIT: removing netdata PID file '/var/run/netdata/netdata.pid'...
2022-06-09 12:48:41: netdata INFO  : STREAM_RECEIVER[host31,[xx.x.216.78]:34102] : EXIT: all done - netdata is now exiting - bye bye...
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-09 12:48:43: go.d INFO: main[main] received terminated signal (15). Terminating...
2022-06-09 12:48:44: ebpf.plugin LOG FLOOD PROTECTION resuming logging from process 'ebpf.plugin' (prevented 4 logs in the last 3600 seconds.
2022-06-09 12:48:44: ebpf.plugin INFO  : SOFTIRQ KERNEL : thread with task id 31925 finished
2022-06-09 12:48:44: ebpf.plugin INFO  : HARDIRQ KERNEL : thread with task id 31929 finished
2022-06-09 12:51:46: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
2022-06-09 12:51:46: netdata INFO  : MAIN : TIMEZONE: fixed as 'UTC'
2022-06-09 12:51:46: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
2022-06-09 12:51:46: netdata INFO  : MAIN : resources control: allowed file descriptors: soft = 200000, max = 200000

Sanitizing these logs does take a little time; if you’d like many more entries, we’ll need to set up a way I can send them confidentially.

This is on the same EC2 instance I described in the original post, but running netdata v1.34.0-285-nightly that was compiled with debug options.

I’ll try and respond as quickly as possible, and should be available weekday daytime hours PDT.

I’m continuing to run v1.34.0-285-nightly compiled with debug options, and I’ve enabled all the systemd and OS settings to generate a core dump when netdata crashes, but netdata has now been stable for the last 5 days! Metrics collection and exporting look great; the system is busy, running an average CPU utilization of 70%, but it’s stable!
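For reference, the core dump settings amount to something like this (standard systemd/kernel knobs; the drop-in path is the usual convention, and the core_pattern value is just what I chose):

```
# /etc/systemd/system/netdata.service.d/coredump.conf
[Service]
LimitCORE=infinity

# plus, in /etc/sysctl.d/ (or let systemd-coredump handle it where available):
# kernel.core_pattern = /var/tmp/core.%e.%p
```

Followed by `systemctl daemon-reload` and a restart of the netdata service.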