Problem/Question
I have a netdata parent servicing 600+ children. The netdata process on the parent is frequently killed with status=6/ABRT:
Jun 3 01:09:57 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun 3 02:35:28 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun 3 07:06:20 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
Jun 3 16:30:05 ip-nnn systemd: netdata.service: main process exited, code=killed, status=6/ABRT
The process restarts, but it takes quite a while for all the clients to reconnect because I have “accept a streaming request every seconds = 1” set. Without this setting the netdata parent cannot handle the load as the clients attempt to reconnect; network tuning could probably help here, but that’s a different topic.
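For a rough sense of the reconnect time, here is a back-of-the-envelope sketch. It assumes the parent accepts exactly one streaming request per interval and that every child retries promptly; both are simplifying assumptions, not measured behaviour, and the function name is made up for illustration.

```python
def min_reconnect_seconds(children: int, accept_every_seconds: float) -> float:
    """Lower bound on the time until all children are accepted again,
    assuming one streaming request is accepted per interval."""
    return children * accept_every_seconds


if __name__ == "__main__":
    secs = min_reconnect_seconds(600, 1)
    print(f"at least {secs / 60:.0f} minutes for 600 children")
```

Under those assumptions, each restart costs at least ten minutes of partial coverage, and doubling the number of children would double that.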
Looking through the graphs, the parent doesn’t seem to use additional resources prior to failure: CPU, memory, and IRQs are typically all steady. The OOM killer is not being invoked, and no other kernel or related messages are logged in /var/log/messages at the time of failure.
error.log from the times of failure contains various messages:
2022-06-02 23:32:18: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-02 23:42:49: go.d INFO: main[main] received terminated signal (15). Terminating...
...
2022-06-03 01:03:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 2 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1198 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-03 01:09:57: go.d INFO: main[main] received terminated signal (15). Terminating...
2022-06-03 01:10:27: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
...
2022-06-03 06:50:12: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
corrupted double-linked list
2022-06-03 07:06:50: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
...
2022-06-03 16:27:49: netdata LOG FLOOD PROTECTION too many logs (201 logs in 1 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1199 seconds.
free(): invalid pointer
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2022-06-03 16:30:05: go.d INFO: main[main] received terminated signal (15). Terminating...
2022-06-03 16:30:35: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
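The glibc abort messages above ("free(): invalid pointer", "corrupted double-linked list") are written without a timestamp, so it helps to bracket each one with the nearest timestamped lines. Below is a minimal sketch of that, assuming the error.log format shown in the excerpts; the function and constant names are hypothetical. Because log flood protection was active each time, other messages just before the abort may have been suppressed.

```python
import re
from typing import Iterable, List, Tuple

# glibc abort messages seen in the excerpts above
HEAP_ERRORS = ("free(): invalid pointer", "corrupted double-linked list")
# error.log lines start with "YYYY-MM-DD HH:MM:SS: "
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): ")


def find_heap_errors(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (previous timestamp, error, next timestamp) for each heap error.

    An empty string means no timestamped line was found on that side.
    """
    results: List[Tuple[str, str, str]] = []
    pending: List[int] = []  # results still waiting for a following timestamp
    last_ts = ""
    for raw in lines:
        line = raw.strip()
        m = TIMESTAMP.match(line)
        if m:
            last_ts = m.group(1)
            for i in pending:
                prev, err, _ = results[i]
                results[i] = (prev, err, last_ts)
            pending.clear()
        elif line in HEAP_ERRORS:
            results.append((last_ts, line, ""))
            pending.append(len(results) - 1)
    return results


if __name__ == "__main__":
    sample = [
        "2022-06-03 06:50:12: netdata LOG FLOOD PROTECTION too many logs",
        "corrupted double-linked list",
        "2022-06-03 07:06:50: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'",
    ]
    for prev, err, nxt in find_heap_errors(sample):
        print(f"{err!r} between {prev} and {nxt or 'end of input'}")
```

To run it against a real log, pass an open file object instead of the sample list.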
Relevant docs you followed/actions you took to solve the issue
I’ve followed the official docs to try to set up the parent to handle the load of this many children and exporting to graphite, though I wouldn’t be surprised if I’ve missed something along the way. I’m not sure whether dbengine is the best memory mode for a parent of this size; possibly map would be a better option, as discussed here: Database | Learn Netdata. This server is dedicated to netdata, so I’m willing to make optimizations if that lets me support this many children (or potentially twice this many).
Environment/Browser/Agent’s version etc
Amazon Linux 2 (RHEL-like, systemd), r5a.2xlarge (8 vCPUs, 64 GB memory), 100 GB EBS gp3 volume (3000 IOPS).
parent: v1.34.1 rpm from packagecloud netdata el/7 repo
children: a number of different versions, from 1.22 to 1.34.
netdata.conf of parent:
[global]
memory mode = dbengine
memory deduplication (ksm) = yes
process scheduling policy = keep
cleanup obsolete charts after seconds = 3600
cleanup orphan hosts after seconds = 3600
delete obsolete charts files = yes
delete orphan hosts files = yes
dbengine multihost disk space = 90000
[web]
bind to = 0.0.0.0:19999 unix:/var/run/netdata/netdata.sock
accept a streaming request every seconds = 1
allow dashboard from = localhost
allow streaming from = redacted
allow netdata.conf from = localhost
[statsd]
enabled = no
[registry]
enabled = no
[health]
enabled = no
[ml]
enabled = no
stream.conf of parent:
[stream]
enabled = no
[redacted key]
enabled = yes
default history = 3600
memory mode = dbengine
health enabled by default = no
health enabled = no
allow from =
exporting.conf of parent:
[exporting:global]
enabled = yes
send configured labels = no
send automatic labels = no
update every = 10
[graphite:netdata-graphite]
enabled = yes
destination = graphite.example.com
data source = average
prefix = netdata
hostname = netdata
update every = 10
buffer on failures = 10
timeout ms = 20000
send charts matching = apps.cpu apps.mem apps.swap !apps !apps.* cgroup_* !cpu.* !disk_avgsz.* !disk_backlog.* !disk_iotime.* disk_ops.* !disk_space.reserved_for_root disk_space.* disk_util.* !ip.* !ipv4.* !ipv6.* !net.veth* net.* !netdata.* !services.* !users !users* !users.* disk.* !disk_await.* !disk_inodes.* !disk_mops.* !disk_qops.* !disk_svctm.* !groups !groups* !groups.* mem.* !net_packets.* !netfilter system.cpu system.load system.io system.ram system.swap system.net system.uptime !system.* !*
send hosts matching = localhost *
send names instead of ids = yes
send configured labels = no
send automatic labels = no
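The long "send charts matching" expression depends on token order. As I understand netdata simple patterns, tokens are space-separated, "*" is the only wildcard, a leading "!" makes a token negative, and the first matching token decides the result. Here is a rough sketch of that evaluation to sanity-check the filter; it is my own approximation with hypothetical function names, not netdata's implementation.

```python
import re
from typing import List, Tuple


def compile_simple_pattern(expr: str) -> List[Tuple[bool, "re.Pattern"]]:
    """Split a space-separated pattern list into (is_negative, regex) pairs."""
    compiled = []
    for token in expr.split():
        negative = token.startswith("!")
        if negative:
            token = token[1:]
        # Only '*' acts as a wildcard; every other character is literal.
        regex = ".*".join(re.escape(part) for part in token.split("*"))
        compiled.append((negative, re.compile(regex)))
    return compiled


def matches(compiled: List[Tuple[bool, "re.Pattern"]], name: str) -> bool:
    """First matching token wins; no match at all means not sent."""
    for negative, regex in compiled:
        if regex.fullmatch(name):
            return not negative
    return False


if __name__ == "__main__":
    # Shortened excerpt of the expression above, same relative order.
    expr = "apps.cpu apps.mem apps.swap !apps !apps.* !net.veth* net.* system.cpu !system.* !*"
    pattern = compile_simple_pattern(expr)
    for chart in ("apps.cpu", "apps.vmem", "net.eth0", "net.veth123", "system.idlejitter"):
        print(chart, matches(pattern, chart))
```

If that reading of the rules is right, the specific positives (apps.cpu, system.cpu) have to stay ahead of their broader negatives (!apps.*, !system.*), which is how the expression is ordered.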
systemd override file:
[Service]
LimitNOFILE=200000
CPUSchedulingPolicy=other
Nice=-1
netdata -W buildinfo:
netdata -W buildinfo
Version: netdata v1.34.1
Configure options: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-libJudy' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
Binary architecture: x86_64
Packaging distro:
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: NO
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: NO
Libraries:
protobuf: YES (bundled)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: YES
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
What I expected to happen
The parent netdata continues to run and the process doesn’t exit abnormally.