Background
We run Netdata on around 800 hosts. Until recently they all reported to a single parent running 1.35.1; we stayed on that version because newer versions have not been stable for us. Recently, however, 1.35.1 began crashing, and we were seeing the following error in our logs:
2023-05-09 19:07:42: netdata FATAL : STREAM_RECEIVER[...]:59739] : Cannot allocate dbengine page cache page, with mmap() # : Cannot allocate memory
We found an issue suggesting that setting page cache uses malloc = yes in the config might resolve this, but we saw no improvement. After the config change the failure was a malloc error instead of an mmap error.
We are pumping a little over 2 million metrics per second, so we decided to split into 2 parents to distribute the load. On the new parent we also upgraded to 1.38.1. It starts up faster, but we still see crashes every 2 hours or so on both parents.
The upgrade to 1.38 on the new parent was an attempt to improve the situation. Crashes still occur, and in addition, after ~20 minutes the CPU rises to 70% to 90% or higher, along with the del and acquire operations in the DBengine stats Cache Operations chart occurring more frequently. On 1.37 we were only seeing about 20 to 30% CPU usage on the same hardware (instance size: 16 vCPU, 32 GB RAM, 100 GB volume with 12000 IOPS).
[Screenshot: 1.38.1 typical CPU usage]
[Screenshot: 1.38.1 cache purges (purges start after running for 20 minutes)]
Since the CPU runs so hot on 1.38.1, we are running the other parent on 1.37.1, which sits at around 30% CPU all the time but still crashes on a regular basis.
-- logs of crash in 1.37 --
corrupted double-linked list
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
-- logs of crash in 1.38 --
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 340 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 860 seconds.
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: next: initialize ML
[Screenshot: consistent CPU usage on 1.37.1]
Help would be appreciated with the frequent crashes, the increased CPU usage, and tuning the parents.
Following are metrics and stats from the parents, their configuration, and logs of the crashes from systemctl and error.log.
dbengine_stats of 1.38.1 parent
"default_granularity_secs":1,
"sizeof_datafile":288,
"sizeof_page_in_cache":0,
"sizeof_point_data":4,
"sizeof_page_data":4096,
"pages_per_extent":64,
"datafiles":41,
"extents":17654,
"extents_pages":1129856,
"points":1124989481,
"metrics":1129690,
"metrics_pages":1129856,
"extents_compressed_bytes":782566642,
"pages_uncompressed_bytes":4499957924,
"pages_duration_secs":1144164696,
"single_point_pages":128,
"first_t":1684175700,
"last_t":5940368799,
"database_retention_secs":4256193099,
"average_compression_savings":82.61,
"average_point_duration_secs":1.02,
"average_metric_retention_secs":1012.81,
"ephemeral_metrics_per_day_percent":inf,
"average_page_size_bytes":3982.77,
"estimated_concurrently_collected_metrics":0,
"currently_collected_metrics":593784,
"disk_space":1012927092,
"max_disk_space":1048576000
dbengine_stats of the 1.37.1 parent
"tier0": {
"default_granularity_secs":1,
"sizeof_metric":144,
"sizeof_metric_in_index":40,
"sizeof_page":64,
"sizeof_page_in_index":24,
"sizeof_extent":32,
"sizeof_page_in_extent":8,
"sizeof_datafile":96,
"sizeof_page_in_cache":144,
"sizeof_point_data":4,
"sizeof_page_data":4096,
"pages_per_extent":64,
"datafiles":4,
"extents":107436,
"extents_pages":6859484,
"points":7013132213,
"metrics":1472901,
"metrics_pages":8323111,
"extents_compressed_bytes":3944878800,
"pages_uncompressed_bytes":28052528852,
"pages_duration_secs":7041223291,
"single_point_pages":906,
"first_t":1684161305,
"last_t":1684187872,
"database_retention_secs":26567,
"average_compression_savings":85.94,
"average_point_duration_secs":1.00,
"average_metric_retention_secs":4780.51,
"ephemeral_metrics_per_day_percent":1482.12,
"average_page_size_bytes":4089.60,
"estimated_concurrently_collected_metrics":265036,
"currently_collected_metrics":1463627,
"max_concurrently_collected_metrics":1463627,
"disk_space":4591628288,
"max_disk_space":52428800000
1.38.1 build info
Version: netdata v1.38.1
Configure options: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--disable-dependency-tracking' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
Binary architecture: x86_64
Packaging distro:
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK: YES
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: NO
Libraries:
protobuf: YES (bundled)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: YES
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
Debug/Developer Features:
Trace Allocations: NO
1.37.1 build info
Version: netdata v1.37.1
Configure options: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--disable-dependency-tracking' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
Binary architecture: x86_64
Packaging distro:
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK: YES
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: NO
Libraries:
protobuf: YES (bundled)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: YES
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
Debug/Developer Features:
Trace Allocations: NO
Configuration
[global]
process scheduling policy = keep
update every = 1
[db]
mode = dbengine
storage tiers = 1
dbengine multihost disk space MB = 1000
dbengine page cache size MB = 32
cleanup obsolete charts after secs = 3600
cleanup orphan hosts after secs = 3600
delete obsolete charts files = yes
delete orphan hosts files = yes
memory deduplication (ksm) = yes
update every = 1
[plugins]
enable running new plugins = no
[web]
bind to = 0.0.0.0:19999 unix:/var/run/netdata/netdata.sock
#accept a streaming request every seconds = 1
allow dashboard from = localhost
allow streaming from = [removed ip list]
allow netdata.conf from = localhost
[statsd]
enabled = no
[registry]
enabled = no
[health]
enabled = no
[ml]
enabled = no
[exporting:global]
enabled = yes
send configured labels = no
send automatic labels = no
update every = 10
[graphite:netdata-graphite]
enabled = yes
destination = netdata-graphite.example.com "changed for privacy"
data source = average
prefix = netdata
hostname = nonprod-netdata
update every = 10
buffer on failures = 10
timeout ms = 20000
send charts matching = apps.cpu apps.mem apps.swap !apps !apps.* !cgroup_*newrelic* cgroup_* !cpu.* !disk_avgsz.* !disk_backlog.* !disk_iotime.* disk_ops.* !disk_space.reserved_for_root disk_space.* disk_util.* !ip.* !ipv4.* !ipv6.* !net.veth* net.* !netdata.* !services.* !users !users* !users.* disk.* !disk_await.* !disk_inodes.* !disk_mops.* !disk_qops.* !disk_svctm.* !groups !groups* !groups.* mem.* !net_packets.* !netfilter system.cpu system.load system.io system.ram system.swap system.net system.uptime !system.* !*
send hosts matching = localhost *
send names instead of ids = yes
send configured labels = no
send automatic labels = no
Currently we have "accept a streaming request every seconds" turned off (commented out above), just to help with recovering from the frequent crashes.
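To put the disk space setting in context, here is a rough back-of-the-envelope sketch in Python. It takes the 1.38.1 dbengine_stats values from this post and estimates how long 1000 MB can hold data at the current ingestion rate. It assumes one point per collected metric per second, that the compressed bytes per point stay constant, and that all of max_disk_space is available for extents; none of this is taken from the Netdata source, so treat the result as an order-of-magnitude figure only.

```python
# Values copied from the configuration and the 1.38.1 dbengine_stats in this post.
DISK_SPACE_MB = 1000                  # "dbengine multihost disk space MB"
MAX_DISK_SPACE_BYTES = 1048576000     # "max_disk_space"
COMPRESSED_BYTES = 782566642          # "extents_compressed_bytes"
POINTS = 1124989481                   # "points"
CURRENTLY_COLLECTED = 593784          # "currently_collected_metrics"
UPDATE_EVERY_SECS = 1                 # "update every"


def estimated_retention_secs(max_disk_bytes, compressed_bytes, points,
                             metrics, update_every):
    """Seconds of data that fit on disk at the observed compression ratio."""
    bytes_per_point = compressed_bytes / points
    bytes_per_sec = bytes_per_point * metrics / update_every
    return max_disk_bytes / bytes_per_sec


if __name__ == "__main__":
    secs = estimated_retention_secs(
        MAX_DISK_SPACE_BYTES, COMPRESSED_BYTES, POINTS,
        CURRENTLY_COLLECTED, UPDATE_EVERY_SECS,
    )
    print(f"estimated retention: {secs / 60:.0f} minutes")
```

With these inputs the estimate comes out at well under an hour of retention, so the database would be rotating old data almost continuously once the disk quota fills.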
Error logs at time of crash
From systemctl we see two different types of crashes occurring: aborts (status=6/ABRT) and segfaults (status=11/SEGV).
1.38 host systemctl logs
May 15 15:24:39 host[1]: netdata.service: main process exited, code=killed, status=11/SEGV
May 15 15:24:39 host[1]: Unit netdata.service entered failed state.
May 15 15:24:39 host[1]: netdata.service failed.
May 15 15:25:09 host[1]: netdata.service holdoff time over, scheduling restart.
May 15 15:25:09 host[1]: Stopped Real time performance monitoring.
May 15 15:25:09 host[1]: Starting Real time performance monitoring...
May 15 15:25:09 host[1]: Started Real time performance monitoring.
May 15 18:35:15 host[1]: netdata.service: main process exited, code=killed, status=6/ABRT
May 15 18:35:15 host[1]: Unit netdata.service entered failed state.
May 15 18:35:15 host[1]: netdata.service failed.
May 15 18:35:45 host[1]: netdata.service holdoff
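To keep track of how often each crash type happens, the systemctl output can be reduced to a list of (time, signal) pairs. The following is a small Python sketch under stated assumptions: the regular expression only targets the "main process exited, code=killed, status=N/SIG" wording seen above, and the year is passed in by hand because these log lines do not carry one.

```python
import re
from datetime import datetime

# Three lines copied from the systemctl output in this post.
LOG = """\
May 15 15:24:39 host[1]: netdata.service: main process exited, code=killed, status=11/SEGV
May 15 15:25:09 host[1]: Started Real time performance monitoring.
May 15 18:35:15 host[1]: netdata.service: main process exited, code=killed, status=6/ABRT
"""

EXIT_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*"
    r"main process exited, code=killed, status=(\d+)/(\w+)"
)


def crashes(text, year=2023):
    """Return [(timestamp, signal_name), ...] for every killed-by-signal exit."""
    events = []
    for line in text.splitlines():
        match = EXIT_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, match.group(3)))
    return events


def gaps_secs(events):
    """Seconds between consecutive crashes."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


if __name__ == "__main__":
    found = crashes(LOG)
    for stamp, signal_name in found:
        print(stamp.isoformat(), signal_name)
    print("gaps (s):", gaps_secs(found))
```

Feeding it a longer journal export would show whether SEGV and ABRT alternate or cluster, and whether the gap between crashes matches the "every 2 hours or so" estimate.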
-- logs of the segfault --
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'received' (received) of 'net.veth692e347' (net.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net.veth692e347' (net.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'up' (up) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'down' (down) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'notpresent' (notpresent) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'lowerlayerdown' (lowerlayerdown) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'testing' (testing) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'dormant' (dormant) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'unknown' (unknown) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'up' (up) of 'net_carrier.veth692e347' (net_carrier.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'down' (down) of 'net_carrier.veth692e347' (net_carrier.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'mtu' (mtu) of 'net_mtu.veth692e347' (net_mtu.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'received' (received) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:10: netdata INFO : SERVICE : Removing obsolete dimension 'multicast' (multicast) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:26: netdata ERROR : WEB[1] : STREAM 'host-replaced' [receive from [xx.xxx.xxx.xx]:55792]: thread 1466 takes too long to stop, giving up...
2023-05-15 15:24:26: netdata INFO : WEB[1] : STREAM 'host-replaced' [receive from [xx.xxx.xxx.xx]:39350]: multiple connections for same host, old connection was used 1584 secs ago (signaled old receiver to stop). STATUS: ALREADY CONNECTED
2023-05-15 15:24:32: netdata INFO : EXPORTING : enabled exporting of host 'host-replace' for instance 'netdata-graphite'
2023-05-15 15:25:09: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 15:25:09: netdata INFO : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: next: initialize ML
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize ML - next: initialize signals
2023-05-15 15:25:09: netdata INFO : MAIN : SIGNAL: Not enabling reaper
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize signals - next: initialize static threads
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize static threads - next: initialize web server
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize web server - next: set resource limits
2023-05-15 15:25:09: netdata INFO : MAIN : resources control: allowed file descriptors: soft = 200000, max = 200000
2023-05-15 15:25:09: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, set resource limits - next: become daemon
-- logs of status=6/ABRT, flood protection kicked in --
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'up' (up) of 'net_carrier.veth9749d0d' (net_carrier.veth9749d0d).
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'down' (down) of 'net_carrier.veth9749d0d' (net_carrier.veth9749d0d).
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'mtu' (mtu) of 'net_mtu.veth9749d0d' (net_mtu.veth9749d0d).
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'received' (received) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 340 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 860 seconds.
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: next: initialize ML
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize ML - next: initialize signals
2023-05-15 18:35:45: netdata INFO : MAIN : SIGNAL: Not enabling reaper
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize signals - next: initialize static threads
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize static threads - next: initialize web server
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize web server - next: set resource limits
2023-05-15 18:35:45: netdata INFO : MAIN : resources co
-- 1.37 logs, ABRT --
2023-05-15 21:21:10: netdata INFO : SERVICE : Removing obsolete dimension 'up' (up) of 'net_operstate.vethd1ffa06' (net_operstate.vethd1ffa06).
2023-05-15 21:21:10: netdata INFO : SERVICE : Removing obsolete dimension 'down' (down) of 'net_operstate.vethd1ffa06' (net_operstate.vethd1ffa06).
2023-05-15 21:21:10: netdata LOG FLOOD PROTECTION too many logs (201 logs in 259 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 941 seconds.
corrupted double-linked list
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2023-05-15 21:24:06: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
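One thing worth noting about the logs above: both ABRT crashes happened while log flood protection was active, so whatever Netdata logged at the moment of the crash may have been suppressed. The small Python sketch below works out the suppression window for the 1.38 case. It assumes the suppression length is simply the threshold period minus the elapsed time, which is inferred from the two messages in this post (1200 - 340 = 860 and 1200 - 259 = 941) and not from the Netdata source.

```python
from datetime import datetime, timedelta


def suppression_window(message_time, elapsed_secs, period_secs):
    """Return (start, end) of the interval in which logging is suppressed."""
    remaining = period_secs - elapsed_secs
    return message_time, message_time + timedelta(seconds=remaining)


# 1.38 parent: flood protection message at 18:32:30 (201 logs in 340 s,
# period 1200 s); systemctl reports the ABRT at 18:35:15.
start, end = suppression_window(datetime(2023, 5, 15, 18, 32, 30), 340, 1200)
crash = datetime(2023, 5, 15, 18, 35, 15)
crash_hidden = start <= crash < end

if __name__ == "__main__":
    print("suppressed from", start.time(), "to", end.time())
    print("crash inside suppression window:", crash_hidden)
```

If that reading is right, raising the flood protection threshold, or silencing the "Removing obsolete dimension" noise from the veth interfaces, should let the actual crash message reach error.log.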