Regular Crashes and High CPU usage on 1.38.1 Parent

RandomAndrew · May 16, 2023, 5:22pm

Background
We run Netdata on roughly 800 hosts. Until recently they all reported to a single parent running 1.35.1; we stayed on that version because newer versions had not been stable for us. Recently, however, 1.35.1 began crashing, and we were seeing the following error in our logs:

2023-05-09 19:07:42: netdata FATAL : STREAM_RECEIVER[...]:59739] : Cannot allocate dbengine page cache page, with mmap() # : Cannot allocate memory

We found an issue suggesting that setting page cache uses malloc = yes in the config might resolve this, but we saw no improvement. After the config change the failure was a malloc error instead of an mmap error.

We are ingesting a little over 2 million metrics per second, so we decided to split the load across 2 parents. On the new parent we also upgraded to 1.38.1. It starts up faster, but we still see crashes every 2 hours or so on both parents.

For the new parent we tried upgrading to 1.38 to see if that would improve the situation. Crashes still occur, and in addition, after ~20 minutes the CPU rises to 70% to 90% or higher, along with more frequent del and acquire operations in the DBengine "Cache Operations" stats. Previously on 1.37 we were only seeing about 20 to 30% CPU usage on the same hardware (instance size: 16 vCPU, 32 GB RAM, 100 GB volume with 12000 IOPS).

1.38.1 typical cpu usage

1.38.1 cache purges (purges start up after running 20 minutes)

Since the CPU is running so hot on 1.38.1, we are running the other parent on 1.37.1, which sits at around 30% CPU all the time but still crashes on a regular basis.

–logs of crash in 1.37–
corrupted double-linked list
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.

–logs of crash in 1.38—
2023-05-15 18:32:30: netdata INFO : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 340 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 860 seconds.
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 18:35:45: netdata INFO : MAIN : NETDATA STARTUP: next: initialize ML

Consistent cpu usage on 1.37.1

Help would be appreciated with the frequent crashes, the increased CPU usage, and tuning the parents.
Below are metrics and stats from the parents, their configuration, and logs of the crashes from systemctl and error.log.

dbengine_stats of 1.38.1 parent
		"default_granularity_secs":1,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":41,
		"extents":17654,
		"extents_pages":1129856,
		"points":1124989481,
		"metrics":1129690,
		"metrics_pages":1129856,
		"extents_compressed_bytes":782566642,
		"pages_uncompressed_bytes":4499957924,
		"pages_duration_secs":1144164696,
		"single_point_pages":128,
		"first_t":1684175700,
		"last_t":5940368799,
		"database_retention_secs":4256193099,
		"average_compression_savings":82.61,
		"average_point_duration_secs":1.02,
		"average_metric_retention_secs":1012.81,
		"ephemeral_metrics_per_day_percent":inf,
		"average_page_size_bytes":3982.77,
		"estimated_concurrently_collected_metrics":0,
		"currently_collected_metrics":593784,
		"disk_space":1012927092,
		"max_disk_space":1048576000
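
As a sanity check on the stats above, the sketch below recomputes the retention window from first_t and last_t. It assumes both fields are Unix epoch seconds in UTC, which this thread does not confirm, and the function name is made up for illustration.

```python
from datetime import datetime, timezone


def retention_window(first_t: int, last_t: int) -> dict:
    """Return the span between two epoch timestamps and the dates they map to."""
    return {
        "retention_secs": last_t - first_t,
        "first": datetime.fromtimestamp(first_t, tz=timezone.utc),
        "last": datetime.fromtimestamp(last_t, tz=timezone.utc),
    }


# Values copied from the 1.38.1 stats above.
window = retention_window(1684175700, 5940368799)
print(window["retention_secs"])  # compare with database_retention_secs
print(window["first"].date(), window["last"].date())
```

With these inputs the span equals the reported database_retention_secs, and last_t maps to a date well past the year 2100, so the retention figure reported by this parent is probably not meaningful.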

Dbengine_stats of the 1.37.1 parent
	"tier0": {
		"default_granularity_secs":1,
		"sizeof_metric":144,
		"sizeof_metric_in_index":40,
		"sizeof_page":64,
		"sizeof_page_in_index":24,
		"sizeof_extent":32,
		"sizeof_page_in_extent":8,
		"sizeof_datafile":96,
		"sizeof_page_in_cache":144,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":4,
		"extents":107436,
		"extents_pages":6859484,
		"points":7013132213,
		"metrics":1472901,
		"metrics_pages":8323111,
		"extents_compressed_bytes":3944878800,
		"pages_uncompressed_bytes":28052528852,
		"pages_duration_secs":7041223291,
		"single_point_pages":906,
		"first_t":1684161305,
		"last_t":1684187872,
		"database_retention_secs":26567,
		"average_compression_savings":85.94,
		"average_point_duration_secs":1.00,
		"average_metric_retention_secs":4780.51,
		"ephemeral_metrics_per_day_percent":1482.12,
		"average_page_size_bytes":4089.60,
		"estimated_concurrently_collected_metrics":265036,
		"currently_collected_metrics":1463627,
		"max_concurrently_collected_metrics":1463627,
		"disk_space":4591628288,
		"max_disk_space":52428800000
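
For a rough feel of how quickly each parent reaches its disk quota, the two stats blocks above can be combined as in the sketch below. It is only an estimate under stated assumptions: every currently collected metric produces one point per second (update every = 1), and the on-disk bytes per point stay at the observed average. The function name is hypothetical.

```python
def seconds_to_fill(max_disk_space: int, compressed_bytes: int,
                    points: int, metrics_per_sec: int) -> float:
    """Estimate how long it takes to write max_disk_space bytes of compressed data."""
    bytes_per_point = compressed_bytes / points  # observed on-disk average
    return max_disk_space / (bytes_per_point * metrics_per_sec)


# 1.38.1 parent (1000 MB quota), values from its dbengine stats.
fill_138 = seconds_to_fill(1048576000, 782566642, 1124989481, 593784)
# 1.37.1 parent (50000 MB quota), values from its dbengine stats.
fill_137 = seconds_to_fill(52428800000, 3944878800, 7013132213, 1463627)

print(f"1.38.1: ~{fill_138 / 60:.0f} min, 1.37.1: ~{fill_137 / 3600:.1f} h")
```

Under these assumptions the 1000 MB quota on the 1.38.1 parent fills in well under an hour, so old datafiles would be rotated out far more often there than on the 1.37.1 parent.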

BUILD INFO


Version: netdata v1.38.1
Configure options:  '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--disable-dependency-tracking' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (bundled)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

1.37.1 build info
Version: netdata v1.37.1
Configure options:  '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-protobuf' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--disable-dependency-tracking' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
Install type: binpkg-rpm
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (bundled)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

Configuration


[global]
        process scheduling policy = keep
        update every = 1

[db]
        mode = dbengine
        storage tiers = 1
        dbengine multihost disk space MB = 1000
        dbengine page cache size MB = 32
        cleanup obsolete charts after secs = 3600
        cleanup orphan hosts after secs = 3600
        delete obsolete charts files = yes
        delete orphan hosts files = yes
        memory deduplication (ksm) = yes
        update every = 1
[plugins]
        enable running new plugins = no

[web]
        bind to = 0.0.0.0:19999 unix:/var/run/netdata/netdata.sock
        #accept a streaming request every seconds = 1
        allow dashboard from = localhost
        allow streaming from = [removed ip list]
        allow netdata.conf from = localhost

[statsd]
        enabled = no

[registry]
        enabled = no

[health]
        enabled = no

[ml]
        enabled = no

[exporting:global]
    enabled = yes
    send configured labels = no
    send automatic labels = no
    update every = 10


[graphite:netdata-graphite]
    enabled = yes
    destination = netdata-graphite.example.com "changed for privacy"
    data source = average
    prefix = netdata
    hostname = nonprod-netdata
    update every = 10
    buffer on failures = 10
    timeout ms = 20000
    send charts matching = apps.cpu apps.mem apps.swap !apps !apps.* !cgroup_*newrelic* cgroup_* !cpu.* !disk_avgsz.* !disk_backlog.* !disk_iotime.* disk_ops.* !disk_space.reserved_for_root disk_space.* disk_util.* !ip.* !ipv4.* !ipv6.* !net.veth* net.* !netdata.* !services.* !users !users* !users.* disk.* !disk_await.* !disk_inodes.* !disk_mops.* !disk_qops.* !disk_svctm.* !groups !groups* !groups.* mem.* !net_packets.* !netfilter system.cpu system.load system.io system.ram system.swap system.net system.uptime !system.* !*
    send hosts matching = localhost *
    send names instead of ids = yes
    send configured labels = no
    send automatic labels = no

Currently we have "accept a streaming request every seconds" turned off, just to help with recovering from the frequent crashes.
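
To put the dbengine page cache size MB = 32 setting in context, the sketch below compares it with the memory needed if every collected metric keeps one page open in memory. That one-page-per-metric model is an assumption made for illustration, not something confirmed for 1.38.x in this thread; the page size is taken from sizeof_page_data in the stats.

```python
PAGE_BYTES = 4096  # sizeof_page_data from the dbengine stats


def hot_page_mib(collected_metrics: int, page_bytes: int = PAGE_BYTES) -> float:
    """Memory in MiB if each collected metric holds one partially filled page."""
    return collected_metrics * page_bytes / (1024 * 1024)


configured_mb = 32                 # dbengine page cache size MB in [db]
needed_mib = hot_page_mib(593784)  # currently_collected_metrics, 1.38.1 parent

print(f"configured {configured_mb} MB, open pages alone ~{needed_mib:.0f} MiB")
```

If the assumption holds, the configured cache is far smaller than the working set of open pages, which may be worth keeping in mind when reading the cache purge charts.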

Error logs at time of crash
From systemctl we see 2 different types of crashes occurring: abrupt exits (ABRT) and segfaults (SEGV).


1.38 host systemctl logs
May 15 15:24:39 host[1]: netdata.service: main process exited, code=killed, status=11/SEGV
May 15 15:24:39 host[1]: Unit netdata.service entered failed state.
May 15 15:24:39 host[1]: netdata.service failed.
May 15 15:25:09 host[1]: netdata.service holdoff time over, scheduling restart.
May 15 15:25:09 host[1]: Stopped Real time performance monitoring.
May 15 15:25:09 host[1]: Starting Real time performance monitoring...
May 15 15:25:09 host[1]: Started Real time performance monitoring.
May 15 18:35:15 host[1]: netdata.service: main process exited, code=killed, status=6/ABRT
May 15 18:35:15 host[1]: Unit netdata.service entered failed state.
May 15 18:35:15 host[1]: netdata.service failed.
May 15 18:35:45 host[1]: netdata.service holdoff
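
A small sketch for pulling the crash times and signals out of journal lines like the ones above, so the interval between crashes can be tracked. The regular expression is written against the exact wording shown here, and the year has to be supplied because the journal lines do not carry one; both are assumptions.

```python
import re
from datetime import datetime

EXIT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\d+)/(?P<signal>\w+)"
)


def parse_crashes(lines, year=2023):
    """Return (timestamp, signal) for every 'main process exited' line."""
    crashes = []
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            crashes.append((ts, m["signal"]))
    return crashes


sample = [
    "May 15 15:24:39 host[1]: netdata.service: main process exited, code=killed, status=11/SEGV",
    "May 15 15:25:09 host[1]: Started Real time performance monitoring.",
    "May 15 18:35:15 host[1]: netdata.service: main process exited, code=killed, status=6/ABRT",
]

crashes = parse_crashes(sample)
for (prev_ts, _), (ts, sig) in zip(crashes, crashes[1:]):
    print(sig, "after", ts - prev_ts)
```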

–logs of the seg fault—

2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'received' (received) of 'net.veth692e347' (net.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net.veth692e347' (net.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'up' (up) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'down' (down) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'notpresent' (notpresent) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'lowerlayerdown' (lowerlayerdown) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'testing' (testing) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'dormant' (dormant) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'unknown' (unknown) of 'net_operstate.veth692e347' (net_operstate.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'up' (up) of 'net_carrier.veth692e347' (net_carrier.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'down' (down) of 'net_carrier.veth692e347' (net_carrier.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'mtu' (mtu) of 'net_mtu.veth692e347' (net_mtu.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'received' (received) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:10: netdata INFO  : SERVICE : Removing obsolete dimension 'multicast' (multicast) of 'net_packets.veth692e347' (net_packets.veth692e347).
2023-05-15 15:24:26: netdata ERROR : WEB[1] : STREAM 'host-replaced' [receive from [xx.xxx.xxx.xx]:55792]: thread 1466 takes too long to stop, giving up...
2023-05-15 15:24:26: netdata INFO  : WEB[1] : STREAM 'host-replaced' [receive from [xx.xxx.xxx.xx]:39350]: multiple connections for same host, old connection was used 1584 secs ago (signaled old receiver to stop). STATUS: ALREADY CONNECTED
2023-05-15 15:24:32: netdata INFO  : EXPORTING : enabled exporting of host 'host-replace' for instance 'netdata-graphite'
2023-05-15 15:25:09: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 15:25:09: netdata INFO  : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: next: initialize ML
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize ML - next: initialize signals
2023-05-15 15:25:09: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize signals - next: initialize static threads
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize static threads - next: initialize web server
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize web server - next: set resource limits
2023-05-15 15:25:09: netdata INFO  : MAIN : resources control: allowed file descriptors: soft = 200000, max = 200000
2023-05-15 15:25:09: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, set resource limits - next: become daemon

—Logs of status=6/ABRT, flood protection kicked in–

2023-05-15 18:32:30: netdata INFO  : SERVICE : Removing obsolete dimension 'up' (up) of 'net_carrier.veth9749d0d' (net_carrier.veth9749d0d).
2023-05-15 18:32:30: netdata INFO  : SERVICE : Removing obsolete dimension 'down' (down) of 'net_carrier.veth9749d0d' (net_carrier.veth9749d0d).
2023-05-15 18:32:30: netdata INFO  : SERVICE : Removing obsolete dimension 'mtu' (mtu) of 'net_mtu.veth9749d0d' (net_mtu.veth9749d0d).
2023-05-15 18:32:30: netdata INFO  : SERVICE : Removing obsolete dimension 'received' (received) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata INFO  : SERVICE : Removing obsolete dimension 'sent' (sent) of 'net_packets.veth9749d0d' (net_packets.veth9749d0d).
2023-05-15 18:32:30: netdata LOG FLOOD PROTECTION too many logs (201 logs in 340 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 860 seconds.
2023-05-15 18:35:45: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
2023-05-15 18:35:45: netdata INFO  : MAIN : TIMEZONE: fixed as 'UTC'
2023-05-15 18:35:45: netdata INFO  : MAIN : NETDATA STARTUP: next: initialize ML
2023-05-15 18:35:45: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize ML - next: initialize signals
2023-05-15 18:35:45: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
2023-05-15 18:35:45: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize signals - next: initialize static threads
2023-05-15 18:35:45: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize static threads - next: initialize web server
2023-05-15 18:35:45: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize web server - next: set resource limits
2023-05-15 18:35:45: netdata INFO  : MAIN : resources co

1.37 --logs – ABRT–
2023-05-15 21:21:10: netdata INFO  : SERVICE : Removing obsolete dimension 'up' (up) of 'net_operstate.vethd1ffa06' (net_operstate.vethd1ffa06).
2023-05-15 21:21:10: netdata INFO  : SERVICE : Removing obsolete dimension 'down' (down) of 'net_operstate.vethd1ffa06' (net_operstate.vethd1ffa06).
2023-05-15 21:21:10: netdata LOG FLOOD PROTECTION too many logs (201 logs in 259 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 941 seconds.
corrupted double-linked list
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2023-05-15 21:24:06: netdata INFO  : MAIN : TIMEZONE: using strftime(): 'UTC'
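
The two LOG FLOOD PROTECTION messages above are consistent with the logger muting itself for the remainder of the 1200-second window (1200 - 340 = 860 and 1200 - 259 = 941). That reading is inferred from the numbers in these logs, not from Netdata's source. The sketch below extracts the fields so the muted period can be compared with the crash time; the function name is hypothetical.

```python
import re

FLOOD_RE = re.compile(
    r"\((?P<count>\d+) logs in (?P<elapsed>\d+) seconds, "
    r"threshold is set to (?P<limit>\d+) logs in (?P<period>\d+) seconds\)\. "
    r"Preventing more logs from process \S+ for (?P<muted>\d+) seconds"
)


def flood_window(line):
    """Extract the flood-protection numbers and the time left in the window."""
    m = FLOOD_RE.search(line)
    if m is None:
        return None
    fields = {name: int(value) for name, value in m.groupdict().items()}
    fields["window_remaining"] = fields["period"] - fields["elapsed"]
    return fields


line_138 = ("2023-05-15 18:32:30: netdata LOG FLOOD PROTECTION too many logs "
            "(201 logs in 340 seconds, threshold is set to 200 logs in 1200 seconds). "
            "Preventing more logs from process 'netdata' for 860 seconds.")
info = flood_window(line_138)
print(info["muted"], info["window_remaining"])  # 860 860
```

On the 1.38 host the mute started at 18:32:30 and systemctl reports the abort at 18:35:15, so any message written just before that abort may be missing from error.log.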

RandomAndrew · May 17, 2023, 6:25pm

We moved the 1.38.1 parent to 1.39.0 and saw average CPU drop to 12%. We are still observing the instance for further improvement.

We are still getting crashes on 1.39.0, but we’ve tested altering cleanup obsolete charts after secs and cleanup orphan hosts after secs, since the error logs cluster around times of cleanup activity. When we increased these to 12 hours we didn’t observe any crashes overnight. This morning we tried a test with a shorter time (10 minutes) and the instance crashed within 20 minutes. We are continuing to observe these changes.

Manolis_Vasilakis · May 18, 2023, 9:05am

Hi @RandomAndrew

Thank you for the report. We will start investigating.

RandomAndrew · May 18, 2023, 8:25pm

Over the previous day we ran both parents on 1.39.0. CPU usage on both hosts was low and steady at 12/20%.
Neither host crashed with cleanup obsolete charts and cleanup orphan hosts set to 24 hours. We had one issue (#1, listed below) on this version. We decided to upgrade to 1.39.1 this morning and encountered a second issue (#2, listed further below).

cleanup obsolete charts after secs = 86400
cleanup orphan hosts after secs = 86400

Issue 1: on 1.39.0
One issue we observed on the parent serving more hosts (~480) was that every 10 minutes the instance would be delayed exporting to Graphite. In the Netdata web UI this coincides with cache purge activity, and error.log shows many lines about creating journal files.

example logs we would see around these 10 minute points

2023-05-17 19:33:40: netdata INFO  : LIBUV_WORKER : DBENGINE: creating new data and journal files in path /var/cache/netdata/dbengine
2023-05-17 19:33:40: netdata INFO  : LIBUV_WORKER : DBENGINE: created data file "/var/cache/netdata/dbengine/datafile-1-0000000007.ndf".
2023-05-17 19:33:40: netdata INFO  : LIBUV_WORKER : DBENGINE: created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000007.njf".

example gap in graphite

This morning we deployed both parents on version 1.39.1.
The parent for ~180 hosts has so far performed similarly to 1.39.0.
Issue 2: on 1.39.1
On the parent for the 480 hosts, after about an hour we started to see the interval between exported metrics fluctuate between 10 seconds and 20 seconds, which was not happening on 1.39.0. We are unsure what tuning might be needed to fix this and get a steady resolution for exported metrics.

We have not tested lowering the cleanup times again on 1.39.1.
dbengine stats of the main parent serving 480 hosts

{
	"tier0": {
		"default_granularity_secs":1,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":11,
		"extents":133815,
		"extents_pages":8564160,
		"points":8034337087,
		"metrics":7644540,
		"metrics_pages":8564160,
		"extents_compressed_bytes":4567179904,
		"pages_uncompressed_bytes":32137348348,
		"pages_duration_secs":8095233088,
		"single_point_pages":534,
		"first_t":1684392890,
		"last_t":1684439473,
		"database_retention_secs":46583,
		"average_compression_savings":85.79,
		"average_point_duration_secs":1.01,
		"average_metric_retention_secs":1058.96,
		"ephemeral_metrics_per_day_percent":7973.54,
		"average_page_size_bytes":3752.54,
		"estimated_concurrently_collected_metrics":173780,
		"currently_collected_metrics":1501180,
		"disk_space":6881233996,
		"max_disk_space":52428800000
	}
}

dbengine stats of the parent serving 180 hosts

{
	"tier0": {
		"default_granularity_secs":1,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":8,
		"extents":87168,
		"extents_pages":5578752,
		"points":5371056207,
		"metrics":4132316,
		"metrics_pages":5578752,
		"extents_compressed_bytes":3579563902,
		"pages_uncompressed_bytes":21484224828,
		"pages_duration_secs":5615269603,
		"single_point_pages":226,
		"first_t":1684389470,
		"last_t":1684439735,
		"database_retention_secs":50265,
		"average_compression_savings":83.34,
		"average_point_duration_secs":1.05,
		"average_metric_retention_secs":1358.87,
		"ephemeral_metrics_per_day_percent":6186.36,
		"average_page_size_bytes":3851.08,
		"estimated_concurrently_collected_metrics":111713,
		"currently_collected_metrics":663973,
		"disk_space":4993923908,
		"max_disk_space":52428800000
	}
}

RandomAndrew · May 19, 2023, 12:11am

We reverted the parent back to 1.39.0 to alleviate the exporting issues. Today we also wanted to change the instance type from AMD to Intel (C6i EC2), but we encountered an issue similar to the one on 1.39.1, with exporting only working every 20 seconds; it would occasionally improve but deteriorate soon after. The above screenshot shows the sporadic data while on the Intel chip (16:26 to 16:34) and then after we reverted back to AMD (16:36). Both time segments are on the same version, 1.39.0.

We tried a few restarts, along with changing the CPU scheduling policy to RR and others, but with no impact.

view of when we had netdata on an intel chip, sporadic exporting

The smaller parent has been on an Intel chip without the same issues. It exports to the same Graphite instance without any problem, so we are confused as to why the other parent would change so dramatically just from swapping instance types.

Typically what we are seeing on the exporting thread is CPU usage on every other 10-second run, while the following export has no CPU usage at all, just a system/user dot.
Example of what it looks like:

Manolis_Vasilakis · May 22, 2023, 11:30am

Hi @RandomAndrew

Is the memory mode for children (in stream.conf) set to dbengine?

Jeremiah_Pierucci · May 22, 2023, 4:01pm

Hi. I’m working with @RandomAndrew on this.

Yes, it looks like this:

[HIDDEN]
enabled = yes
default history = 3600
memory mode = dbengine
health enabled by default = no
health enabled = no
allow from = HIDDEN

RandomAndrew · June 8, 2023, 8:51pm

Hi @Manolis_Vasilakis

We are still seeing the issues listed above. I was wondering if you had found anything during your investigation.

The crashes from cleanup of obsolete hosts/charts still occur every 24 hours; this is mostly mitigated.
Every 10 minutes the exporting iteration can skip, which seems related to the LIBUV_WORKER updating journal files.
Exporting goes from every 10 seconds to every 20 seconds when enough hosts are streaming to the parent; the exporting of hosts seems to skip an iteration.

When these skips occur (in either scenario) we do not see exporting failures under netdata.exporting_netdata-graphite_ops.

Looking through export.c, my understanding is that instances shouldn’t skip scheduling for an iteration as long as their individual host update every is greater than the parent’s update every.
The exporting thread would also log an error if it was not waking up at the expected time, and we do not see any such errors in the logs.

The iterations also don’t seem to last more than 3 seconds, so the next thread heartbeat is being scheduled and occurs on time, but there is little activity at that time, suggesting there are possibly no hosts scheduled in the buffer.
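
One way to quantify the skipped iterations per host is to take the datapoint timestamps of a single exported series from Graphite and look for intervals longer than the exporter's update every = 10. The sketch below does only that; the timestamps are made up for illustration and the function name is hypothetical.

```python
def find_skipped_iterations(timestamps, expected_every=10, tolerance=1):
    """Return (previous_ts, current_ts, skipped) for each gap longer than expected."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > expected_every + tolerance:
            skipped = round(delta / expected_every) - 1
            gaps.append((prev, cur, skipped))
    return gaps


# Hypothetical datapoint times for one exported series, in epoch seconds.
points = [1684439400, 1684439410, 1684439430, 1684439440, 1684439460]
gaps = find_skipped_iterations(points)
print(gaps)
```

Running this per host would show whether all hosts skip the same iterations or whether the skips are spread across different hosts, which might help narrow down where the scheduling goes wrong.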
