Netdata shuts down randomly

Hello,

My Netdata master server, which stores streaming data from ~25 nodes, stops randomly on a regular basis.

The Netdata version is 1.44.1, running on Ubuntu 22.04.

In the journalctl logs, I see some odd messages just before the shutdown:

Jan 22 23:12:40 ns3117200 netdata[3279635]: METRIC: refcount is 0 (zero or negative) during release
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0xc6f69)[0x55a637a35f69]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x37608b)[0x55a637ce508b]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x377045)[0x55a637ce6045]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x36ca1d)[0x55a637cdba1d]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x7b0f6)[0x55a6379ea0f6]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x7b5a3)[0x55a6379ea5a3]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0xd47a5)[0x55a637a437a5]

Here is the full log from the last shutdown:

Jan 22 23:12:37 ns3117200 netdata[3279635]: DBENGINE: error while reading extent from datafile 1283 of tier 2, at offset 544768 (4350 bytes) expected page (DESCR) from 1705>
Jan 22 23:12:39 ns3117200 netdata[3279635]: DBENGINE: error while reading extent from datafile 11263 of tier 1, at offset 3104768 (6983 bytes) expected page (DESCR) from 17>
Jan 22 23:12:40 ns3117200 netdata[3279635]: METRIC: refcount is 0 (zero or negative) during release
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0xc6f69)[0x55a637a35f69]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x37608b)[0x55a637ce508b]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x377045)[0x55a637ce6045]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x36ca1d)[0x55a637cdba1d]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x7b0f6)[0x55a6379ea0f6]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0x7b5a3)[0x55a6379ea5a3]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /usr/sbin/netdata(+0xd47a5)[0x55a637a437a5]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f023b417ac3]
Jan 22 23:12:41 ns3117200 netdata[3279635]: /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f023b4a9850]
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: initializing shutdown with code 1...
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: next: create shutdown file
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in       0 ms, create shutdown file - next: dbengine exit mode
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in       0 ms, dbengine exit mode - next: close webrtc connections
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in       0 ms, close webrtc connections - next: disable maintenance, new queries, new web requests, new stream>
Jan 22 23:12:41 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in       0 ms, disable maintenance, new queries, new web requests, new streaming connections and aclk - next: >
Jan 22 23:12:41 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 7 services [ WEB_SERVER HEALTH ] to exit: 'HEALTH' (3279842), 'WEB[1]' (3279847), 'WE>
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 8 connects, 8 disconnects (max concurrent 3), 197 receptions and 217 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 12 connects, 12 disconnects (max concurrent 3), 171 receptions and 157 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 10 connects, 10 disconnects (max concurrent 3), 230 receptions and 217 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 6 connects, 6 disconnects (max concurrent 2), 146 receptions and 119 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 3 connects, 3 disconnects (max concurrent 1), 47 receptions and 38 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: stopped after 11 connects, 11 disconnects (max concurrent 2), 144 receptions and 111 sends
Jan 22 23:12:41 ns3117200 netdata[3279635]: closing all web server sockets...
Jan 22 23:12:41 ns3117200 netdata[3279635]: all static web threads stopped.
Jan 22 23:12:42 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in      50 ms, stop replication, exporters, health and web servers threads - next: stop collectors and streami>
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279863
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', stopping plugin thread: plugin:ebpf
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STATSD: stopping data collection thread 1...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279863 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279884
Jan 22 23:12:42 ns3117200 netdata[3279635]: TC: killing with SIGTERM tc-qos-helper process 3279906
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STATSD: closing sockets...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279898
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279888
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279881
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279888 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279881 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279898 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STATSD: cleanup completed.
Jan 22 23:12:42 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 43 services [ COLLECTORS STREAMING ] to exit:
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'AWS-RQ-WORKER-1_PROD' [receive from [35.181.237.156]:50144]: receive thread ended (task id 3280121)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRA-APACHE_PROD' [receive from [162.19.125.36]:53798]: receive thread ended (task id 3280142)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRA-RQ-WORKER_PROD' [receive from [162.19.125.38]:57728]: receive thread ended (task id 3280144)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'RBX-RQ-WORKER_PROD' [receive from [57.128.117.173]:53328]: receive thread ended (task id 3280124)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'OPENVPN_2FA' [receive from [51.77.228.61]:34852]: receive thread ended (task id 3280148)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRA-MYSQL-MASTER_PROD' [receive from [162.19.125.39]:44214]: receive thread ended (task id 3280419)
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'RBX-MYSQL-MASTER_PROD' [receive from [57.128.117.175]:42322]: receive thread ended (task id 3280143)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GITLAB' [receive from [176.31.234.171]:53230]: receive thread ended (task id 3280149)
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279882
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279882 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279884 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'AWS-RQ-WORKER-3_PROD' [receive from [35.181.229.18]:51008]: receive thread ended (task id 3280125)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'SBG-MYSQL-REPLICA_PROD' [receive from [162.19.18.195]:57898]: receive thread ended (task id 3280139)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'TEST.IPAIDTHAT.IO_TEST' [receive from [51.68.34.229]:59058]: receive thread ended (task id 3280141)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'QA.IPAIDTHAT.IO_QA' [receive from [135.125.6.109]:49982]: receive thread ended (task id 3280421)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRA-REDIS-ELASTIC-MASTER_PROD' [receive from [162.19.125.37]:57272]: receive thread ended (task id 3280420)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'AWS-RQ-WORKER-2_PROD' [receive from [13.37.137.17]:56790]: receive thread ended (task id 3280146)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'CHEF-SERVER-JENKINS' [receive from [51.68.42.175]:54852]: receive thread ended (task id 3280127)
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="received terminated signal (15). Terminating..." plugin=go.d component=agent
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', stopping plugin thread: plugin:systemd-journal
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: cleanup completed.
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRAYLOG-4' [receive from [51.210.188.43]:58022]: receive thread ended (task id 3280130)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'SONARQUBE' [receive from [51.83.15.55]:54534]: receive thread ended (task id 3280244)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'GRA-PROXY_PROD' [receive from [51.75.193.86]:48340]: receive thread ended (task id 3280425)
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component="discovery manager"
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'STAGE.IPAIDTHAT.IO' [receive from [149.202.70.134]:49000]: receive thread ended (task id 3280422)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'AWS-RQ-WORKER-5_PROD' [receive from [35.181.225.36]:34676]: receive thread ended (task id 3280137)
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', killing data collection child process with pid 3279889
Jan 22 23:12:42 ns3117200 netdata[3279635]: PLUGINSD: 'host:NETDATA-MASTER', waiting for data collection child process pid 3279889 to exit...
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'AWS-RQ-WORKER-4_PROD' [receive from [35.181.242.117]:33850]: receive thread ended (task id 3280519)
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'RBX-REDIS-ELASTIC-REPLICA_PROD' [receive from [57.128.117.172]:48306]: receive thread ended (task id 3280129)
Jan 22 23:12:42 ns3117200 netdata[3279635]: SIGNAL: reap_child(3279863) killed by signal: 15
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=test_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=qa_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=devops_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_registry_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=sonarqube_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=x509check job=demo_ipaidthat_io
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=systemdunits job=service-units
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=chrony job=local
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=httpcheck job=httpcheck
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=logind job=logind
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg=stopped plugin=go.d collector=web_log job=nginx
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component="job manager"
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component="discovery file"
Jan 22 23:12:42 ns3117200 netdata[3279635]: STREAM 'RBX_APACHE_PROD' [receive from [57.128.117.174]:37468]: receive thread ended (task id 3280423)
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component="filestatus manager"
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component="functions manager"
Jan 22 23:12:42 ns3117200 netdata[3279882]: level=info msg="instance is stopped" plugin=go.d component=agent
Jan 22 23:12:42 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 3 services [ COLLECTORS ] to exit: 'P[cgroups]' (3279866), 'PD[ebpf]' (3279880), 'P[c>
Jan 22 23:12:42 ns3117200 netdata[3279635]: cleaning up...
Jan 22 23:12:42 ns3117200 netdata[3279635]: waiting for discovery thread to finish...
Jan 22 23:12:42 ns3117200 netdata[3279635]: discovery thread stopped
Jan 22 23:12:42 ns3117200 netdata[3279635]: ACLK SYNC: Shutting down ACLK synchronization event loop
Jan 22 23:12:42 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 2 services [ COLLECTORS ] to exit: 'P[cgroups]' (3279866), 'PD[ebpf]' (3279880)
Jan 22 23:12:42 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ COLLECTORS ] to exit: 'PD[ebpf]' (3279880)
Jan 22 23:12:43 ns3117200 ebpf.plugin[3279898]: eBPF cannot unload all threads on time, but it will go away
Jan 22 23:12:43 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ COLLECTORS ] to exit: 'PD[ebpf]' (3279880)
Jan 22 23:12:44 ns3117200 ebpf.plugin[3279898]: eBPF cannot unload all threads on time, but it will go away
Jan 22 23:12:44 ns3117200 netdata[3279635]: SIGNAL: reap_child(3279884) killed by signal: 15
Jan 22 23:12:44 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in    2058 ms, stop collectors and streaming threads - next: stop replication threads
Jan 22 23:12:44 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ REPLICATION ] to exit: 'REPLAY[1]' (3279854)
Jan 22 23:12:44 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in      50 ms, stop replication threads - next: prepare metasync shutdown
Jan 23 00:23:55 ns3117200 netdata[3279635]: SIGNAL: waitid(3295840): failed - it seems the child is already reaped
Jan 23 06:23:55 ns3117200 netdata[3279635]: SIGNAL: waitid(3297607): failed - it seems the child is already reaped

The `service netdata status` command shows the service as running, but the web UI is not available:

root@ns3117200:~# service netdata status
● netdata.service - Real time performance monitoring
     Loaded: loaded (/lib/systemd/system/netdata.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/netdata.service.d
             └─override.conf
     Active: active (running) since Mon 2024-01-22 12:21:49 UTC; 18h ago
   Main PID: 3279635 (netdata)
      Tasks: 63 (limit: 37998)
     Memory: 1.6G
        CPU: 6h 18min 13.558s
     CGroup: /system.slice/netdata.service
             ├─3279635 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
             └─3279637 /usr/sbin/netdata --special-spawn-server

Jan 22 23:12:42 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ COLLECTORS ] to exit: 'PD[ebpf]' (3279880)
Jan 22 23:12:43 ns3117200 ebpf.plugin[3279898]: eBPF cannot unload all threads on time, but it will go away
Jan 22 23:12:43 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ COLLECTORS ] to exit: 'PD[ebpf]' (3279880)
Jan 22 23:12:44 ns3117200 ebpf.plugin[3279898]: eBPF cannot unload all threads on time, but it will go away
Jan 22 23:12:44 ns3117200 netdata[3279635]: SIGNAL: reap_child(3279884) killed by signal: 15
Jan 22 23:12:44 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in    2058 ms, stop collectors and streaming threads - next: stop replication threads
Jan 22 23:12:44 ns3117200 netdata[3279635]: SERVICE CONTROL: waiting for the following 1 services [ REPLICATION ] to exit: 'REPLAY[1]' (3279854)
Jan 22 23:12:44 ns3117200 netdata[3279635]: NETDATA SHUTDOWN: in      50 ms, stop replication threads - next: prepare metasync shutdown
Jan 23 00:23:55 ns3117200 netdata[3279635]: SIGNAL: waitid(3295840): failed - it seems the child is already reaped
Jan 23 06:23:55 ns3117200 netdata[3279635]: SIGNAL: waitid(3297607): failed - it seems the child is already reaped
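
Since systemd still reports the unit as active while the dashboard is unreachable, a check that looks at the web port rather than the unit state may be more telling. Below is a minimal sketch, assuming Netdata listens on its default port 19999 on localhost (adjust if your configuration differs). The function name `web_ui_reachable` is hypothetical and not part of Netdata.

```python
import socket


def web_ui_reachable(host="127.0.0.1", port=19999, timeout=3.0):
    """Return True if a TCP connection to the given host/port succeeds.

    Assumption: 19999 is the Netdata default web port; change it if the
    agent is configured to listen elsewhere.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("web UI reachable:", web_ui_reachable())
```

A successful connect only shows that something is listening on the port. It does not prove the dashboard renders, but a failure while systemd says "active (running)" matches the half-stopped state shown above.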

I have no idea why this is happening. Does anyone have an idea, or advice on where to investigate?

Thanks a lot.

@DeWaRs1206 this is probably a similar issue to the one reported in this post: "Netdata service restarting randomly in one of the server" (#7 by hugo).

The suggestion is to open a bug report that includes these log details.
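
For the bug report, it may help to attach untruncated lines: the entries ending in `>` above were cut off by the pager, and `journalctl -u netdata --no-pager` should print them in full. As a rough sketch, the script below pulls out the lines leading up to each shutdown from a saved journal dump. The marker string is taken from the log above; the function name, the context size of 20 lines, and the file name in the usage comment are illustrative assumptions.

```python
import sys
from collections import deque

# Marker copied from the log in this thread; it opens the shutdown sequence.
SHUTDOWN_MARKER = "NETDATA SHUTDOWN: initializing shutdown"


def lines_before_shutdown(log_lines, context=20):
    """Return one excerpt per shutdown: the marker line plus up to
    `context` lines that preceded it."""
    recent = deque(maxlen=context + 1)  # sliding window over the log
    excerpts = []
    for line in log_lines:
        recent.append(line.rstrip("\n"))
        if SHUTDOWN_MARKER in line:
            excerpts.append(list(recent))
    return excerpts


if __name__ == "__main__":
    # Usage (hypothetical file name): python excerpt.py netdata-journal.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for excerpt in lines_before_shutdown(fh):
                print("\n".join(excerpt))
                print("-" * 40)
```

Run against the log in this thread, each excerpt would end at the "initializing shutdown with code 1" line and include the DBENGINE read errors and the "refcount is 0" backtrace that precede it.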