Netdata restarting randomly

Hello,

My Netdata master keeps restarting at seemingly random times.

In the syslog, I can see:

Jan 11 20:41:55 ns3117200 systemd[1]: apt-news.service: Deactivated successfully.
Jan 11 20:41:55 ns3117200 systemd[1]: Finished Update APT News.
Jan 11 20:41:55 ns3117200 named[1035]: address not available resolving 'esm.ubuntu.com/A/IN': 2620:2d:4000:1::44#53
Jan 11 20:41:56 ns3117200 systemd[1]: esm-cache.service: Deactivated successfully.
Jan 11 20:41:56 ns3117200 systemd[1]: Finished Update the local ESM caches.
Jan 11 20:41:57 ns3117200 systemd[1]: Stopping Real time performance monitoring...
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:38::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:38::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:34::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:34::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:36::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:36::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:32::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:32::a#53
Jan 11 20:42:13 ns3117200 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Jan 11 20:42:25 ns3117200 systemd[1]: netdata.service: Deactivated successfully.
Jan 11 20:42:25 ns3117200 systemd[1]: Stopped Real time performance monitoring.
Jan 11 20:42:25 ns3117200 systemd[1]: netdata.service: Consumed 2h 51min 22.004s CPU time.
Jan 11 20:42:25 ns3117200 systemd[1]: Started Real time performance monitoring.

In journalctl logs:

Jan 11 20:41:57 ns3117200 netdata[1969387]: SIGNAL: Received SIGTERM. Cleaning up to exit...
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command server.
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="received terminated signal (15). Terminating..." plugin=go.d component=agent
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="discovery manager"
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=logind job=logind
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=web_log job=nginx
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=httpcheck job=httpcheck
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=test_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=qa_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=devops_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_registry_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=sonarqube_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=demo_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="discovery file"
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="functions manager"
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command event loop.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command loop complete.
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969790 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="filestatus manager"
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969775 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Command server has stopped.
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: initializing shutdown with code 0...
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 2021784 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Cannot waitid() for pid 2023262
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: next: create shutdown file
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in       0 ms, create shutdown file - next: dbengine exit mode
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in       0 ms, dbengine exit mode - next: close webrtc connections
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in       0 ms, close webrtc connections - next: disable maintenance, new queries, new web requests, new stream>
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in       0 ms, disable maintenance, new queries, new web requests, new streaming connections and aclk - next: >
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969792 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969804 killed by SIGTERM

My netdata.conf:

[global]
  hostname = NETDATA-MASTER
  enabled = yes
[db]
  mode = dbengine
  update every = 1
  storage tiers = 3
  dbengine multihost disk space MB = 1100
  dbengine tier 1 multihost disk space MB = 330
  dbengine tier 2 multihost disk space MB = 67
  cache directory = /home/netdata
[web]
  bind to = *
  web files owner = root
  web files group = netdata
[directories]
  cache = /home/netdata
  home = /home/netdata

This server is a streaming master with around 25 servers streaming to it.
The version is:

root@ns3117200:~# netdata -v
netdata v1.44.1

I don’t really know where to start investigating or what the issue could be based on the logs.

Any ideas?

Regards

Hello @DeWaRs1206 ,

How often exactly does your Netdata restart?
I noticed the following message:

SIGNAL: Received SIGTERM. Cleaning up to exit...

Do you have a cron job that sends SIGTERM to it?

Best regards!
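To tell a requested stop apart from a crash, the systemd lines in the log can be scanned mechanically. Below is a minimal Python sketch, not an official Netdata tool. It assumes syslog-style lines like the ones posted in this thread, and the function name `classify_stops` and the verdict strings are made up for illustration. It treats a "Stopping <description>..." line as evidence that systemd was asked to stop the service, then reports how the service actually exited.

```python
import re

# Patterns for the systemd lines that appear in the logs in this thread.
STOPPING = re.compile(r"systemd\[1\]: Stopping (?P<desc>.+)\.\.\.$")
CLEAN = re.compile(r"systemd\[1\]: (?P<unit>\S+\.service): Deactivated successfully\.$")
KILLED = re.compile(
    r"systemd\[1\]: (?P<unit>\S+\.service): Main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\S+)$"
)


def classify_stops(lines, unit="netdata.service",
                   description="Real time performance monitoring"):
    """Return (timestamp, verdict) for every time `unit` stopped.

    `description` is the unit description systemd prints in its
    "Stopping ..." line; the default is the one shown for netdata above.
    """
    results = []
    stop_requested = False
    for raw in lines:
        line = raw.rstrip()
        timestamp = line[:15]  # syslog-style prefix, e.g. "Jan 11 20:42:25"

        m = STOPPING.search(line)
        if m and m.group("desc") == description:
            stop_requested = True
            continue

        m = KILLED.search(line) or CLEAN.search(line)
        if not m or m.group("unit") != unit:
            continue

        how = "stop requested via systemd" if stop_requested else "no stop request seen"
        if m.re is KILLED:
            outcome = f"process {m.group('code')}, status {m.group('status')}"
        else:
            outcome = "clean exit"
        results.append((timestamp, f"{how}; {outcome}"))
        stop_requested = False  # the next stop needs its own request line
    return results
```

For example, `classify_stops(open("/var/log/syslog"))` would list each netdata stop (the path is an assumption; use whichever file holds your syslog). A "stop requested via systemd" verdict suggests that something on the host (a cron job, a package upgrade, a configuration management run) asked systemd to stop or restart the service, rather than Netdata dying on its own.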

Hello @Thiago_Marques_0

Thanks for your answer. I don’t have any cron job that would restart Netdata, and the frequency is fairly random. It restarts multiple times a day.

Looking at the journalctl logs, I found:

Jan 12 17:31:47 ns3117200 systemd[1]: Stopping A high performance web server and a reverse proxy server...
Jan 12 17:31:47 ns3117200 systemd[1]: nginx.service: Deactivated successfully.
Jan 12 17:31:47 ns3117200 systemd[1]: Stopped A high performance web server and a reverse proxy server.
Jan 12 17:31:47 ns3117200 systemd[1]: Starting A high performance web server and a reverse proxy server...
Jan 12 17:31:47 ns3117200 systemd[1]: Started A high performance web server and a reverse proxy server.
Jan 12 17:31:47 ns3117200 systemd[1]: Stopping Real time performance monitoring...
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:38::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:38::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:32::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:32::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:34::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:34::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:36::a#53
Jan 12 17:31:47 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:36::a#53
Jan 12 17:31:51 ns3117200 kernel: netdata[2132668]: segfault at 7f428c483910 ip 00007f42b7504bdd sp 00007ffe7603ff70 error 4 in libc.so.6[7f42b749a000+195000]
Jan 12 17:31:51 ns3117200 kernel: Code: 08 5b 5d c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 55 48 81 ec a0 00 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 98 00 00 00 <8b> 8>
Jan 12 17:31:51 ns3117200 systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Jan 12 17:31:51 ns3117200 systemd[1]: netdata.service: Failed with result 'signal'.
Jan 12 17:31:51 ns3117200 systemd[1]: Stopped Real time performance monitoring.
Jan 12 17:31:51 ns3117200 systemd[1]: netdata.service: Consumed 2h 43min 37.957s CPU time.
Jan 12 17:31:51 ns3117200 systemd[1]: Started Real time performance monitoring.

It looks like Netdata segfaults during the restart, but I guess that is not the root issue.
It is odd that NGINX restarts at the same time as well …

Well … I feel stupid …

The server runs chef-client on a regular basis (for a reason unknown to me) …
When the configuration changes (new nodes are detected, so stream.conf is updated automatically), the service is restarted …

This is totally on my side, sorry for this stupid post …

Have a great weekend.
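For anyone landing here with the same symptom: the clue in the Jan 12 log was that NGINX and Netdata were both stopped in the same second, which points at something external restarting several services at once. Below is a rough Python sketch of that check, assuming syslog-style lines like the ones above. The function name `units_stopped_together` is hypothetical, and grouping by the exact same second is a simplification.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   "Jan 12 17:31:47 ns3117200 systemd[1]: Stopping Real time performance monitoring..."
STOPPING = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ systemd\[1\]: "
    r"Stopping (?P<desc>.+)\.\.\.$"
)


def units_stopped_together(lines):
    """Group systemd 'Stopping ...' lines by timestamp.

    Returns {timestamp: [unit descriptions]} for every second in which
    more than one service was being stopped.
    """
    by_second = defaultdict(list)
    for raw in lines:
        m = STOPPING.match(raw.rstrip())
        if m:
            by_second[m.group("ts")].append(m.group("desc"))
    return {ts: descs for ts, descs in by_second.items() if len(descs) > 1}
```

If unrelated services keep showing up in the same group, it is worth checking whatever runs shortly before those timestamps, such as a configuration management agent, unattended upgrades, or a deployment script.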

Hello @DeWaRs1206 ,

First of all, I never considered this a “stupid post”.
I was looking into your last answer, and I am glad that you were able to fix it.
Please let us know whenever you have any doubts; we are here to help you and to improve Netdata.

Best regards,

Thiago

I am facing a similar issue. My Netdata service restarts randomly, but no error logs are generated. As a result, I am getting unreachable alerts even though the server is online.
Can someone help with this? I am not sure what to do at this point.