Hello,
I have an issue with my Netdata master: it keeps restarting at seemingly random times.
In the syslog, I can see:
Jan 11 20:41:55 ns3117200 systemd[1]: apt-news.service: Deactivated successfully.
Jan 11 20:41:55 ns3117200 systemd[1]: Finished Update APT News.
Jan 11 20:41:55 ns3117200 named[1035]: address not available resolving 'esm.ubuntu.com/A/IN': 2620:2d:4000:1::44#53
Jan 11 20:41:56 ns3117200 systemd[1]: esm-cache.service: Deactivated successfully.
Jan 11 20:41:56 ns3117200 systemd[1]: Finished Update the local ESM caches.
Jan 11 20:41:57 ns3117200 systemd[1]: Stopping Real time performance monitoring...
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:38::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:38::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:34::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:34::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:36::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:36::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/A/IN': 2001:4860:4802:32::a#53
Jan 11 20:41:57 ns3117200 named[1035]: address not available resolving 'us-east1-netdata-analytics-bi.cloudfunctions.net/AAAA/IN': 2001:4860:4802:32::a#53
Jan 11 20:42:13 ns3117200 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Jan 11 20:42:25 ns3117200 systemd[1]: netdata.service: Deactivated successfully.
Jan 11 20:42:25 ns3117200 systemd[1]: Stopped Real time performance monitoring.
Jan 11 20:42:25 ns3117200 systemd[1]: netdata.service: Consumed 2h 51min 22.004s CPU time.
Jan 11 20:42:25 ns3117200 systemd[1]: Started Real time performance monitoring.
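To check whether the same units always finish just before the restart, the syslog could be scanned with something like the following. This is only a minimal sketch: the function names, the 5-second window, and the year (syslog timestamps carry none, so 2024 is assumed) are my own choices, and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Message systemd logs when it begins stopping netdata.service (taken from the excerpt above).
STOP_MARKER = "Stopping Real time performance monitoring"

# "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) ([^:\[]+)(?:\[(\d+)\])?: (.*)$"
)

SAMPLE = [
    "Jan 11 20:41:55 ns3117200 systemd[1]: apt-news.service: Deactivated successfully.",
    "Jan 11 20:41:55 ns3117200 systemd[1]: Finished Update APT News.",
    "Jan 11 20:41:56 ns3117200 systemd[1]: esm-cache.service: Deactivated successfully.",
    "Jan 11 20:41:56 ns3117200 systemd[1]: Finished Update the local ESM caches.",
    "Jan 11 20:41:57 ns3117200 systemd[1]: Stopping Real time performance monitoring...",
    "Jan 11 20:42:25 ns3117200 systemd[1]: Started Real time performance monitoring.",
]


def parse(line, year=2024):
    """Return (timestamp, process, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    # The year is an assumption: classic syslog timestamps do not include it.
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(3), m.group(5)


def units_before_stop(lines, window_seconds=5, year=2024):
    """For each netdata stop, collect systemd messages logged in the window before it."""
    events = [e for e in (parse(l, year) for l in lines) if e]
    results = []
    for ts, proc, msg in events:
        if proc == "systemd" and STOP_MARKER in msg:
            start = ts - timedelta(seconds=window_seconds)
            preceding = [
                m for t, p, m in events if p == "systemd" and start <= t < ts
            ]
            results.append((ts, preceding))
    return results


for stop_time, preceding in units_before_stop(SAMPLE):
    print(stop_time)
    for message in preceding:
        print("   ", message)
```

Running it over the full /var/log/syslog (read with open(...) instead of SAMPLE) would show whether every restart follows the same units, which might narrow down what triggers the stop.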
In the journalctl logs:
Jan 11 20:41:57 ns3117200 netdata[1969387]: SIGNAL: Received SIGTERM. Cleaning up to exit...
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command server.
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="received terminated signal (15). Terminating..." plugin=go.d component=agent
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="discovery manager"
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=logind job=logind
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=web_log job=nginx
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=httpcheck job=httpcheck
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=test_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=qa_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=devops_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=gitlab_registry_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=sonarqube_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg=stopped plugin=go.d collector=x509check job=demo_anonymised
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="discovery file"
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="functions manager"
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command event loop.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command loop complete.
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969790 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969829]: level=info msg="instance is stopped" plugin=go.d component="filestatus manager"
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969775 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Command server has stopped.
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: initializing shutdown with code 0...
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 2021784 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: Cannot waitid() for pid 2023262
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: next: create shutdown file
Jan 11 20:41:57 ns3117200 netdata[1969387]: PARSER: read failed: POLLHUP.
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in 0 ms, create shutdown file - next: dbengine exit mode
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in 0 ms, dbengine exit mode - next: close webrtc connections
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in 0 ms, close webrtc connections - next: disable maintenance, new queries, new web requests, new stream>
Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: in 0 ms, disable maintenance, new queries, new web requests, new streaming connections and aclk - next: >
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969792 killed by SIGTERM
Jan 11 20:41:57 ns3117200 netdata[1969387]: child pid 1969804 killed by SIGTERM
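To tell a requested stop apart from a crash, the journal lines can be reduced to the signal received and the shutdown code. This is a minimal sketch: the function name and regular expressions are mine, written only against the two message formats visible in the excerpt above. My reading, which is an assumption, is that SIGTERM followed by code 0 points to something asking the service to stop rather than Netdata crashing.

```python
import re

# Patterns written against the message formats in the journal excerpt above.
SIGNAL_RE = re.compile(r"netdata\[(\d+)\]: SIGNAL: Received (SIG\w+)")
EXIT_RE = re.compile(
    r"netdata\[(\d+)\]: NETDATA SHUTDOWN: initializing shutdown with code (\d+)"
)

SAMPLE = [
    "Jan 11 20:41:57 ns3117200 netdata[1969387]: SIGNAL: Received SIGTERM. Cleaning up to exit...",
    "Jan 11 20:41:57 ns3117200 netdata[1969387]: Shutting down command server.",
    "Jan 11 20:41:57 ns3117200 netdata[1969387]: NETDATA SHUTDOWN: initializing shutdown with code 0...",
]


def summarize_shutdown(lines):
    """Extract the daemon PID, the signal it received, and the shutdown code."""
    summary = {"pid": None, "signal": None, "exit_code": None}
    for line in lines:
        m = SIGNAL_RE.search(line)
        if m:
            summary["pid"] = int(m.group(1))
            summary["signal"] = m.group(2)
            continue
        m = EXIT_RE.search(line)
        if m:
            summary["exit_code"] = int(m.group(2))
    return summary


print(summarize_shutdown(SAMPLE))
```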
My netdata.conf:
[global]
hostname = NETDATA-MASTER
enabled = yes
[db]
mode = dbengine
update every = 1
storage tiers = 3
dbengine multihost disk space MB = 1100
dbengine tier 1 multihost disk space MB = 330
dbengine tier 2 multihost disk space MB = 67
cache directory = /home/netdata
[web]
bind to = *
web files owner = root
web files group = netdata
[directories]
cache = /home/netdata
home = /home/netdata
This server is the streaming master, with around 25 servers streaming to it.
The version is:
root@ns3117200:~# netdata -v
netdata v1.44.1
I don’t really know where to start investigating, or what the issue could be based on these logs.
Any ideas?
Regards