Netdata agent restarting randomly

Environment

  • Ubuntu 20.04 VPS
  • Only has netdata stable installed with minimal config changes

Problem/Question

I’m evaluating the netdata agent and cloud as a potential solution for monitoring a few servers. However, I am receiving email alerts saying “X is unreachable” and then, shortly after, “X is reachable”. This happens ~2-3 times during a 24 hr period.

I can observe this outage via netdata cloud.

Digging a bit into the issue, it appears to be caused by the agent restarting. I’m not particularly familiar with netdata and its logs, and am hoping for some guidance to understand whether this is caused by user error or is indeed a bug. (I’m happy to open an issue on GitHub if required.)

I originally saw this issue when using the nightly branch (installed via the kickstart script), but assumed the restarts were due to auto-updates, which is why I switched to the stable branch. Yet I continue to see these odd agent restarts.

The logs I’m sharing are from a new Ubuntu 20.04 VPS with only netdata installed on it.

What I expected to happen

I don’t expect the netdata agent to “randomly” restart. I’m happy to provide any additional logs.

Logs & config files

/etc/netdata/.environment

# Created by installer
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
CFLAGS="-O2"
LDFLAGS=""
NETDATA_TMPDIR="/tmp"
NETDATA_PREFIX=""
NETDATA_CONFIGURE_OPTIONS=" --with-bundled-lws=externaldeps/libwebsockets"
NETDATA_ADDED_TO_GROUPS=" adm proxy"
INSTALL_UID="0"
NETDATA_GROUP="netdata"
REINSTALL_OPTIONS="--auto-update --disable-telemetry --stable-channel "
RELEASE_CHANNEL="stable"
IS_NETDATA_STATIC_BINARY="no"
NETDATA_LIB_DIR="/var/lib/netdata"

journalctl -u netdata.service

Feb 19 07:09:04 netdata-test ebpf.plugin[84212]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Feb 19 07:09:04 netdata-test ebpf.plugin[84212]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Will try '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 24 07:12:36 netdata-test systemd[1]: /lib/systemd/system/netdata.service:14: PIDFile= references a path below legacy directory /var/run/, updating /var/run/netdata/netdata.pid → /run/netdata/netdata.>
Feb 24 07:12:36 netdata-test systemd[1]: /lib/systemd/system/netdata.service:14: PIDFile= references a path below legacy directory /var/run/, updating /var/run/netdata/netdata.pid → /run/netdata/netdata.>
Feb 24 07:12:36 netdata-test systemd[1]: Stopping Real time performance monitoring...
Feb 24 07:12:38 netdata-test systemd[1]: netdata.service: Succeeded.
Feb 24 07:12:38 netdata-test systemd[1]: Stopped Real time performance monitoring.
Feb 24 07:12:43 netdata-test systemd[1]: Starting Real time performance monitoring...
Feb 24 07:12:43 netdata-test systemd[1]: Started Real time performance monitoring.
Feb 24 07:12:43 netdata-test netdata[135193]: SIGNAL: Not enabling reaper
Feb 24 07:12:43 netdata-test netdata[135193]: 2021-02-24 07:12:43: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
Feb 24 07:12:43 netdata-test systemd[1]: Stopping Real time performance monitoring...
Feb 24 07:12:43 netdata-test ebpf.plugin[135278]: Does not have a configuration file inside `/etc/netdata/ebpf.conf. It will try to load stock file.
Feb 24 07:12:43 netdata-test ebpf.plugin[135278]: Name resolution is disabled, collector will not parser "hostnames" list.
Feb 24 07:12:43 netdata-test ebpf.plugin[135278]: The network value of CIDR 127.0.0.1/8 was updated for 127.0.0.0 .
Feb 24 07:12:43 netdata-test ebpf.plugin[135278]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Feb 24 07:12:43 netdata-test ebpf.plugin[135278]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Will try '/usr/lib/netdata/conf.d/apps_groups.conf'
Feb 24 07:12:51 netdata-test systemd[1]: netdata.service: Succeeded.
Feb 24 07:12:51 netdata-test systemd[1]: Stopped Real time performance monitoring.
Feb 24 07:12:56 netdata-test systemd[1]: /lib/systemd/system/netdata.service:14: PIDFile= references a path below legacy directory /var/run/, updating /var/run/netdata/netdata.pid → /run/netdata/netdata.pid; please update the unit file accordingly.
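
To see how often systemd itself stopped and started the unit, the journal output above can be scanned for lifecycle messages. The following is a minimal Python sketch, not part of netdata. The function name `unit_events` is hypothetical, and the pattern assumes journalctl's default short output format and the unit description "Real time performance monitoring" seen in this log.

```python
import re

# Matches the syslog-style prefix journalctl prints by default, e.g.
# "Feb 24 07:12:36 netdata-test systemd[1]: Stopping Real time ..."
# Only messages emitted by systemd (PID 1) are considered.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+systemd\[1\]:\s+(?P<msg>.*)$"
)

ACTIONS = ("Stopping", "Stopped", "Starting", "Started")


def unit_events(lines, description="Real time performance monitoring"):
    """Return (timestamp, action) pairs for the unit's lifecycle messages.

    `description` is the unit's Description= text as it appears in the
    journal; the default is taken from the log excerpt in this thread.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        msg = match.group("msg")
        for action in ACTIONS:
            if msg.startswith(f"{action} {description}"):
                events.append((match.group("ts"), action))
                break
    return events
```

Feeding it the saved output of `journalctl -u netdata.service` (for example via `open("journal.txt")`, a hypothetical file name) gives a compact timeline of stops and starts to compare against the times of the "unreachable" alerts.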

/var/log/netdata/error.log

2021-02-24 07:07:57: netdata INFO  : PLUGINSD[apps] : RRDSET: chart name 'apps.pipes' on host 'netdata-test' already exists.
2021-02-24 07:12:36: netdata INFO  : MAIN : SIGNAL: Received SIGTERM. Cleaning up to exit...
2021-02-24 07:12:36: netdata INFO  : MAIN : Shutting down command server.
2021-02-24 07:12:36: netdata ERROR : PLUGINSD[apps] : read failed: end of file
2021-02-24 07:12:36: netdata INFO  : PLUGINSD[apps] : PARSER ended
2021-02-24 07:12:36: netdata ERROR : PLUGINSD[apps] : '/usr/libexec/netdata/plugins.d/apps.plugin' (pid 84216) disconnected after 5791614 successful data collections (ENDs).
2021-02-24 07:12:36: netdata ERROR : PLUGINSD[apps] : child pid 84216 killed by signal 15.
2021-02-24 07:12:36: netdata INFO  : PLUGINSD[apps] : '/usr/libexec/netdata/plugins.d/apps.plugin' (pid 84216) was killed with SIGTERM. Disabling it.
2021-02-24 07:12:36: netdata ERROR : PLUGIN[tc] : child pid 122172 killed by signal 15.
2021-02-24 07:12:36: netdata INFO  : MAIN : Shutting down command event loop.
2021-02-24 07:12:36: netdata INFO  : MAIN : Shutting down command loop complete.
2021-02-24 07:12:36: netdata ERROR : PLUGINSD[go.d] : read failed: end of file (errno 9, Bad file descriptor)
2021-02-24 07:12:36: netdata INFO  : PLUGINSD[go.d] : PARSER ended
2021-02-24 07:12:36: netdata ERROR : PLUGINSD[go.d] : '/usr/libexec/netdata/plugins.d/go.d.plugin' (pid 84213) disconnected after 0 successful data collections (ENDs).
2021-02-24 07:12:36: netdata INFO  : PLUGINSD[go.d] : '/usr/libexec/netdata/plugins.d/go.d.plugin' (pid 84213) does not generate useful output but it reports success (exits with 0). Waiting a bit before starting it again..
2021-02-24 07:12:36: netdata INFO  : MAIN : Command server has stopped.
2021-02-24 07:12:36: netdata INFO  : PLUGINSD[apps] : thread with task id 84204 finished
2021-02-24 07:12:36: netdata INFO  : MAIN : EXIT: netdata prepares to exit with code 0...
2021-02-24 07:12:36: netdata INFO  : MAIN : EXIT: cleaning up the database...
2021-02-24 07:12:36: netdata INFO  : MAIN : Cleaning up database [1 hosts(s)]...
2021-02-24 07:12:36: netdata INFO  : MAIN : Cleaning up database of host 'netdata-test'...
2021-02-24 07:12:36: netdata INFO  : MAIN : EXIT: stopping static threads...
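
The excerpt above shows the agent receiving SIGTERM and preparing to exit with code 0, which looks like a requested shutdown rather than a crash in that particular instance. A minimal Python sketch for pulling those two facts out of error.log follows. `shutdown_summary` is a hypothetical helper, and the message patterns are assumed from this excerpt only, so other netdata versions may word them differently.

```python
import re

# Patterns copied from the error.log lines quoted in this thread.
SIGNAL_RE = re.compile(r"SIGNAL: Received (SIG[A-Z0-9]+)")
EXIT_RE = re.compile(r"EXIT: netdata prepares to exit with code (\d+)")


def shutdown_summary(lines):
    """Return (signal_name, exit_code) for the last shutdown found.

    Either element is None when the corresponding message is absent,
    which would be expected if the process died without cleaning up.
    """
    signal_name, exit_code = None, None
    for line in lines:
        match = SIGNAL_RE.search(line)
        if match:
            signal_name = match.group(1)
        match = EXIT_RE.search(line)
        if match:
            exit_code = int(match.group(1))
    return signal_name, exit_code
```

A result such as `("SIGTERM", 0)` suggests something asked the agent to stop, while `(None, None)` around the time of an outage would be more consistent with an abrupt crash.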

/etc/netdata/netdata.conf (only changes made to default config)

[global]
        update every = 5
...
[system.entropy]
        enabled = no
...

Hi @jonathank

What is your netdata version?

[ilyam@pc ~]$ /opt/netdata/usr/sbin/netdata -v
netdata v1.29.3-8-gb3edd322

Current stable is v1.29.3; it was released yesterday. There is a bug in v1.29.2: the Netdata Agent crashes when cleaning up stale container metrics.

Thanks for getting back to me so quickly, @ilyam8.

I appear to be running the latest.

root@netdata-test:~# netdata -v
netdata v1.29.3

Given that this release was applied automatically yesterday, I can give it a few days to see if the issue persists and follow up here.

If there is another problem, we need steps to reproduce it in order to debug it. Usually those are not clear.

In that case, it would be nice to get a trace. That is possible if we compile/install netdata with the flags below.

(This needs the libasan matching your gcc; on Ubuntu 18.04 it’s libasan4.)

sudo CFLAGS="-O0 -ggdb -fsanitize=address -fno-omit-frame-pointer -fstack-protector-all" ./netdata-installer.sh

After the crash, you can find something like this in the error.log:

Crash: Main process exited, code=killed, status=11/SEGV · Issue #10656 · netdata/netdata · GitHub
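
With a build that uses -fsanitize=address, AddressSanitizer reports begin with a line of the form `==<pid>==ERROR: AddressSanitizer: <error type> ...`. Below is a minimal Python sketch for finding such lines in a log. `asan_reports` is a hypothetical helper name, and whether the report actually lands in error.log depends on how netdata's stderr is redirected on your system.

```python
import re

# First line of an AddressSanitizer report, e.g.
# "==12345==ERROR: AddressSanitizer: heap-use-after-free on address ..."
ASAN_RE = re.compile(r"^==(\d+)==ERROR: AddressSanitizer: (\S+)")


def asan_reports(lines):
    """Return (pid, error_type) for each AddressSanitizer report header."""
    reports = []
    for line in lines:
        match = ASAN_RE.match(line)
        if match:
            reports.append((int(match.group(1)), match.group(2)))
    return reports
```

The lines following each header contain the stack trace, which is the part worth attaching to a GitHub issue.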

@ilyam8 it would appear that v1.29.3 has resolved the agent restart issue I was running into. Thank you for your support.