Netdata interface is not reachable after some time

Bug report summary

This issue has been occurring for some months now. Before that, Netdata was running flawlessly, sending notifications via Slack when needed. It also helped me discover issues with some external software that was running on my machine.

What I experience is that after some random time (usually hours), the interface is no longer reachable when I try to connect to it.

This is the output of sudo systemctl status netdata:

nov 25 08:37:11 Praesidium systemd[1]: Starting Real time performance monitoring...
nov 25 08:37:11 Praesidium systemd[1]: Started Real time performance monitoring.
nov 25 08:37:11 Praesidium systemd[1]: Stopping Real time performance monitoring...
nov 25 08:37:11 Praesidium systemd[1]: netdata.service: Succeeded.
nov 25 08:37:11 Praesidium systemd[1]: Stopped Real time performance monitoring.
nov 25 08:37:16 Praesidium systemd[1]: /lib/systemd/system/netdata.service:14: PIDFile= references a path below legacy directory /var/run/, updating /var/run/netdata/netdata.pid → /run/netdata/netdata.pid; please update the unit file accordingly.

These are the last lines of the error.log file under /var/log/netdata, filtered to ERROR entries only:

2020-11-25 08:37:04: netdata ERROR : PLUGINSD[go.d] : read failed: end of file (errno 9, Bad file descriptor)
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[go.d] : '/usr/libexec/netdata/plugins.d/go.d.plugin' (pid 3230073) disconnected after 0 successful data collections (ENDs).
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[charts.d] : read failed: end of file (errno 9, Bad file descriptor)
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[charts.d] : '/usr/libexec/netdata/plugins.d/charts.d.plugin' (pid 3230058) disconnected after 6670 successful data collections (ENDs).
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[apps] : read failed: end of file (errno 9, Bad file descriptor)
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[apps] : '/usr/libexec/netdata/plugins.d/apps.plugin' (pid 3230071) disconnected after 446488 successful data collections (ENDs).
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[apps] : child pid 3230071 killed by signal 15.
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[python.d] : read failed: end of file (errno 9, Bad file descriptor)
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[python.d] : '/usr/libexec/netdata/plugins.d/python.d.plugin' (pid 3230069) disconnected after 46664 successful data collections (ENDs).
2020-11-25 08:37:04: netdata ERROR : PLUGINSD[python.d] : child pid 3230069 killed by signal 15.
2020-11-25 08:37:16: netdata ERROR : MAIN : Health configuration cannot read file '/usr/lib/netdata/conf.d/health.d/sslcheck.conf'. (errno 13, Permission denied)
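
For reference, a filter along these lines could produce a per-source summary of the ERROR entries above. It is only a minimal sketch, assuming the log line format shown in this post; `summarize_errors` is a hypothetical helper name, not something shipped with Netdata.

```shell
#!/bin/sh
# Count ERROR lines per source (e.g. PLUGINSD[go.d], MAIN) in a Netdata error log.
# Reads the file(s) given as arguments, or stdin when none are given.
# Assumed line format (taken from the excerpt above):
#   2020-11-25 08:37:04: netdata ERROR : PLUGINSD[go.d] : message
summarize_errors() {
    # Keep only ERROR lines, split on " : " so the source is field 2,
    # then print "count source" sorted by count (descending) and name.
    grep ' netdata ERROR : ' "$@" |
        awk -F ' : ' '{ count[$2]++ } END { for (s in count) print count[s], s }' |
        sort -k1,1nr -k2
}

# Example (path as mentioned in this post):
#   summarize_errors /var/log/netdata/error.log
```

A summary like this makes it easier to see whether all plugins fail at the same moment or only some of them.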
OS / Environment
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.10
/etc/lsb-release:DISTRIB_CODENAME=groovy
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.10"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.10 (Groovy Gorilla)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.10"
/etc/os-release:VERSION_ID="20.10"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=groovy
/etc/os-release:UBUNTU_CODENAME=groovy

Netdata version

netdata v1.26.0-247-nightly

Component Name

Netdata Agent

Steps To Reproduce
  1. Start Netdata
  2. Check that it is running from the interface
  3. Wait a random amount of time
  4. The interface is not reachable
Expected behavior

The Netdata interface should remain reachable for as long as the service is running.
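
To narrow down when the interface stops responding, a small probe could log a timestamped up/down line. This is a rough sketch, assuming the agent listens on the default port 19999 and answers at `/api/v1/info`; `probe_dashboard` and the `PROBE_CMD` override are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Print one timestamped line saying whether the dashboard answered.
# $1 is the URL to probe; the default assumes a local agent on port 19999.
# PROBE_CMD can replace curl (useful for testing); it receives the URL as its last argument.
probe_dashboard() {
    url=${1:-http://localhost:19999/api/v1/info}
    probe=${PROBE_CMD:-curl -fsS --max-time 5 -o /dev/null}
    # $probe is intentionally unquoted so the default splits into command and flags.
    if $probe "$url" 2>/dev/null; then
        state=up
    else
        state=down
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $url $state"
    [ "$state" = up ]
}

# Example: probe once a minute and keep a log.
#   while sleep 60; do probe_dashboard >> "$HOME/netdata-probe.log"; done
```

The first "down" line in such a log gives a time to compare against the systemd journal and error.log.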

Hey @meme1337, and welcome to our community!

Sorry for taking so long to respond, this topic slipped through the cracks.

It is weird that this problem popped up out of the blue. Our engineers will reply shortly and look into what happened!

Thank you for bearing with us :muscle:

I’ve opened an issue for the warning from systemd: “Systemd unit file should not prefix PID file path with our local state directory when installed to `/`.” (Issue #10304, netdata/netdata on GitHub). It’s incorrect behavior on our part, but it is entirely orthogonal to the issue you seem to be experiencing.

As for the issue with the interface becoming unavailable, the errors in your error log seem to indicate something wrong with your system as a whole. The various plugins should not be seeing errors involving bad file descriptors. As a general rule, such errors are a pretty good indication that either the code is doing something seriously stupid (which we aren’t here) or something is wrong at a system level. So I’m inclined to believe that you may have some at-rest data corruption somewhere, or possibly be seeing fallout from some other hardware issue. Can you confirm whether you only see this issue on this one system, and whether you also see it with a fresh install? You can re-run the installer with the same options you originally used and add the --reinstall option to get a clean install while preserving your configuration and data.

Thank you both for your reply and feedback!

I tried reinstalling Netdata a week ago, but the behavior is the same :frowning:

Is there something you can suggest I check to find out whether there is data corruption somewhere?

The first thing I would check is that all of your hard drives report healthy via SMART tests. For ATA, SCSI, and NVMe storage devices, this can easily be verified by running smartctl -H /dev/sda, replacing /dev/sda with the appropriate path to each device.
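
If there are several drives, the check can be looped. The following is only a sketch: `check_smart_health` and the `SMARTCTL` override are hypothetical names, and it relies solely on smartctl returning a non-zero exit status when it cannot read a device or finds a problem.

```shell
#!/bin/sh
# Run "smartctl -H" on every device passed as an argument and print one line each.
# SMARTCTL can be overridden, e.g. SMARTCTL="sudo smartctl"; it defaults to smartctl.
check_smart_health() {
    failed=0
    for dev in "$@"; do
        # $SMARTCTL is intentionally unquoted so "sudo smartctl" splits into two words.
        if ${SMARTCTL:-smartctl} -H "$dev" >/dev/null 2>&1; then
            echo "$dev: OK"
        else
            # $? here still holds the exit status of the smartctl call above.
            echo "$dev: CHECK (smartctl exit status $?)"
            failed=1
        fi
    done
    return "$failed"
}

# Example (device paths are placeholders; adjust to your system):
#   SMARTCTL="sudo smartctl" check_smart_health /dev/sda /dev/sdb
```

For any device flagged CHECK, running smartctl -H on it directly shows the full report.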

If that turns up clean, I would normally suspect hardware problems, but this sounds far too consistent and isolated to be a hardware issue.

The other thing I would suggest trying is running Netdata in the foreground (just running netdata -D will do this) and seeing what happens. I suspect something is wrong in the core agent itself or in one of its dependencies, which is causing it to terminate in a way that systemd decides not to restart it.
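
Since the agent may take hours to stop, it can help to capture the foreground output and the exit status in a file. This is a minimal sketch; `run_and_record` is a hypothetical wrapper name and the log path in the example is arbitrary.

```shell
#!/bin/sh
# Run a command in the foreground, append its output to a log file,
# and record when it started and how it exited.
# Usage: run_and_record LOGFILE COMMAND [ARGS...]
run_and_record() {
    log=$1
    shift
    echo "$(date '+%Y-%m-%d %H:%M:%S') starting: $*" >>"$log"
    status=0
    "$@" >>"$log" 2>&1 || status=$?
    # In sh, a status above 128 usually means the process was killed by
    # signal (status - 128), e.g. 143 for signal 15 (SIGTERM).
    echo "$(date '+%Y-%m-%d %H:%M:%S') exited with status $status" >>"$log"
    return "$status"
}

# Example:
#   run_and_record /tmp/netdata-foreground.log netdata -D
```

The last lines of that log, together with the recorded status, should show whether the agent exited on its own or was terminated by a signal.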

Thank you Austin!

As I was typing my last message I was thinking “well, I could check SMART”, and indeed I found some issues there, likely because this machine is an old Mac mini repurposed as a Linux machine.
It’s weird that only Netdata is showing symptoms, but anyway it’s time to buy a new machine so that I don’t run into problems when it’s too late to fix them.

I just ordered a new PC to make it a proper server, and I will get back to you if I face those issues again in the new environment.

Thank you so much for your time and support!
Netdata once again proved useful!

@meme1337 We are very glad that you found the issue. This is awesome!

Please don’t forget to flag the post that solved your question as the “solution”, and stick around in our little community :v: