We updated around 50 hosts to version 1.33.0 today, all of them running Ubuntu, with versions ranging from 16 to 20. All updates went well except for one.
The installation with kickstart.sh runs through the whole process without any errors until it tries to start the service. Starting the service takes ages and then fails with a timeout. Here is the relevant log from journalctl:
-- Unit netdata.service has begun starting up.
Jan 27 14:49:28 pcweb1 netdata[5368]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Jan 27 14:49:28 pcweb1 netdata[5368]: 2022-01-27 14:49:28: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Jan 27 14:49:28 pcweb1 systemd[1]: netdata.service: Can't open PID file /var/run/netdata.pid (yet?) after start: No such file or directory
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Does not have a configuration file inside `/etc/netdata/ebpf.d.conf. It will try to load stock file.
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Name resolution is disabled, collector will not parser "hostnames" list.
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: The network value of CIDR 127.0.0.1/8 was updated for 127.0.0.0 .
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Will try '/usr/lib/netdata/conf.d/apps_groups.conf'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: PROCFILE: Cannot open file '/proc/4142/status'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Cannot open /proc/4142/status
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Start operation timed out. Terminating.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5370 (netdata) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5372 (netdata) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5627 (bash) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5629 (python3) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5630 (apps.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5642 (ebpf.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5648 (nfacct.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5650 (go.d.plugin) with signal SIGKILL.
Jan 27 14:51:01 pcweb1 systemd[1]: netdata.service: Failed with result 'timeout'.
Jan 27 14:51:01 pcweb1 systemd[1]: Failed to start Linux real time system monitoring, over the web.
Netdata's error log doesn't show anything specific, and I have no idea why this would be timing out.
How can I debug this further, or does anyone have an idea what could be wrong?
Can you please share the output of netdata -W buildinfo from the affected system? I think I know what's wrong, and the info there should help confirm it.
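In the meantime, one generic way to narrow it down, assuming a default install with the binary at /usr/sbin/netdata and the netdata service user, is to stop the service and run the daemon in the foreground so any startup error lands directly on the terminal:

sudo systemctl stop netdata
# -D keeps netdata in the foreground instead of daemonizing
sudo -u netdata /usr/sbin/netdata -D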
I think what's going on here is an issue we've seen intermittently with our binary packages, where systemd does not look for Netdata's PID file at the correct path.
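You can check which PID file path systemd is actually watching, and compare it with what the unit file (plus any drop-ins) declares, using standard systemctl commands:

systemctl cat netdata.service              # effective unit file and any drop-ins
systemctl show -p PIDFile netdata.service  # the exact path systemd polls for the PID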
On further inspection, it looks like this is actually a bug on our end.
Until we get it fixed on our end, you should be able to work around it by using systemctl edit netdata.service to create an override file containing the following:
When I then start the service, it fails with the error message "netdata.service: Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing."
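For context on that message: for services that are not Type=oneshot, systemd allows exactly one ExecStart= line, and drop-in overrides are additive, so an override that sets ExecStart= has to clear the inherited value first with an empty assignment. A minimal sketch of the pattern (the command line here is only an illustration, not the exact override from this thread):

[Service]
# the empty assignment clears the ExecStart= inherited from the original unit
ExecStart=
ExecStart=/usr/sbin/netdata -P /run/netdata/netdata.pid -D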
Thanks @Austin_Hemmelgarn for your support on this. Unfortunately it still doesn't work: with the latest changes to the service config, the override error is gone again, but the timeout issue came back.
I checked and saw that /var/run/netdata/ was missing, so I created it with netdata as the owner. Still, the issue persists.
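Concretely, that was roughly the following (user and group names assume a default install):

sudo mkdir -p /var/run/netdata
sudo chown -R netdata:netdata /var/run/netdata

Note that on systemd-based distributions /var/run is a symlink to /run, which lives on a tmpfs, so a manually created directory there is gone after a reboot; the unit normally has to recreate it on every start (for example via ExecStartPre= or RuntimeDirectory=netdata).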
It's Ubuntu 18.04, but that systemd unit file dates back to 2016, when Netdata was first installed on that host. At the time the host was running Ubuntu 14.
Is it possible that the systemd unit file was created back then and the installer isn't overwriting or updating it? If so, what's your suggestion for cleaning that up?
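For reference, a generic way to check whether an old unit file is shadowing the one the installer ships (the paths below are the usual locations, not confirmed for this host):

systemctl cat netdata.service   # the first comment line shows which unit file is in effect
# if an old copy under /etc/systemd/system is shadowing the packaged one under
# /lib/systemd/system, move it aside and reload:
sudo mv /etc/systemd/system/netdata.service /root/netdata.service.bak
sudo systemctl daemon-reload
sudo systemctl restart netdata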
Please clarify: what exactly needs to be removed, and from where, to get netdata to start?
Version: netdata v1.33.1-165-g0ac6b4000 on Arch Linux, reinstalled with:
wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh --reinstall