Agent 1.33.0 won't start after successful installation

We updated around 50 hosts to version 1.33.0 today, all of them running Ubuntu, with versions ranging from 16 to 20. All updates went well except for one.

The installation with kickstart.sh runs through the whole process without any errors until it tries to start the service. That takes ages and then fails with a timeout. Here is the relevant log from journalctl:

-- Unit netdata.service has begun starting up.
Jan 27 14:49:28 pcweb1 netdata[5368]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Jan 27 14:49:28 pcweb1 netdata[5368]: 2022-01-27 14:49:28: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Jan 27 14:49:28 pcweb1 systemd[1]: netdata.service: Can't open PID file /var/run/netdata.pid (yet?) after start: No such file or directory
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Does not have a configuration file inside `/etc/netdata/ebpf.d.conf. It will try to load stock file.
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Name resolution is disabled, collector will not parser "hostnames" list.
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: The network value of CIDR 127.0.0.1/8 was updated for 127.0.0.0 .
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: PROCFILE: Cannot open file '/etc/netdata/apps_groups.conf'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Will try '/usr/lib/netdata/conf.d/apps_groups.conf'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: PROCFILE: Cannot open file '/proc/4142/status'
Jan 27 14:49:28 pcweb1 ebpf.plugin[5642]: Cannot open /proc/4142/status
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Start operation timed out. Terminating.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5370 (netdata) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5372 (netdata) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5627 (bash) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5629 (python3) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5630 (apps.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5642 (ebpf.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5648 (nfacct.plugin) with signal SIGKILL.
Jan 27 14:50:58 pcweb1 systemd[1]: netdata.service: Killing process 5650 (go.d.plugin) with signal SIGKILL.
Jan 27 14:51:01 pcweb1 systemd[1]: netdata.service: Failed with result 'timeout'.
Jan 27 14:51:01 pcweb1 systemd[1]: Failed to start Linux real time system monitoring, over the web.

The error log of netdata doesn’t tell anything specific and I have no idea why this should be timing out.

How can I debug this further, or does anyone have an idea what could be wrong?

Can you please share the output of netdata -W buildinfo from the affected system? I think I know what's wrong, and the info there should help confirm it.

Version: netdata v1.33.0
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Here you go.

Thanks for the quick reply.

I think what’s going on here is an issue we’ve seen intermittently with our binary packages where systemd does not try to use the correct path when looking for Netdata’s PID file.

On further inspection, it looks like this is actually a bug on our end.

Until we get it fixed on our end, you should be able to work around it by using systemctl edit netdata.service to create an override file containing the following:

[Service]
PIDFile=/var/run/netdata/netdata.pid
ExecStart=/usr/sbin/netdata -P /var/run/netdata/netdata.pid -D

When I then start the service, it fails with the error message netdata.service: Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.

Ah, yes, I forgot that systemd overrides don't actually replace ExecStart=, they add to it, so the inherited value has to be cleared with an empty ExecStart= line first.

[Service]
PIDFile=/var/run/netdata/netdata.pid
ExecStart=
ExecStart=/usr/sbin/netdata -P /var/run/netdata/netdata.pid -D

should do it instead.
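
If you would rather not go through the interactive editor (for example when applying this to several hosts), the same drop-in can be written directly. This is only a sketch: the function name and the optional directory argument are made up for illustration, and the paths are simply the ones from the override above.

```shell
#!/bin/sh
# Sketch: write the drop-in that `systemctl edit netdata.service` would create.
# The function name and the directory argument are hypothetical; the argument
# exists only so the function can be tried against a scratch directory.
write_netdata_override() {
    unit_dir="${1:-/etc/systemd/system}"
    dropin_dir="$unit_dir/netdata.service.d"
    mkdir -p "$dropin_dir" || return 1
    # The empty ExecStart= clears the command inherited from the main unit.
    # Without it systemd sees two ExecStart= settings and refuses the unit.
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
PIDFile=/var/run/netdata/netdata.pid
ExecStart=
ExecStart=/usr/sbin/netdata -P /var/run/netdata/netdata.pid -D
EOF
}
```

After calling write_netdata_override as root with no argument, run systemctl daemon-reload and then systemctl restart netdata so that systemd picks up the new file.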

Thanks @Austin_Hemmelgarn for your support on this. Unfortunately it still doesn't work. With the latest change to the service config, the ExecStart error is gone, but the timeout is back.

I’ve checked and saw that /var/run/netdata/ was missing, so I created it with netdata as the owner. But still, the issue persists.

@jurgenhaas, hi.

Can you show systemctl cat netdata?

Sure, it’s this:

# /etc/systemd/system/netdata.service
[Unit]
Description=Linux real time system monitoring, over the web
After=network.target httpd.service squid.service nfs-server.service mysqld.service named.service postfix.service

[Service]
Type=forking
WorkingDirectory=/tmp
User=root
Group=root
PIDFile=/var/run/netdata.pid
ExecStart=/usr/sbin/netdata -pidfile /var/run/netdata.pid
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/netdata.service.d/override.conf
[Service]
PIDFile=/var/run/netdata/netdata.pid
ExecStart=
ExecStart=/usr/sbin/netdata -P /var/run/netdata/netdata.pid -D

What OS/version is on that host?

On Debian/Ubuntu with systemd, the unit file is supposed to be one of:

Is that your custom systemd unit file? I think the system’s copy is under /lib/systemd/system/

It's Ubuntu 18.04, but that systemd file dates back to 2016, when Netdata was first installed on that host. At the time this was Ubuntu 14.

Is it possible that the systemd file was created back then and the installer isn't overwriting or updating it? If so, what's your suggestion for cleaning that up?

Does /lib/systemd/system/netdata.service exist?

Yes, and that looks more like the one you linked to in the source above.
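
For what it's worth, this is why a leftover file in /etc matters: systemd searches several directories for a unit file and the first match wins, with /etc/systemd/system ahead of /lib/systemd/system. Below is a rough sketch of that lookup. It uses a simplified subset of the real search path, and the function name and the root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report which copy of a unit file takes precedence.
# Only a simplified subset of systemd's search path is checked, in priority
# order. The function name and the optional root prefix are hypothetical;
# the prefix allows a dry run against a scratch tree.
effective_unit() {
    unit="$1"
    root="${2:-}"
    for dir in etc/systemd/system run/systemd/system lib/systemd/system; do
        if [ -f "$root/$dir/$unit" ]; then
            printf '%s\n' "$root/$dir/$unit"
            return 0
        fi
    done
    # No copy found in any of the checked directories.
    return 1
}
```

On a live system, systemctl show -p FragmentPath netdata.service reports the file systemd actually loaded, which is the more reliable check.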

Not an expert, but I hope the following will do:

rm /etc/systemd/system/netdata.service
rm /etc/systemd/system/netdata.service.d/override.conf
systemctl daemon-reload
systemctl restart netdata

That’s it, thanks so much. This is now resolved.

So I should add such cleanup tasks to our Ansible deployment, unless this is something that goes into the update script of netdata?

To be honest, I have no idea how netdata.service ended up in /etc/systemd/system/.

But yeah, you need to remove /etc/systemd/system/netdata.service if /lib/systemd/system/netdata.service exists.
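
For a deployment script, that condition could look roughly like the sketch below. The function name and the root argument are hypothetical, and removing the override.conf as well is an assumption based on the commands that fixed it in this thread.

```shell
#!/bin/sh
# Sketch: remove a stale /etc copy of netdata.service, but only when the
# packaged unit under /lib exists. The function name and the optional root
# prefix are hypothetical; the prefix allows a dry run against a scratch tree.
remove_stale_netdata_unit() {
    root="${1:-}"
    packaged="$root/lib/systemd/system/netdata.service"
    stale="$root/etc/systemd/system/netdata.service"
    dropin="$root/etc/systemd/system/netdata.service.d/override.conf"

    # Without the packaged unit the /etc copy is the only one: leave it alone.
    [ -f "$packaged" ] || return 1
    # Nothing left to clean up.
    if [ ! -e "$stale" ] && [ ! -e "$dropin" ]; then
        return 1
    fi

    rm -f "$stale" "$dropin"
}
```

The function returns 0 only when it removed something, so a caller can run systemctl daemon-reload and systemctl restart netdata just in that case.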

Same problem as per systemctl status netdata:

Process: 69xx ExecStart=/opt/netdata/usr/sbin/netdata -P /opt/netdata/var/run/netdata/netdata.pid -D (code=exited, status=1/FAILURE)

Is this override.conf needed?

# /etc/systemd/system/netdata.service.d/override.conf
[Service]
PIDFile=/var/run/netdata/netdata.pid
ExecStart=
ExecStart=/usr/sbin/netdata -P /var/run/netdata/netdata.pid -D

Please clarify: what needs to be removed, and where, to get netdata to start?

Version: netdata v1.33.1-165-g0ac6b4000 on Arch Linux, reinstalled with:

wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh --reinstall