Agent fails to start after host reboot - running as wrong user?

I recently installed the agent on a host running TrueNAS Scale (Debian-based). Installation was smooth and all data showed up in the cloud. Then I restarted the host and the agent failed to start. systemctl reported an error, and journalctl shows the following:

Jun 11 14:32:38 truenas systemd[5122]: netdata.service: Failed to determine user credentials: No such file or directory
Jun 11 14:32:38 truenas systemd[5122]: netdata.service: Failed at step USER spawning /usr/sbin/netdata: No such file or directory
Jun 11 14:32:38 truenas systemd[1]: netdata.service: Main process exited, code=exited, status=217/USER
Jun 11 14:32:38 truenas systemd[1]: netdata.service: Failed with result 'exit-code'.

I suspect that netdata is running as the wrong user. I installed using the kickstart script directly on the OS, but before that I also tried the Netdata app from the app store on TrueNAS Scale. Could there be a conflict? I uninstalled the app before installing with the kickstart script. journalctl still has references to it, it seems:

Jun 11 14:32:56 truenas k3s[8088]: time="2023-06-11T14:32:56+02:00" level=info msg="Failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod \"svclb-netdata-e82ad63d-sf6vg_kube-system\": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container \"0f4014838a1e5560c900d72ef0c749d3cb58d6ba8e8a3bc08884a2bc10bdbaae\""

Hope you can point me in the right direction.

This is unlikely to be a case of running as the wrong user. More likely the user account simply does not exist, and possibly the binary doesn’t either.

Can you confirm that you actually have a netdata user on your system, and that /usr/sbin/netdata actually exists?
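
A quick way to check both is sketched below. It relies only on the standard `id` command and the shell's `-x` test. The helper names are made up for illustration, while the user name and the path are the ones from your log.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of Netdata.

# Succeeds (exit status 0) if the named account can be resolved.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Succeeds if the path exists and is executable.
binary_exists() {
    [ -x "$1" ]
}

if user_exists netdata; then
    echo "user netdata: present"
else
    echo "user netdata: missing"
fi

if binary_exists /usr/sbin/netdata; then
    echo "/usr/sbin/netdata: present"
else
    echo "/usr/sbin/netdata: missing"
fi
```

A missing user would fit the status=217/USER line in your journal, which is what systemd reports when it cannot resolve the account a unit is supposed to run as.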

/usr/sbin/netdata exists, but the netdata user does not.
Shouldn’t kickstart.sh add the user? It was running fine after the installation, but then I guess it was running as the current user.

Hmm, a reinstall fixed the user issue and systemctl status says it is running, but the claim step to the cloud failed because the node was already claimed. No data is showing up.
I can’t find where I can “unclaim” or delete the node.
The error message suggested deleting /var/lib/netdata/registry/netdata.public.unique.id and trying again, but that only results in more copies of the same node showing up in the cloud, and still no data :(

All good now.
The original problem was probably caused by having both the TrueNAS Scale Netdata app (in a container) and the normal install on the host OS installed at the same time, and then removing the app.
I incorrectly assumed that the TrueNAS Scale app was self-contained.

After banging my head against the wall trying to re-claim the node in the cloud, the solution was to use a new ID with --claim-id "$(uuidgen)" and manually remove the stale ones in the cloud.
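
For anyone who lands here later, a rough sketch of that re-claim step follows. It only prints the command instead of running it. The token and room values are placeholders, the path to a previously downloaded kickstart script is an assumption, and the helper name is made up. The --claim-token and --claim-rooms options are assumed to be the kickstart claiming options, so check the documentation for your version before running anything.

```shell
#!/bin/sh
# Hypothetical helper: produce a fresh UUID for the new claim ID.
# Uses uuidgen when available, otherwise the Linux kernel's UUID source.
new_claim_id() {
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen
    else
        cat /proc/sys/kernel/random/uuid
    fi
}

# Placeholders: replace with values from your own Netdata Cloud space.
CLAIM_TOKEN="YOUR_CLAIM_TOKEN"
CLAIM_ROOMS="YOUR_ROOM_ID"
# Assumed location of a previously downloaded kickstart script.
KICKSTART="/tmp/netdata-kickstart.sh"

CLAIM_ID="$(new_claim_id)"

# Print the command for review rather than executing it.
echo "sh $KICKSTART --claim-token $CLAIM_TOKEN --claim-rooms $CLAIM_ROOMS --claim-id $CLAIM_ID"
```

Each run produces a different ID, so the node registers as a new entry in the cloud. The older entries stay behind and have to be removed by hand.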
