I monitor a number of machines on Netdata Cloud, but all but one recently stopped displaying data from nvidia-smi, and nvidia-smi disappeared from the list of collectors on the right side of the dashboard. It is still enabled in /etc/netdata/python.d.conf, and I can run nvidia-smi as the netdata user. The affected machines are running v1.39.0-84-nightly on Ubuntu 20.04 and have 4 GPUs each. A single-GPU machine running v1.38.0-376-nightly on Ubuntu 22.04 continues to report GPU info properly.
Following the recommended troubleshooting steps, I discovered that /usr/libexec/netdata/plugins.d/python.d.plugin is missing. What could cause that to happen, and how do I fix it?
netdata@mach10:/usr/libexec/netdata/plugins.d$ ./python.d.plugin nvidia_smi debug trace
bash: ./python.d.plugin: No such file or directory
netdata@mach10:/usr/libexec/netdata/plugins.d$ ls
acl.sh alarm-test.sh cgroup-network-helper.sh ioping.plugin tc-qos-helper.sh
alarm-email.sh anonymous-statistics.sh ebpf.d loopsleepms.sh.inc template_dim.sh
alarm-notify.sh cgroup-name.sh get-kubernetes-labels.sh request.sh
alarm.sh cgroup-network health-cmdapi-test.sh system-info.sh
netdata@mach10:/usr/libexec/netdata/plugins.d$
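A quick way to confirm which plugin executables are absent, rather than reading the listing by eye, might look like the sketch below. It assumes the plugin file names visible in the process list of the machine that still works (python.d.plugin, go.d.plugin, apps.plugin, ebpf.plugin, nfacct.plugin); `missing_plugins` is a hypothetical helper name, not a Netdata tool.

```shell
# Hypothetical helper: print each expected plugin executable that is absent
# from the given plugins.d directory (defaults to the path used above).
# The expected names are an assumption taken from the working machine's
# process list; adjust the list for your own setup.
missing_plugins() {
    local dir="${1:-/usr/libexec/netdata/plugins.d}"
    local name
    for name in python.d.plugin go.d.plugin apps.plugin ebpf.plugin nfacct.plugin; do
        # -e is true for any existing file, regardless of permissions
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

Going by the `ls` output above, running `missing_plugins` on mach10 would be expected to print all five names, which suggests the problem is not specific to the Python plugin.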
It might be important that netdata.conf looks like it is missing most settings, including the [plugins] section, although plugins are enabled by default and the single-GPU machine running v1.38.0-376-nightly works with the same netdata.conf. I tried reinstalling netdata with kickstart.sh, stopping and restarting it with systemctl, etc., to no avail. I also tried replacing the netdata.conf file with netdatacli dumpconfig > /etc/netdata/netdata.conf
but I get a permission error that persists whether I run it with sudo or as the netdata user.
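One possible cause of the permission error, offered as an assumption because the exact sudo command line is not shown: with `sudo COMMAND > FILE`, the redirection is performed by the calling shell, which is unprivileged, so the write to /etc/netdata fails even though the command itself runs as root. A minimal sketch that routes the write through `tee` instead; `write_via` is a hypothetical helper name.

```shell
# Hypothetical helper: run a command and write its stdout to a file through
# `tee`, optionally prefixing both the command and `tee` with a wrapper.
# Usage: write_via WRAPPER TARGET COMMAND [ARGS...]
#   WRAPPER is "sudo" for a root-owned target, or "" for no wrapper.
# Note: `tee` truncates TARGET even if COMMAND fails, so back up the
# existing file first.
write_via() {
    local wrapper="$1" target="$2"
    shift 2
    if [ -n "$wrapper" ]; then
        "$wrapper" "$@" | "$wrapper" tee "$target" > /dev/null
    else
        "$@" | tee "$target" > /dev/null
    fi
}

# Intended use on the affected machine (assumes sudo rights; running
# `sudo -v` first avoids two password prompts competing in the pipeline):
#   sudo cp /etc/netdata/netdata.conf /etc/netdata/netdata.conf.bak
#   write_via sudo /etc/netdata/netdata.conf netdatacli dumpconfig
```

This only addresses the redirect. It would not explain the missing plugin files.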
ps aux | grep netdata shows only a few processes, and none of the plugins are running:
netdata 1861619 15.6 0.0 2057856 84248 ? SNsl 12:08 0:02 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
netdata 1861621 0.0 0.0 28016 9988 ? SNl 12:08 0:00 /usr/sbin/netdata --special-spawn-server
netdata 1862023 0.1 0.0 4288 3336 ? SN 12:08 0:00 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
andrew 1863032 0.0 0.0 17692 2456 pts/0 S+ 12:09 0:00 grep --color=auto netdata
On a machine that is still reporting GPU info, the same command gives:
netdata 3362233 2.6 0.3 1116736 107744 ? SNsl 09:15 4:46 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
netdata 3362236 0.0 0.0 27620 9696 ? SNl 09:15 0:00 /usr/sbin/netdata --special-spawn-server
root 3362622 0.0 0.0 8544 2416 ? SN 09:15 0:01 /usr/libexec/netdata/plugins.d/nfacct.plugin 1
netdata 3362624 3.2 0.0 140116 13596 ? SNl 09:15 5:47 /usr/libexec/netdata/plugins.d/apps.plugin 1
netdata 3362628 1.3 0.1 1661064 48844 ? SNl 09:15 2:25 /usr/bin/python3 /usr/libexec/netdata/plugins.d/python.d.plugin 1
netdata 3362630 0.0 0.0 0 0 ? ZN 09:15 0:05 [bash] <defunct>
netdata 3362633 0.5 0.1 775204 50572 ? SNl 09:15 0:58 /usr/libexec/netdata/plugins.d/go.d.plugin 1
root 3362637 0.0 0.1 642568 32924 ? SNl 09:15 0:03 /usr/libexec/netdata/plugins.d/ebpf.plugin 1
netdata 3362902 4.1 0.1 47100 43024 ? SN 09:15 7:20 /usr/bin/nvidia-smi -x -q -l 1
andrew 3526937 0.0 0.0 9208 2396 pts/10 S+ 12:13 0:00 grep --color=auto netdata
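To compare the two machines without scanning the full `ps` output, the plugin names can be extracted from the command lines. A minimal sketch, assuming GNU grep (for `-o`) and that plugins are launched from a path containing `plugins.d/`, as in both listings; `running_plugins` is a hypothetical helper name.

```shell
# Hypothetical helper: read process listings on stdin and print the unique
# names of executables under plugins.d that appear in them.
running_plugins() {
    grep -o 'plugins\.d/[^ ]*' | sed 's|^plugins\.d/||' | sort -u
}

# Usage on each machine:
#   ps -eo args | running_plugins
```

Going by the listings above, mach10 would be expected to show only tc-qos-helper.sh, while the working machine would show apps.plugin, ebpf.plugin, go.d.plugin, nfacct.plugin and python.d.plugin.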
I am guessing that this is a result of the 1.39 changes to how plugins are handled, but I am not sure what to install manually to get the plugins back.
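If the guess is right, one way to narrow it down is to see which netdata packages are actually installed. The sketch below assumes netdata was installed from native DEB packages and that plugins ship as separate packages with names starting with `netdata-plugin-` (for example `netdata-plugin-pythond`; that name is an assumption and should be verified with `apt-cache search netdata-plugin`). `installed_netdata_packages` is a hypothetical helper name.

```shell
# Hypothetical helper: read "package status" lines on stdin and print the
# packages whose status is exactly "installed".
installed_netdata_packages() {
    awk '$2 == "installed" { print $1 }'
}

# Usage on a Debian/Ubuntu machine (dpkg-query is part of dpkg):
#   dpkg-query -W -f='${Package} ${db:Status-Status}\n' 'netdata*' \
#       | installed_netdata_packages
# Comparing this output between mach10 and the working machine should show
# whether any plugin packages are missing on the affected one.
```

If no plugin packages appear on the affected machines, installing the missing ones with apt would be the thing to try. This does not apply to a static or source install.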