nvidia-smi stopped reporting, missing python.d.plugin file

I monitor a number of machines on Netdata Cloud, but all but one of them recently stopped displaying data from nvidia-smi, and nvidia-smi disappeared from the list of collectors on the right side of the dashboard. It is still enabled in /etc/netdata/python.d.conf, and I can run nvidia-smi as the netdata user. The affected machines are running v1.39.0-84-nightly on Ubuntu 20.04 and have 4 GPUs each. A single-GPU machine running v1.38.0-376-nightly on Ubuntu 22.04 continues to report GPU info properly.

Following the recommended troubleshooting steps, I discovered that /usr/libexec/netdata/plugins.d/python.d.plugin is missing. What could cause that to happen, and how do I fix it?

netdata@mach10:/usr/libexec/netdata/plugins.d$ ./python.d.plugin nvidia_smi debug trace
bash: ./python.d.plugin: No such file or directory
netdata@mach10:/usr/libexec/netdata/plugins.d$ ls
acl.sh		 alarm-test.sh		  cgroup-network-helper.sh  ioping.plugin	tc-qos-helper.sh
alarm-email.sh	 anonymous-statistics.sh  ebpf.d		    loopsleepms.sh.inc	template_dim.sh
alarm-notify.sh  cgroup-name.sh		  get-kubernetes-labels.sh  request.sh
alarm.sh	 cgroup-network		  health-cmdapi-test.sh     system-info.sh
netdata@mach10:/usr/libexec/netdata/plugins.d$ 
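A minimal sketch for checking the directory without reading the listing by eye. missing_plugins is a hypothetical helper, not something Netdata ships, and the plugin names passed to it are only examples; it prints each name that is not an executable regular file in the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of Netdata): print every named plugin
# that is not an executable regular file inside the given directory.
missing_plugins() {
    dir=$1
    shift
    for name in "$@"; do
        if [ ! -f "$dir/$name" ] || [ ! -x "$dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Example call (the plugin names here are examples only):
# missing_plugins /usr/libexec/netdata/plugins.d python.d.plugin apps.plugin
```

An empty result means every name that was passed in exists and is executable.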

It might be important that netdata.conf appears to be missing most settings, including the [plugins] section. However, plugins are enabled by default, and the single-GPU machine running v1.38.0-376-nightly works with the same netdata.conf. I tried reinstalling Netdata with kickstart.sh and stopping and restarting it with systemctl, to no avail. I also tried replacing the netdata.conf file with netdatacli dumpconfig > /etc/netdata/netdata.conf, but I get a permission error whether I run it with sudo or as the netdata user.

ps aux | grep netdata shows only a few processes, and the plugins are not among them.

netdata  1861619 15.6  0.0 2057856 84248 ?       SNsl 12:08   0:02 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
netdata  1861621  0.0  0.0  28016  9988 ?        SNl  12:08   0:00 /usr/sbin/netdata --special-spawn-server
netdata  1862023  0.1  0.0   4288  3336 ?        SN   12:08   0:00 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
andrew   1863032  0.0  0.0  17692  2456 pts/0    S+   12:09   0:00 grep --color=auto netdata

On a machine that is still reporting gpu info, the same command gives

netdata  3362233  2.6  0.3 1116736 107744 ?      SNsl 09:15   4:46 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
netdata  3362236  0.0  0.0  27620  9696 ?        SNl  09:15   0:00 /usr/sbin/netdata --special-spawn-server
root     3362622  0.0  0.0   8544  2416 ?        SN   09:15   0:01 /usr/libexec/netdata/plugins.d/nfacct.plugin 1
netdata  3362624  3.2  0.0 140116 13596 ?        SNl  09:15   5:47 /usr/libexec/netdata/plugins.d/apps.plugin 1
netdata  3362628  1.3  0.1 1661064 48844 ?       SNl  09:15   2:25 /usr/bin/python3 /usr/libexec/netdata/plugins.d/python.d.plugin 1
netdata  3362630  0.0  0.0      0     0 ?        ZN   09:15   0:05 [bash] <defunct>
netdata  3362633  0.5  0.1 775204 50572 ?        SNl  09:15   0:58 /usr/libexec/netdata/plugins.d/go.d.plugin 1
root     3362637  0.0  0.1 642568 32924 ?        SNl  09:15   0:03 /usr/libexec/netdata/plugins.d/ebpf.plugin 1
netdata  3362902  4.1  0.1  47100 43024 ?        SN   09:15   7:20 /usr/bin/nvidia-smi -x -q -l 1
andrew   3526937  0.0  0.0   9208  2396 pts/10   S+   12:13   0:00 grep --color=auto netdata
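The two listings above can be reduced to just the plugin names, which makes the difference between machines easier to see. This is a sketch under assumptions: running_plugins is a hypothetical helper, and it simply treats anything whose path contains /plugins.d/ as a plugin process, so helper scripts such as tc-qos-helper.sh are listed too.

```shell
#!/bin/sh
# Hypothetical helper: read ps-style lines on stdin and print the unique
# file names of anything launched from a plugins.d directory.
running_plugins() {
    sed -n 's|.*/plugins\.d/\([^[:space:]]*\).*|\1|p' | sort -u
}

# Example call:
# ps -eo args | running_plugins
```

Running it on both a working and an affected machine and comparing the output shows which plugin processes are absent.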

I am guessing that this is a result of the 1.39 changes to how plugins are handled, but I am not sure what to install manually to get the plugins back.

@Andrew : Sorry for the inconvenience.
@Tasos_Katsoulas @Austin_Hemmelgarn : Is this related to the same issue of plugins being built separately?

Hello @Andrew, and welcome to our community. Yes, this is a packaging issue that arose during the recent changes to the packages shipped with the Netdata Agent. Could you please update the Agent to the latest nightly?

Thanks for engaging. The affected machines have auto-updated to 1.39.0-89-nightly, but I see no change. I assume that is because the nvidia-smi plugin package is still missing. The netdata.conf still looks like this:

# netdata configuration
#
# You can download the latest version of this file, using:
#
#  wget -O /etc/netdata/netdata.conf http://localhost:19999/netdata.conf
# or
#  curl -o /etc/netdata/netdata.conf http://localhost:19999/netdata.conf
#
# You can uncomment and change any of the options below.
# The value shown in the commented settings, is the default value.
#

[global]
    run as user = netdata

    # default storage size - increase for longer data retention
    page cache size = 32
    dbengine multihost disk space = 256

Is there supposed to be a [plugins] section? I assume plugins are enabled by default, because that is what the previous documentation mentioned. I’m not sure what documentation is still valid after the packaging change.
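Whether a section is present and uncommented in a config file can be checked mechanically. The sketch below is only an illustration: has_section is a hypothetical helper, not a Netdata tool, and it only reports whether a line consisting of [name] exists. Since plugins are understood to be enabled by default, a missing [plugins] section does not by itself show that they are disabled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains an uncommented line that is
# exactly "[SECTION]" once leading and trailing blanks are removed.
has_section() {
    file=$1
    section=$2
    awk -v want="[$section]" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)
            sub(/[ \t]+$/, "", line)
            if (line == want) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Example call:
# has_section /etc/netdata/netdata.conf plugins && echo present || echo absent
```

Commented-out headers such as "# [plugins]" are deliberately not counted.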

I’ve since updated netdata.conf by copying the contents of http://localhost:19999/netdata.conf, which does not actually change anything since the options are commented out. The new netdata.conf does not appear to have an entry for nvidia-smi that I can uncomment.

I then reinstalled netdata using

wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh --reinstall

The nvidia-smi plugin is still missing, so the re-install doesn’t seem to include it.

Reinstalling with the recommended script didn’t work.

Since I do not have the nvidia-smi plugin anymore, do I have to completely purge netdata from my system to be able to install it with the plugin?

Hey, @Andrew. Try the steps in GitHub issue #15137 in netdata/netdata, "[Bug]: Lost Metric Windows".

Thanks! That partly worked.

I have some reporting now, but the data displayed is all zeros for both the nvidia-smi and sensors plugins, while the native programs show real values on the command line.