Problem specifically with nvidia-smi collection

Arch Linux - installed from kickstart script after purging old (working) netdata instance from AURepo

Hi. When I try to enable nvidia-smi collection in python.d.conf, none of the python collectors work. With nvidia-smi disabled, I can view HDD Temp and Sensors. With nvidia-smi enabled, nvidia-smi does not run, and neither do HDD Temp and Sensors.

I would expect this to work. Something that may be important is that when I try to use edit-config to edit nvidia-smi.conf I am not presented with a file. So I made the file myself.

I don’t think my homemade file is the problem, because it came from my old working install, but I do think the lack of auto-complete for the stock filenames could possibly be relevant to the problem I’m having. Is this a sign of something wrong with my install?

I’ve always struggled with nvidia-smi on netdata for some reason, but this time I’ve tried everything I know and can’t get it to work. I recently installed the newest version of Netdata from the kickstart script on a pi, and then I wanted the newest version on my main arch server too, which I wasn’t getting from my couple-year-old netdata instance from the AURepos. So I purged the package and then manually renamed all of the directories that were left over, hoping the script would install everything fresh.

I have stream collecting and everything else working, but I can’t get nvidia-smi to play nice. Any ideas?

Should we start with trying to run the plugin from the CLI? I have verified that netdata can execute nvidia-smi. When I look at the error logs, I see: “prometheus[nvidia_smi_exporter_local] Get "http://127.0.0.1:9454/metrics": dial tcp 127.0.0.1:9454: connect: connection refused”

Which is odd, because I’ve never installed prometheus.

Here is what I see when I try to run the plugin via CLI:

[sizz0p@arch python.d]$ sudo /usr/libexec/netdata/plugins.d/python.d.plugin nvidia_smi debug trace
2021-05-17 20:56:08: python.d INFO: plugin[main] : using python v3
2021-05-17 20:56:08: python.d DEBUG: plugin[main] : looking for 'python.d.conf' in ['/etc/netdata', '/usr/lib/netdata/conf.d']
2021-05-17 20:56:08: python.d DEBUG: plugin[main] : loading '/etc/netdata/python.d.conf'
2021-05-17 20:56:08: python.d ERROR: plugin[main] : error on loading '/etc/netdata/python.d.conf' : ParserError('while parsing a block mapping', <pyyaml3.error.Mark object at 0x7ff321d05e20>, "expected <block end>, but found '<block mapping start>'", <pyyaml3.error.Mark object at 0x7ff321c4ceb0>)
2021-05-17 20:56:08: python.d ERROR: plugin[main] : Traceback (most recent call last):
  File "/usr/libexec/netdata/plugins.d/python.d.plugin", line 516, in load_config
    config = load_config(abs_path)
  File "/usr/libexec/netdata/python.d/python_modules/bases/loaders.py", line 46, in load_config
    return load_yaml(stream)
  File "/usr/libexec/netdata/python.d/python_modules/bases/loaders.py", line 39, in load_yaml
    return loader.get_single_data()
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/constructor.py", line 36, in get_single_data
    node = self.get_single_node()
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/composer.py", line 37, in get_single_node
    document = self.compose_document()
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/composer.py", line 56, in compose_document
    node = self.compose_node(None, None)
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/composer.py", line 85, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/composer.py", line 128, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/parser.py", line 99, in check_event
    self.current_event = self.state()
  File "/usr/libexec/netdata/python.d/python_modules/pyyaml3/parser.py", line 439, in parse_block_mapping_key
    raise ParserError("while parsing a block mapping", self.marks[-1],
pyyaml3.parser.ParserError: while parsing a block mapping
  in "/etc/netdata/python.d.conf", line 10, column 1
expected <block end>, but found '<block mapping start>'
  in "/etc/netdata/python.d.conf", line 77, column 2

…and now I’m gonna have to buy myself a beer. I feel like I went through this 3-4 years ago with nvidia-smi on netdata, so I’m going to leave this post with the solution for myself, the curious and the bored.

I was running: /usr/libexec/netdata/plugins.d/python.d.plugin nvidia_smi.chart.py debug trace

and I saw output saying that my python.d.conf file was not valid, followed by a traceback.

I think a control character was messed up in python.d.conf. I had brought the file along from the old netdata instance when the config files weren’t auto-completing via cli (even though the stock directory is properly defined), so there’s no telling how it may have been mangled.
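For anyone who wants to see what was actually wrong before throwing the file away: the traceback names line 77, column 2 of python.d.conf, so that line can be inspected directly. Below is a minimal sketch, assuming GNU sed, grep and cat are available; show_line and find_indented are hypothetical helper names, not part of Netdata. One common cause of this particular parser error (an assumption here, the log does not confirm it) is a top-level key with a stray leading space, for example when only the # is removed from a commented-out line.

```shell
# Hypothetical helpers for inspecting a flat "key: value" YAML file.

# Print line number $2 of file $1 with non-printing characters made
# visible (GNU cat -A shows tabs as ^I and marks line ends with $).
show_line() {
    sed -n "${2}p" "$1" | cat -A
}

# List every non-comment line that starts with whitespace, prefixed
# with its line number. In a file where all keys are top-level,
# these lines are the usual suspects for a block mapping error.
find_indented() {
    grep -nE '^[[:space:]]+[^#[:space:]]' "$1"
}
```

Possible usage against the file from the traceback: show_line /etc/netdata/python.d.conf 77, then find_indented /etc/netdata/python.d.conf to check the rest of the file.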

After removing the python.d.conf file I ran the above command again and saw that nvidia_smi was now disabled. So then I ran echo "nvidia_smi: yes" >> /etc/netdata/python.d.conf as root and ran the command for the python collector again. This time it produced intelligible output.
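The same reset can be done without losing the old file. This is only a sketch of the steps described above, not an official procedure; reset_pyconf is a hypothetical helper name, and the .bak suffix is my own choice.

```shell
# reset_pyconf is a hypothetical helper, not a Netdata command.
# It keeps the old file as <name>.bak instead of deleting it, then
# writes a one-line config that enables the nvidia_smi module.
reset_pyconf() {
    conf="$1"
    if [ -f "$conf" ]; then
        mv "$conf" "$conf.bak"
    fi
    printf 'nvidia_smi: yes\n' > "$conf"
}
```

Run as root, this would be reset_pyconf /etc/netdata/python.d.conf, followed by the same python.d.plugin debug command as before to confirm the config now parses.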

So I restarted netdata and now the nvidia chart is there! The log lines about prometheus were a red herring, and despite having a pretty solid understanding of the netdata config and how it’s nested, they caused me to get stuck on the wrong aspects of the problem, rather than going back to the fundamentals. Live and learn.

Hey @sizz0p, that is cool that you managed to find and fix your problem :+1:

If there is a problem with a python collector, it is always a good idea to start by running the plugin in debug mode.
