Freeipmi.plugin - internal system error

Hi,

I need help. On my CentOS 7 server, freeipmi doesn’t work because of this error:

  • freeipmi.plugin ERROR : MAIN : ipmi_monitoring_sensor_readings_by_record_id(): internal system error (errno 9, Bad file descriptor)
  • freeipmi.plugin FATAL : freeipmi.plugin : freeipmi.plugin: data collection failed. # : Success

Does anyone know how to solve this?

@Saruspete any thoughts?

The issue seems to come from freeipmi itself. The “internal system error” message maps to ipmi_monitoring_errmsgs in libipmimonitoring/ipmi_monitoring.c.

The “errno 9, Bad file descriptor” part points to a failed syscall, most likely caused by a hardware-specific implementation issue.

To get more details without recompiling freeipmi in debug mode, strace should help find the culprit. @Alex, can you run the following command and post the (lengthy) result, as a file or a gist if possible?

sudo strace -fyy -s 1024 ./freeipmi.plugin
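
Since the trace is lengthy, it may help to save it to a file and look only at the calls that failed with EBADF. The following is a minimal sketch, not something posted in the thread: find_ebadf is a hypothetical helper name and /tmp/freeipmi.strace is an assumed output path (strace's -o option writes the trace to a file instead of stderr).

```shell
# Hypothetical helper (not part of Netdata or strace): print, with line
# numbers, every traced call in a saved strace log that failed with EBADF.
find_ebadf() {
    grep -n -e '= -1 EBADF' "$1"
}

# Example (assumed output path):
#   sudo strace -fyy -s 1024 -o /tmp/freeipmi.strace ./freeipmi.plugin
#   find_ebadf /tmp/freeipmi.strace
```

Because the trace was taken with -yy, each matching line should show which path or socket the file descriptor referred to, if it was still open at the time of the call.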

Hi @Saruspete ,

On the server in question, freeipmi runs fine when I start it in debug mode. The error only appears when it runs under the Netdata daemon.

In debug mode, that’s ok:

Ok, so it’s working as expected when you’re running it with an interactive user, but not when running as a service, is that correct?

If you run the netdata daemon in the foreground (sbin/netdata -D) as the target user, does it also fail with the error?
I suspect it’s due to the sandboxing in the systemd service definition.

it’s working as expected when you’re running it with an interactive user, but not when running as a service

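If it is due to CapabilityBoundingSet restrictions, I would try to reset it and see if it helps. Setting it to ~ with no capability names resets the bounding set to the full set of available capabilities.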

# as root
mkdir -p /etc/systemd/system/netdata.service.d
echo -e '[Service]\nCapabilityBoundingSet=~' | tee /etc/systemd/system/netdata.service.d/unset-capability-bounding-set.conf
systemctl daemon-reload
systemctl restart netdata.service

I’m running the netdata daemon in the foreground and it doesn’t give an error, but it stays here:

It doesn’t send much output to stdout; as soon as the log files are created, all logging goes there.
So you should be able to connect to the interface and see whether IPMI is displayed.

Can you please test what @ilyam8 proposed? It’s a clean override of the service’s systemd definition.

Thanks, the @ilyam8 solution worked.

@ilyam8 and @Saruspete
In the end, it only worked on some servers, not on all!
Some servers continue with the same error.

Can you check that the override file is actually being used by systemd, with systemctl cat netdata.service?

Yes, the file is correctly used.

#/etc/systemd/system/netdata.service.d/startup.conf
[Unit]
Wants=fscrypt.service
After=
After=fscrypt.service
#/etc/systemd/system/netdata.service.d/unset-capability-bounding-set.conf
[Service]
CapabilityBoundingSet=~
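
As an additional check alongside systemctl cat, the bounding set that the running process actually received can be read from /proc. The following is a minimal sketch, not something from the thread: cap_bnd_of and has_cap are hypothetical helper names, and bit 17 is CAP_SYS_RAWIO as numbered in linux/capability.h.

```shell
# Hypothetical helpers for inspecting a process's capability bounding set.

# Print the CapBnd hex mask of the given PID, as reported by the kernel.
cap_bnd_of() {
    awk '/^CapBnd:/ { print $2 }' "/proc/$1/status"
}

# Succeed if capability number $2 is set in hex mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example (assumes a single netdata process, so pidof prints one PID):
#   mask=$(cap_bnd_of "$(pidof netdata)")
#   has_cap "$mask" 17 && echo "CAP_SYS_RAWIO is in the bounding set"
```

If the mask read from the running daemon does not match what the override file requests, the override is probably not being applied to that process.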

@Alex same symptoms? Works when running in the debug mode and doesn’t work when running as a service?

Yes @ilyam8, exactly the same symptoms.
On the servers that were working, I changed CapabilityBoundingSet again: instead of CapabilityBoundingSet=~, it is now just CapabilityBoundingSet=CAP_SYS_RAWIO.

On those that do not work, I left it as CapabilityBoundingSet=~.

Hi @ilyam8 and @Saruspete ,

I installed MegaCli on all servers, and it fails on the same servers where IPMI is having problems. It looks like the cause is common to both.
Both work well in debug mode; they just don’t work under the netdata daemon.
Do you have any idea what may be causing this problem and how to solve it?

As it’s CentOS 7, is SELinux enabled (check with getenforce)?

Note: is there a script that can currently do these checks / reporting, or did it stay as an idea?

It is not a problem with selinux.
All server configurations are the same (those that are and are not working).
I even have a server on which IPMI was working without this error, and as soon as I applied the new configuration to get MegaCli working as well, it stopped working and started producing that error.

Even if the chance is low, I’m still asking: is there any way we could have access to the server?
Such symptoms seem to be environment related, and as such are highly difficult to reproduce and fix.

Otherwise, maybe check for per-user limitations, like cgroups, ulimits, etc.? freeipmi & megacli both require privileges to read /dev/ipmi/0 and /dev/sd*.
Maybe we can strace the netdata daemon from within the systemd service, but as both plugins require caps or setuid to work, if I remember well, tracing will cause them to lose these privileges and hence fail. Maybe we can also try running netdata as root (changing the systemd unit & netdata config).

[Service]
ExecStart=
ExecStart=/usr/bin/strace -fyytt -s 1024 -o /tmp/netdata.strace /home/netdata/usr/sbin/netdata -P /run/netdata/netdata.pid -D

Note: the empty ExecStart= line is required first, to clear the original command before the new one is set.
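
For the per-user limits check suggested above, the kernel exposes what applies to a running process under /proc on Linux. This is a minimal sketch under that assumption; show_limits is a hypothetical helper name, not a Netdata tool.

```shell
# Hypothetical helper: show the resource limits and cgroup membership
# that the kernel applies to a given PID (Linux /proc layout assumed).
show_limits() {
    echo "== limits for PID $1 =="
    cat "/proc/$1/limits"
    echo "== cgroups for PID $1 =="
    cat "/proc/$1/cgroup"
}

# Example (assumes a single netdata process, so pidof prints one PID):
#   show_limits "$(pidof netdata)"
```

Comparing the output for the daemon started by systemd with the output for a foreground run could show whether a limit such as "Max open files" differs between the two.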

Thanks @Saruspete for your time spent helping me.

But in the end, the problem was with systemd. I updated systemd and the plugins soon started working.

I had the same problem on Debian 11 and this resolved it.

Also, /usr/libexec/netdata/plugins.d/freeipmi.plugin was installed with group root by the netdata-plugin-freeipmi package. Before I found the solution above, I was seeing the error sh: 1: exec: /usr/libexec/netdata/plugins.d/freeipmi.plugin: Permission denied (in /var/log/netdata/error.log). Since everything else in /usr/libexec/netdata/plugins.d/ was in the netdata group, I ran chgrp netdata freeipmi.plugin, and that’s when I started seeing the error messages from the original post.
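
To spot this kind of group mismatch quickly, the plugins directory can be scanned for files whose group differs from the expected one. A minimal sketch, assuming a find that supports -maxdepth (GNU and BSD find do); files_not_in_group is a hypothetical helper name.

```shell
# Hypothetical helper: list regular files directly under DIR ($1) whose
# group is not GROUP ($2). GROUP may be a name or a numeric GID.
files_not_in_group() {
    find "$1" -maxdepth 1 -type f ! -group "$2"
}

# Example, using the path and group from the post above:
#   files_not_in_group /usr/libexec/netdata/plugins.d netdata
```

An empty result means every plugin file already belongs to the expected group.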