net_rx_latency_ms=10395117 bug

Problem/Question

On one Linux server out of many, I’m getting insane results for softirq latency. On top of that, it only reports a value every so often, not every second or every minute…

Relevant docs you followed/actions you took to solve the issue

Environment/Browser/Agent’s version etc

Linux, v1.38.0-18-nightly, Chrome

What I expected to happen

Something like this: (Taken from another node)

Hello @ben ,

Do you have messages like these

"Cannot read /proc/softirqs, zero lines reported."
"PLUGIN: PROC_SOFTIRQS: Cannot find the number of CPUs in /proc/softirqs"

inside your /var/log/netdata/collector.log?

If you do not, can you share with us the output of the command

grep -w proc /var/log/netdata/collector.log

?
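A quick way to check for both messages at once is sketched below. The function name `check_softirq_errors` is made up for illustration, and the default log path is simply the one mentioned above; adjust it if your install logs elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: count the /proc/softirqs messages quoted above
# in a Netdata collector log. The default path is an assumption.
check_softirq_errors() {
    log="${1:-/var/log/netdata/collector.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # -c prints the number of matching lines; -E enables the | alternation.
    count=$(grep -c -E 'Cannot read /proc/softirqs|PROC_SOFTIRQS' "$log")
    echo "$count"
    # Succeed only when at least one matching line was found.
    [ "$count" -gt 0 ]
}
```

Running `check_softirq_errors` with no argument reads the default log; a printed `0` means neither message is present.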

If you prefer, you can send an email to thiago@netdata.cloud.

The fact that this is happening on only one host suggests a problem collecting data on that host.

We also expect not to deliver more charts like this, because the issue was already fixed by the PR "do not report dimensions that failed to be queried" by ktsaou (Pull Request #14447, netdata/netdata on GitHub).

Best regards!

cat /var/log/netdata/collector.log | grep "softirq"

Completely empty.

grep -w proc /var/log/netdata/collector.log

Returns

2023-02-09 17:39:31:  ERROR : MAIN : Cannot open /proc/1325956/status
2023-02-09 17:40:25:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1327355/status' (errno 2, No such file or directory)
2023-02-09 17:40:25:  ERROR : MAIN : Cannot open /proc/1327355/status
2023-02-09 17:41:20:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1330880/status' (errno 2, No such file or directory)
2023-02-09 17:41:20:  ERROR : MAIN : Cannot open /proc/1330880/status
2023-02-09 17:42:14:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1338159/status' (errno 2, No such file or directory)
2023-02-09 17:42:14:  ERROR : MAIN : Cannot open /proc/1338159/status
2023-02-09 17:43:08:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1340535/status' (errno 2, No such file or directory)
2023-02-09 17:43:08:  ERROR : MAIN : Cannot open /proc/1340535/status
2023-02-09 17:44:03:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1345362/status' (errno 2, No such file or directory)
2023-02-09 17:44:03:  ERROR : MAIN : Cannot open /proc/1345362/status
2023-02-09 17:44:57:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1349008/status' (errno 2, No such file or directory)
2023-02-09 17:44:57:  ERROR : MAIN : Cannot open /proc/1349008/status
2023-02-09 17:45:51:  ERROR : MAIN : PROCFILE: Cannot open file '/proc/1350483/status' (errno 2, No such file or directory)
2023-02-09 17:45:51:  ERROR : MAIN : Cannot open /proc/1350483/status
....

For more information

cat  /var/log/netdata/collector.log | grep "proc" | grep -v "No such file" | grep -v "/status"
2023-02-09 06:59:30: nfacct.plugin INFO  : MAIN : NFACCT process exiting
2023-02-09 07:09:42: netdata INFO  : P[proc netdev] : cleaning up...
2023-02-09 07:09:45: netdata INFO  : P[proc] : cleaning up...
2023-02-09 07:10:06: go.d INFO: main[main] using config: enabled 'true', default_run 'true', max_procs '0'
2023-02-09 07:10:07: go.d ERROR: prometheus[proc_exporter_local] Get "http://127.0.0.1:9227/metrics": dial tcp 127.0.0.1:9227: connect: connection refused
2023-02-09 07:10:07: go.d ERROR: prometheus[proc_exporter_local] check failed
2023-02-09 07:10:07: go.d ERROR: prometheus[vector_packet_process_vpp_exporter_local] Get "http://127.0.0.1:9482/metrics": dial tcp 127.0.0.1:9482: connect: connection refused
2023-02-09 07:10:07: go.d ERROR: prometheus[vector_packet_process_vpp_exporter_local] check failed
2023-02-09 07:10:07: go.d ERROR: prometheus[opvizor_performance_analyzer_process_exporter_local] Get "http://127.0.0.1:9585/metrics": dial tcp 127.0.0.1:9585: connect: connection refused
2023-02-09 07:10:07: go.d ERROR: prometheus[opvizor_performance_analyzer_process_exporter_local] check failed
2023-02-09 07:10:07: go.d ERROR: prometheus[exporter_for_grouped_process_local] Get "http://127.0.0.1:9644/metrics": dial tcp 127.0.0.1:9644: connect: connection refused
2023-02-09 07:10:07: go.d ERROR: prometheus[exporter_for_grouped_process_local] check failed
2023-02-09 07:10:07: go.d ERROR: prometheus[processor_counter_monitor_exporter_local] Get "http://127.0.0.1:9738/metrics": dial tcp 127.0.0.1:9738: connect: connection refused
2023-02-09 07:10:07: go.d ERROR: prometheus[processor_counter_monitor_exporter_local] check failed
2023-02-09 11:10:10: nfacct.plugin INFO  : MAIN : NFACCT process exiting
2023-02-09 15:10:20: nfacct.plugin INFO  : MAIN : NFACCT process exiting

This is super strange, because in this post ([GONE] Firewall(netfilter) > netlink > connection tracker `errors`/`searches`, #14 by Austin_Hemmelgarn) I got the answer that nfacct isn’t installed with a default netdata installation. But I have it, on 2 nodes (I can see plenty of information in the connection tracker, such as search restart, etc.). And both of these nodes have the bug mentioned in this topic.

Hello @ben ,

Sorry for the delay in answering.
Thank you for the output. The logs do not show any error from the thread that collects this data.
The errors returned by grep -w proc probably come from apps or ebpf, which try to parse these files.
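To back up that reading, here is a small sketch that counts how many distinct PIDs appear in the `/proc/<pid>/status` errors. The function name `count_vanished_pids` is made up for illustration. Errno 2 is ENOENT, so a large number of different PIDs would be consistent with short-lived processes that exited before their status file could be opened.

```shell
#!/bin/sh
# Hypothetical helper: count distinct PIDs in the PROCFILE "Cannot open file"
# errors of a collector log (log format taken from the output pasted above).
count_vanished_pids() {
    grep -o "Cannot open file '/proc/[0-9]*/status'" "$1" \
        | grep -o '[0-9][0-9]*' \
        | sort -u \
        | wc -l \
        | tr -d ' '   # some wc implementations pad the count with spaces
}
```

For example, `count_vanished_pids /var/log/netdata/collector.log` prints the number of unique PIDs involved.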
As for the nfacct plugin, I will let @Austin_Hemmelgarn speak about it, because I normally compile netdata from scratch.
I did not initially think of a connection between nfacct and this error, but I cannot rule it out completely, because there is a relationship between IPTABLES (Netfilter) and softirq.
@Austin_Hemmelgarn is it possible to run an rpm SOMETHING or dpkg SOMETHING that shows the relationship between netdata packages and their dependencies?
@ben are you using the official netdata repo? Or did you get netdata from your distribution's repo?

Best regards!

I used netdata kickstart.sh
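For the packaging question raised above, a rough sketch of how one might check where nfacct.plugin lives and what the package manager reports. Everything here is an assumption to adjust: the function names are made up, the package name `netdata` may differ, and the two plugin directories are only the usual guesses for a native package and a static install under /opt/netdata.

```shell
#!/bin/sh
# Hypothetical helpers; function names, paths, and package name are assumptions.

# Print the first executable nfacct.plugin found in the given directories.
find_nfacct_plugin() {
    for dir in "$@"; do
        if [ -x "$dir/nfacct.plugin" ]; then
            echo "$dir/nfacct.plugin"
            return 0
        fi
    done
    return 1
}

# Show what the package manager knows about the (assumed) "netdata" package.
show_netdata_deps() {
    if command -v dpkg >/dev/null 2>&1; then
        # dpkg -s prints package metadata, including Depends/Recommends.
        dpkg -s netdata 2>/dev/null | grep -E '^(Version|Depends|Recommends)' || true
    elif command -v rpm >/dev/null 2>&1; then
        # rpm -qR lists the capabilities the installed package requires.
        rpm -qR netdata 2>/dev/null || true
    fi
}
```

A possible invocation is `find_nfacct_plugin /usr/libexec/netdata/plugins.d /opt/netdata/usr/libexec/netdata/plugins.d` followed by `show_netdata_deps`. If the plugin only turns up under /opt/netdata and the package manager prints nothing, that would point to a static build rather than a native package.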