Netdata loses downed network links

We have two LACP bonds on this host:

  • bond0: ten0 + ten1 physdevs
  • bond2: ten2 + ten3 physdevs
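
Bond membership and per-slave link state can be checked from the bonding driver's /proc interface; the field names are the standard ones, the command is shown as a sketch rather than captured output:

[root@ovirt-host4 ~]# grep -E 'Slave Interface|MII Status' /proc/net/bonding/bond2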

The case: we had some TX issues on a switch-side SFP+ transceiver. The problem link appeared to be ten3; to troubleshoot it, I set the ten2 interface down, leaving bond2 running over the suspect link only:

ip link set ten2 down

The kernel:

[Thu Sep 15 14:23:50 2022] i40e 0000:04:00.0 ten2: NIC Link is Down
[Thu Sep 15 14:23:50 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 14:23:50 2022] bond2: active interface up!

Then netdata lost this link forever:

Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 duplex state by reading '/sys/class/net/ten2/duplex'. I will stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 carrier state by reading '/sys/class/net/ten2/carrier'. Stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 speed by reading '/sys/class/net/ten2/speed'. Will not update its speed anymore. (errno 22, Invalid argument)
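
This errno is expected kernel behaviour: once an interface is administratively down, reads of its speed, duplex and carrier sysfs attributes fail with EINVAL, which is exactly what the proc plugin is hitting. It is reproducible by hand (illustrative, not captured from this host):

[root@ovirt-host4 ~]# cat /sys/class/net/ten2/carrier
cat: /sys/class/net/ten2/carrier: Invalid argument
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/speed
cat: /sys/class/net/ten2/speed: Invalid argument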

After we replaced the SFP+ transceiver for the ten3 link and fixed the packet loss issues, I brought the ten2 link back up.
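Presumably with the inverse of the earlier command:

ip link set ten2 up

The kernel: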

[Thu Sep 15 20:04:41 2022] 8021q: adding VLAN 0 to HW filter on device ten2
[Thu Sep 15 20:04:41 2022] i40e 0000:04:00.0: ten2 is entering allmulti mode.
[Thu Sep 15 20:04:41 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 20:04:42 2022] i40e 0000:04:00.0 ten2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Sep 15 20:04:42 2022] bond2: (slave ten2): link status definitely up, 10000 Mbps full duplex
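
At this point the sysfs attributes read normally again by hand (illustrative values for a 10 Gbps full-duplex link):

[root@ovirt-host4 ~]# cat /sys/class/net/ten2/carrier
1
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/speed
10000
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/duplex
full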

But netdata did not pick this link back up. The only solution I found was to restart netdata :smiling_face_with_tear:
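
For completeness, the restart is just the stock service restart (assuming systemd here):

systemctl restart netdata

The metrics dump below was taken while ten2 was still stuck, i.e. before that restart: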

[root@ovirt-host4 ~]# curl -Ss "http://localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts&server=localhost&source=average&variables=yes" 2>&1 | grep -E 'ten' | grep -E 'operstate|duplex|carrier'
netdata_net_carrier_state_average{chart="net_carrier.ten0",family="ten0",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten0",family="ten0",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten0",family="ten0",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten2",family="ten2",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten1",family="ten1",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten1",family="ten1",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten1",family="ten1",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten3",family="ten3",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten3",family="ten3",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten3",family="ten3",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000

Maybe it is possible to tune netdata.conf to avoid this issue?
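
One thing worth checking is the effective configuration that netdata serves back over its own API; the section name here is my guess at where the proc plugin's netdev options live:

[root@ovirt-host4 ~]# curl -Ss "http://localhost:19999/netdata.conf" | grep -A 20 'plugin:proc:/proc/net/dev'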

[root@ovirt-host4 ~]# netdata -V
netdata v1.34.1

Hi, @k0ste. Open a bug report.
