Netdata loses downed network links

We have two LACP bonds on this host:

  • bond0: ten0 + ten1 physdevs
  • bond2: ten2 + ten3 physdevs
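
Bond membership and per-slave link state can be checked from the bonding driver's /proc interface; the field names are the standard ones, the command is shown as a sketch rather than captured output:

[root@ovirt-host4 ~]# grep -E 'Slave Interface|MII Status' /proc/net/bonding/bond2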

The case: we had some TX issues on a switch-side SFP+ transceiver. The problem link appeared to be ten3; to troubleshoot it, I set the ten2 interface down, leaving bond2 running over the suspect link only:

ip link set ten2 down

The kernel:

[Thu Sep 15 14:23:50 2022] i40e 0000:04:00.0 ten2: NIC Link is Down
[Thu Sep 15 14:23:50 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 14:23:50 2022] bond2: active interface up!

Then netdata lost this link forever:

Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 duplex state by reading '/sys/class/net/ten2/duplex'. I will stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 carrier state by reading '/sys/class/net/ten2/carrier'. Stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 speed by reading '/sys/class/net/ten2/speed'. Will not update its speed anymore. (errno 22, Invalid argument)
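
This errno is expected kernel behaviour: once an interface is administratively down, reads of its speed, duplex and carrier sysfs attributes fail with EINVAL, which is exactly what the proc plugin is hitting. It is reproducible by hand (illustrative, not captured from this host):

[root@ovirt-host4 ~]# cat /sys/class/net/ten2/carrier
cat: /sys/class/net/ten2/carrier: Invalid argument
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/speed
cat: /sys/class/net/ten2/speed: Invalid argument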

After we replaced the SFP+ transceiver for the ten3 link and fixed the packet loss issues, I brought the ten2 link back up.
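Presumably with the inverse of the earlier command:

ip link set ten2 up

The kernel: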

[Thu Sep 15 20:04:41 2022] 8021q: adding VLAN 0 to HW filter on device ten2
[Thu Sep 15 20:04:41 2022] i40e 0000:04:00.0: ten2 is entering allmulti mode.
[Thu Sep 15 20:04:41 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 20:04:42 2022] i40e 0000:04:00.0 ten2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Sep 15 20:04:42 2022] bond2: (slave ten2): link status definitely up, 10000 Mbps full duplex
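
At this point the sysfs attributes read normally again by hand (illustrative values for a 10 Gbps full-duplex link):

[root@ovirt-host4 ~]# cat /sys/class/net/ten2/carrier
1
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/speed
10000
[root@ovirt-host4 ~]# cat /sys/class/net/ten2/duplex
full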

But netdata did not pick this link back up. The only solution I found was to restart netdata :smiling_face_with_tear:
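
For completeness, the restart is just the stock service restart (assuming systemd here):

systemctl restart netdata

The metrics dump below was taken while ten2 was still stuck, i.e. before that restart: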

[root@ovirt-host4 ~]# curl -Ss "http://localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts&server=localhost&source=average&variables=yes" 2>&1 | grep -E 'ten' | grep -E 'operstate|duplex|carrier'
netdata_net_carrier_state_average{chart="net_carrier.ten0",family="ten0",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten0",family="ten0",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten0",family="ten0",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten2",family="ten2",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten1",family="ten1",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten1",family="ten1",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten1",family="ten1",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten3",family="ten3",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten3",family="ten3",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten3",family="ten3",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000

Maybe it is possible to tune netdata.conf to avoid this issue?
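
One thing worth checking is the effective configuration that netdata serves back over its own API; the section name here is my guess at where the proc plugin's netdev options live:

[root@ovirt-host4 ~]# curl -Ss "http://localhost:19999/netdata.conf" | grep -A 20 'plugin:proc:/proc/net/dev'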

[root@ovirt-host4 ~]# netdata -V
netdata v1.34.1

Hi, @k0ste. Open a bug report.
