We have two LACP bonds on the host:
- bond0: physdevs ten0 + ten1
- bond2: physdevs ten2 + ten3
The case: some TX issues on a switch-side SFP+ transceiver. The problem link appeared to be ten3, so to troubleshoot it I set the ten2 interface down:
ip link set ten2 down
The kernel log:
[Thu Sep 15 14:23:50 2022] i40e 0000:04:00.0 ten2: NIC Link is Down
[Thu Sep 15 14:23:50 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 14:23:50 2022] bond2: active interface up!
Then netdata lost this link forever:
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 duplex state by reading '/sys/class/net/ten2/duplex'. I will stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 carrier state by reading '/sys/class/net/ten2/carrier'. Stop updating it. (errno 22, Invalid argument)
Sep 15 14:23:51 ovirt-host4.opentech.local netdata[72766]: 2022-09-15 14:23:51: netdata ERROR : PLUGIN[proc] : Cannot refresh interface ten2 speed by reading '/sys/class/net/ten2/speed'. Will not update its speed anymore. (errno 22, Invalid argument)
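The errno 22 itself is expected kernel behaviour: when an interface has no active link (and also for virtual devices such as lo, which have no PHY at all), reading `speed` or `duplex` from sysfs fails with EINVAL. A minimal reproduction in Python, using lo since it always answers EINVAL for these attributes:

```python
import errno

def sysfs_link_attr(iface: str, attr: str):
    """Return the sysfs attribute value as a string, or the errno if the read fails."""
    try:
        with open(f"/sys/class/net/{iface}/{attr}") as f:
            return f.read().strip()
    except OSError as e:
        return e.errno

# lo has no PHY, so the kernel answers EINVAL for speed/duplex --
# the same error netdata logged for ten2 while its link was down.
print(sysfs_link_attr("lo", "speed") == errno.EINVAL)
```

So the error is transient by nature; the issue is that netdata treats it as permanent ("I will stop updating it") instead of retrying once the link comes back.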
After we replaced the SFP+ transceiver on the ten3 link and the packet loss was gone, I brought ten2 back up:
ip link set ten2 up
[Thu Sep 15 20:04:41 2022] 8021q: adding VLAN 0 to HW filter on device ten2
[Thu Sep 15 20:04:41 2022] i40e 0000:04:00.0: ten2 is entering allmulti mode.
[Thu Sep 15 20:04:41 2022] bond2: (slave ten2): link status definitely down, disabling slave
[Thu Sep 15 20:04:42 2022] i40e 0000:04:00.0 ten2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Sep 15 20:04:42 2022] bond2: (slave ten2): link status definitely up, 10000 Mbps full duplex
But netdata never picks this link back up. The only fix I have found is restarting netdata. After bringing ten2 up, it still exposes only operstate, with no carrier, duplex, or speed metrics:
[root@ovirt-host4 ~]# curl -Ss "http://localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts&server=localhost&source=average&variables=yes" 2>&1 | grep -E 'ten' | grep -E 'operstate|duplex|carrier'
netdata_net_carrier_state_average{chart="net_carrier.ten0",family="ten0",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten0",family="ten0",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten0",family="ten0",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten2",family="ten2",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten1",family="ten1",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten1",family="ten1",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten1",family="ten1",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
netdata_net_carrier_state_average{chart="net_carrier.ten3",family="ten3",dimension="carrier",instance="ovirt-host4.opentech.local"} 1.0000000 1663248091000
netdata_net_operstate_state_average{chart="net_operstate.ten3",family="ten3",dimension="state",instance="ovirt-host4.opentech.local"} 6.0000000 1663248091000
netdata_net_duplex_state_average{chart="net_duplex.ten3",family="ten3",dimension="duplex",instance="ovirt-host4.opentech.local"} 2.0000000 1663248091000
Is it possible to tune netdata.conf to avoid this issue?
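For reference, this is the kind of fragment I would expect to control that collector, a hedged sketch only: the section name matches the proc plugin's chart families, but the option names are an assumption on my side and none of them is confirmed to make netdata retry a stopped attribute.

```ini
# netdata.conf -- sketch, option names assumed, not verified
[plugin:proc:/proc/net/dev]
    speed for all interfaces = yes
    duplex for all interfaces = yes
    carrier for all interfaces = yes
```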