Reachability alerts not particularly reliable

For a few months now I have been seeing false "unreachable" notifications, across a mix of hosts and client versions. The reachability tests do not seem particularly stable, possibly depending on network conditions.

  1. Hosts that are online, with the Netdata agent working (confirmed: no errors, reachable locally, service up), are reported offline at seemingly random times and intervals.
  2. I often receive "unreachable" notifications (webhooks) where the matching "reachable again" webhook is either missing or arrives with a significant delay (>10 min).
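
To put numbers on the second point, the received webhooks can be logged and paired up afterwards. The following is only a rough sketch: the `(timestamp, host, status)` tuple layout and the `"unreachable"` / `"reachable"` labels are my own assumptions for illustration, not the actual Netdata webhook payload, so the parsing would need adapting to whatever your receiver stores.

```python
from datetime import datetime, timedelta


def find_suspect_alerts(events, max_delay=timedelta(minutes=10)):
    """Pair each 'unreachable' event with the next 'reachable' one per host.

    events: iterable of (timestamp, host, status) tuples. The status labels
            'unreachable' and 'reachable' are hypothetical names.
    Returns a list of (host, down_at, up_at, delay) tuples for alerts whose
    recovery took longer than max_delay, followed by entries with
    up_at=None and delay=None for alerts that never got a recovery.
    """
    open_alerts = {}  # host -> timestamp of the first unmatched 'unreachable'
    suspects = []
    for ts, host, status in sorted(events):
        if status == "unreachable":
            # Keep the earliest unmatched alert if several arrive in a row.
            open_alerts.setdefault(host, ts)
        elif status == "reachable" and host in open_alerts:
            down_at = open_alerts.pop(host)
            delay = ts - down_at
            if delay > max_delay:
                suspects.append((host, down_at, ts, delay))
    # Anything still open never received a "reachable again" webhook.
    for host, down_at in sorted(open_alerts.items()):
        suspects.append((host, down_at, None, None))
    return suspects


if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    t0 = datetime(2024, 1, 1, 12, 0)
    log = [
        (t0, "host-a", "unreachable"),
        (t0 + timedelta(minutes=2), "host-a", "reachable"),
        (t0, "host-b", "unreachable"),
    ]
    for row in find_suspect_alerts(log):
        print(row)
```

The output lists only the alerts that recovered late or never recovered, which is the part worth attaching to a bug report.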

Sometimes I am logged in to the host at the very moment I receive an alert saying it is unreachable…

Is there something I am missing, a tunable that could prevent these alerts?

That's a known issue, I'm afraid. Nothing has been done to solve it for well over a year.

If you are not new to coding, you could open a PR on GitHub. Otherwise, as a regular user, there is not much you can do.

You could share some of the networking variables that you suspect might be involved; that might help start to build a picture of the issue. I, for example, never see the kind of issue that you are experiencing, with nodes on-prem and in Azure. My hosts are all in the UK South region. I know Netdata’s ingress isn’t globally distributed yet, so maybe there’s something in that?

You could also raise an issue on GitHub with some specific timestamps so that the Netdata SRE team can perhaps look at what’s going on.
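
If it helps with collecting those timestamps, here is a minimal sketch of a local probe that logs, in UTC, whether the agent answered. It assumes the agent listens on the default port 19999 and serves `/api/v1/info`; adjust `URL` if your setup differs. The 30-second interval is an arbitrary choice of mine. Lines logged as `up` at the same moment an "unreachable" alert fired would be concrete evidence that the agent itself was fine.

```python
import time
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=5.0):
    """Do one HTTP GET and return (utc_iso_timestamp, ok, detail)."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return stamp, resp.status == 200, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError, timeouts and refused connections are all
        # OSError subclasses, so every such failure is logged as "down".
        return stamp, False, str(exc)


if __name__ == "__main__":
    # Assumption: local agent on its default port; change as needed.
    URL = "http://localhost:19999/api/v1/info"
    INTERVAL_SECONDS = 30  # arbitrary polling interval
    while True:
        stamp, ok, detail = probe(URL)
        print(f"{stamp}\t{'up' if ok else 'DOWN'}\t{detail}", flush=True)
        time.sleep(INTERVAL_SECONDS)
```

Redirecting the output to a file gives a tab-separated log that can be lined up against the webhook times.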