Change unreachable timeout?

Problem/Question

Is it possible to change the timeout setting that determines when Netdata Cloud considers a node unreachable? I have a persistent issue where each day, sometimes multiple times a day, my entire fleet (~60 servers) is considered unreachable by Cloud, even though all of the servers are still online and accessible. In every case the alert clears after 1 minute and Cloud sees them as reachable again. If there’s a way to adjust this timeout value, that may help with my troubleshooting. Thanks.

This is a very interesting feature request. I expect you want this setting at a room level, not for each individual node, correct?

I hadn’t thought of it in that way. My initial thought was a setting in netdata.conf, something like:

cloud-timeout = 5min

which could be configured upon initial deployment of the agent. But I recognize that Cloud may not be able to interact with netdata.conf in that way. So a room-level setting would make sense, or even a global timeout in the Cloud GUI would be fine with me. I would not want to do this for each node.

In my experience with other monitoring platforms, I’ve been able to adjust the timeout period so I don’t get alerted to known events, like reboots. And in my original post, the unreachable events we experience daily may be due to networking issues which we have no control over. A longer timeout period would eliminate this noise. Thanks for considering.

Thanks for sharing this thought! It’s a good idea indeed!

Just to get some additional context, I assume that the use-case is IoT related? Perhaps a linux-powered board like Rpi?

Actually, the use case is just for traditional servers. The two scenarios I have in mind for an adjustable “unreachable” timeout period:

  1. Account for brief unplanned events like transient network issues that almost immediately resolve on their own. I suspect there is something happening at the network level, beyond our control, which causes Netdata to see all my hosts as unreachable and then reachable again 1 minute later.
  2. Account for known events like server reboots. Yes, virtual machines can reboot fast, but our physical hardware takes several minutes. In other monitoring platforms, we’ve configured the “server down” timeout to 7 minutes to avoid unnecessary alerts from reboots.
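
To make the two scenarios concrete, here is a minimal sketch of the grace-period logic being requested. It is not how Netdata Cloud works internally; the function name `alerts_after_grace` and the event format are hypothetical, chosen only for illustration. An outage raises an alert only if it lasts at least as long as the configured timeout.

```python
from datetime import datetime, timedelta

def alerts_after_grace(events, grace, now):
    """Return the start times of outages that lasted at least `grace`.

    `events` is a time-sorted list of (timestamp, reachable) tuples.
    This is a hypothetical illustration, not a Netdata API.
    """
    alerts = []
    down_since = None  # start of the current outage, if any
    for ts, reachable in events:
        if reachable:
            # Outage ended: alert only if it outlasted the grace period.
            if down_since is not None and ts - down_since >= grace:
                alerts.append(down_since)
            down_since = None
        elif down_since is None:
            down_since = ts
    # An outage that is still ongoing also counts once it exceeds the grace period.
    if down_since is not None and now - down_since >= grace:
        alerts.append(down_since)
    return alerts

t0 = datetime(2024, 1, 1, 0, 0)
events = [
    (t0, False),                          # 1-minute network blip
    (t0 + timedelta(minutes=1), True),
    (t0 + timedelta(minutes=10), False),  # 8-minute outage
    (t0 + timedelta(minutes=18), True),
]
result = alerts_after_grace(events, timedelta(minutes=7), t0 + timedelta(minutes=20))
```

With a 7-minute grace period, the 1-minute blip is ignored and only the 8-minute outage is reported.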

Thanks @Ryan_S_Di_Francesco for this! Our product team has taken note and will reach out if they have further questions.

Your feedback is actively making Netdata better!

I second this feature request for the same reasons. It would be really nice to be able to configure a custom timeout.

Any updates or workarounds for this? I am constantly waking up to over 30 email notifications about unreachable nodes. Those are edge IoT devices with a shitty connection, and I would love to set a higher timeout for them so that I only catch nodes that are really unreachable for a longer period of time.

Hi @bCyberBasti, sorry for the belated answer.
I’ve passed this request back to the product team, and this is something we are interested in implementing.
I’m afraid I don’t have an ETA to share with you, but I’ll keep you posted.

Any info on whether this will be added in the near future?

Any news on that feature? This would be absolutely great!
Right now we are spammed by “unreachable” messages every day, which really is a problem for us.
For now I have disabled “unreachable” notifications completely and implemented custom “heartbeat” checks for each and every server using https://healthchecks.io/
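
For anyone wanting to try the same workaround, a heartbeat client could look roughly like the sketch below. It assumes the hosted healthchecks.io ping endpoint at `https://hc-ping.com/<uuid>`, where appending `/fail` signals a failure. The check UUID and the function names `build_ping_url` and `send_heartbeat` are placeholders for illustration, not the setup described above. The grace period itself is configured per check on the healthchecks.io side, which is what stands in for the missing timeout in Cloud.

```python
import urllib.request

def build_ping_url(check_uuid, failed=False, base="https://hc-ping.com"):
    """Build the ping URL for a check; `check_uuid` is a placeholder value."""
    url = f"{base}/{check_uuid}"
    return url + "/fail" if failed else url

def send_heartbeat(check_uuid, failed=False, opener=urllib.request.urlopen, timeout=10):
    """Send one heartbeat and return True if the server answered with HTTP 200.

    `opener` is injectable so the function can be exercised without network access.
    """
    url = build_ping_url(check_uuid, failed)
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers urllib.error.URLError and socket timeouts; a missed ping
        # is simply reported as False and retried on the next scheduled run.
        return False
```

Each server would call `send_heartbeat("<your-check-uuid>")` from cron or a systemd timer, for example once a minute.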