Is it possible to change the timeout setting which determines when Netdata Cloud considers a node as unreachable? I have a persistent issue where each day, sometimes multiple times a day, my entire fleet (~60 servers) is considered unreachable by Cloud, even though all the nodes are still online and accessible. In every case the alert clears after 1 minute and Cloud sees them as reachable again. If there’s a way to adjust this timeout value, it may help with my troubleshooting. Thanks.
I hadn’t thought of it in that way. My initial thought was a setting in netdata.conf, something like:
cloud-timeout = 5min
which could be configured upon initial deployment of the agent. But I recognize that Cloud may not be able to interact with netdata.conf in that way. So a room-level setting would make sense, or even a global timeout in the Cloud GUI would be fine with me. I would not want to configure this for each node individually.
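Purely for illustration, here is how I imagine that hypothetical option could look in netdata.conf — the [cloud] section placement and the option name are just my guess at a syntax, not an existing setting:

```
# hypothetical example only -- no such option exists in netdata.conf today
[cloud]
    # how long Netdata Cloud waits before marking this node as unreachable
    cloud-timeout = 5min
```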
In my experience with other monitoring platforms, I’ve been able to adjust the timeout period so I don’t get alerted for known events, like reboots. And as I mentioned in my original post, the unreachable events we experience daily may be due to networking issues we have no control over. A longer timeout period would eliminate this noise. Thanks for considering.
Actually, the use case is just for traditional servers. The two scenarios I have in mind for an adjustable “unreachable” timeout period are:
Account for brief unplanned events like transient network issues that almost immediately resolve on their own. I suspect there is something happening at the network level, beyond our control, which causes Netdata to see all my hosts as unreachable and then reachable again 1 minute later.
Account for known events like server reboots. Yes, virtual machines can reboot fast, but our physical hardware takes several minutes. In other monitoring platforms, we’ve configured the “server down” timeout to 7 minutes to avoid unnecessary alerts from reboots.
Any updates or workarounds for this? I am constantly waking up to over 30 email notifications about unreachable nodes. Those are edge IoT devices with a shitty connection, but I would love to set a higher timeout for them so that only nodes that are genuinely unreachable for a longer period trigger an alert.
Hi @bCyberBasti , sorry for the belated answer.
I’ve passed this request back to the product team, and this is something we are interested in implementing.
I’m afraid I don’t have an ETA to share with you, but I’ll keep you posted.
Any news on that feature? This would be absolutely great!
Right now we are spammed with “unreachable” messages every day, which is a real problem for us.
I ended up disabling “unreachable” notifications completely for now and implemented custom “heartbeat” checks for each and every server using https://healthchecks.io/
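For anyone who wants to do something similar: the workaround is basically a cron job on each server that pings a per-server healthchecks.io check URL; if the pings stop arriving within the grace period configured there, you get alerted. A minimal sketch (the UUID is a placeholder for your own check’s ping URL, and the 5-minute interval is just what I use):

```
# /etc/cron.d/heartbeat -- ping healthchecks.io every 5 minutes
# <your-check-uuid> is a placeholder; replace with your own check's ping URL
*/5 * * * * root curl -fsS -m 10 --retry 3 https://hc-ping.com/<your-check-uuid> > /dev/null
```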
I would really like an “unreachable” timeout setting for Discord notifications. I schedule a reboot of my router once daily, and it takes about 3-4 minutes to come back up. This triggers Discord notifications that the servers are offline. I’m already aware the connection is down during that window; if I could adjust the timeout period, everything would be peachy.
I’m in the same situation, managing about 50 servers monitored by Netdata with alerts configured to notify us via Slack. However, we’re frequently receiving false alerts indicating that nodes are unreachable (and then reachable again after a few minutes). I’m looking for a way to eliminate these false alarms.
@Andrey_Arapov : The timeout for reachability notifications is 30 seconds. And this basically indicates one or more of the following scenarios:
Host is down or
Netdata Agent is down - this could also happen during nightly or stable release updates
Network between your host and Netdata Cloud is down for 30 seconds or more
You can disable the reachability notifications if you don’t intend to receive them. But we are working on a feature to allow users to configure timeouts for these notifications.
In my case, I have various scheduled updates and reboots of switches and routers (which evidently take more than 30 seconds), and Netdata inevitably generates a ton of false alarms.