Is it possible to change the timeout setting which determines when Netdata Cloud considers a node as unreachable? I have a persistent issue where each day, sometimes multiple times a day, my entire fleet (~60 servers) is considered unreachable by Cloud, even though all the nodes are still online and accessible. In every case the alert clears after 1 minute and Cloud sees them as reachable again. If there’s a way to adjust this timeout value, it may help with my troubleshooting. Thanks.
I hadn’t thought of it in that way. My initial thought was a setting in netdata.conf, something like:
cloud-timeout = 5min
which could be configured upon initial deployment of the agent. But I recognize that Cloud may not be able to interact with netdata.conf in that way. So a room-level setting would make sense, or even a global timeout in the Cloud GUI would be fine with me. I would not want to configure this for each node individually.
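Purely for illustration, here is how I imagine that hypothetical option could look in netdata.conf — the [cloud] section placement and the option name are just my guess at a syntax, not an existing setting:

```
# hypothetical example only -- no such option exists in netdata.conf today
[cloud]
    # how long Netdata Cloud waits before marking this node as unreachable
    cloud-timeout = 5min
```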
In my experience with other monitoring platforms, I’ve been able to adjust the timeout period so I don’t get alerted for known events, like reboots. And as I mentioned in my original post, the unreachable events we experience daily may be due to networking issues we have no control over. A longer timeout period would eliminate this noise. Thanks for considering.
Actually, the use case is just for traditional servers. The two scenarios I have in mind for an adjustable “unreachable” timeout period are:
Account for brief unplanned events like transient network issues that almost immediately resolve on their own. I suspect there is something happening at the network level, beyond our control, which causes Netdata to see all my hosts as unreachable and then reachable again 1 minute later.
Account for known events like server reboots. Yes, virtual machines can reboot fast, but our physical hardware takes several minutes. In other monitoring platforms, we’ve configured the “server down” timeout to 7 minutes to avoid unnecessary alerts from reboots.
Any updates or workarounds for this? I am constantly waking up to over 30 email notifications about unreachable nodes. Those are edge IoT devices with a shitty connection, but I would love to set a higher timeout for them so that only nodes that are genuinely unreachable for a longer period trigger an alert.
Hi @bCyberBasti , sorry for the belated answer.
I’ve passed this request back to the product team, and this is something we are interested in implementing.
I’m afraid I don’t have an ETA to share with you, but I’ll keep you posted.
Any news on that feature? This would be absolutely great!
Right now we are spammed with “unreachable” messages every day, which is a real problem for us.
I ended up disabling “unreachable” notifications completely for now and implemented custom “heartbeat” checks for each and every server using https://healthchecks.io/
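For anyone who wants to do something similar: the workaround is basically a cron job on each server that pings a per-server healthchecks.io check URL; if the pings stop arriving within the grace period configured there, you get alerted. A minimal sketch (the UUID is a placeholder for your own check’s ping URL, and the 5-minute interval is just what I use):

```
# /etc/cron.d/heartbeat -- ping healthchecks.io every 5 minutes
# <your-check-uuid> is a placeholder; replace with your own check's ping URL
*/5 * * * * root curl -fsS -m 10 --retry 3 https://hc-ping.com/<your-check-uuid> > /dev/null
```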
I would really like an “unreachable” timeout setting for Discord notifications. I schedule a reboot of my router once daily, and it takes about 3-4 minutes to come back up. This triggers Discord notifications that the servers are offline. I’m already aware the connection is down during that window; if I could adjust the timeout period, everything would be peachy.
I’m in the same situation, managing about 50 servers monitored by Netdata with alerts configured to notify us via Slack. However, we’re frequently receiving false alerts indicating that nodes are unreachable (and then reachable again after a few minutes). I’m looking for a way to eliminate these false alarms.
@Andrey_Arapov : The timeout for reachability notifications is 30 seconds. And this basically indicates one or more of the following scenarios:
Host is down or
Netdata Agent is down - this could also happen during nightly or stable release updates
Network between your host and Netdata Cloud is down for 30 seconds or more
You can disable the reachability notifications if you don’t intend to receive them. But we are working on a feature to allow users to configure timeouts for these notifications.
In my case, I have various scheduled updates and reboots of switches and routers (which evidently take more than 30 seconds), and Netdata inevitably generates a ton of false alarms.