Netdata Community

Revisit: "Unreachable hosts" alarm - Netdata Cloud

Hi

A repeat of: "Unreachable hosts" alarm - Netdata Cloud which has been closed for some reason.

My question:

As per the bottom half of this GitHub thread https://github.com/netdata/netdata/issues/9121, it was announced in early November that the feature (“Having notifications for unreachable hosts is a priority”) would be shipped within a few days.

Is there any update for this feature, as receiving an alarm (e.g. email), when a node goes down entirely would be a massive benefit, as otherwise I still need another monitoring tool for that.

To which @OdysLam replied, on 8th December:

We are actually doing the finishing touches on this one, so you can expect it reasonably soon (weeks). We are glad that you find it that useful!

Is there any further update to this? As per the above, I was expecting it within a few weeks of early December, but have not seen anything yet.

Hey,

We shipped Centralized Alarm notifications just a few weeks ago. (Documentation).

This feature includes email notifications of unreachable nodes.

In case you have tried Netdata Cloud and it doesn’t function as expected, I would love to help you get it going.

Cheers :slight_smile:

Yay, thanks! I saw the newsletter announcing centralised alarm notifications, but missed this ‘unreachable’ feature.

Just turned them on and tested, and it worked a treat.

Thanks

Hey @OdysLam

Is there a way to confirm this ‘unreachable’ alarm with a wait/delay time?

I’ve had a few red herrings in that the node becomes unreachable for a very short time, so I receive the email alert, but in reality, the server/site hasn’t gone down.

So being able to set the ‘unreachable’ alarm to send the email if it’s been unreachable for {x} minutes would be great.

Is this possible?

Thanks
Rob
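
As an illustration of the delay requested above, here is a minimal Python sketch of the debounce logic: an alert fires only once a node has been continuously unreachable for a configurable time. This is not a Netdata Cloud feature or API; the class name `ReachabilityDebouncer`, the `delay_seconds` parameter, and the `observe` method are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReachabilityDebouncer:
    """Hypothetical helper: alert only after `delay_seconds` of continuous unreachability."""

    delay_seconds: float
    unreachable_since: Optional[float] = None  # timestamp of the first failed sample
    alerted: bool = False                      # True once this outage has been reported

    def observe(self, reachable: bool, now: float) -> bool:
        """Record one reachability sample; return True if an alert should be sent now."""
        if reachable:
            # The node is back: reset, so short blips never produce an alert.
            self.unreachable_since = None
            self.alerted = False
            return False
        if self.unreachable_since is None:
            # First failed sample of a new outage.
            self.unreachable_since = now
        if not self.alerted and now - self.unreachable_since >= self.delay_seconds:
            # The outage has lasted long enough; report it exactly once.
            self.alerted = True
            return True
        return False
```

With `delay_seconds=300`, a node that drops for 30 seconds and recovers produces no alert, while one that stays down for five minutes produces exactly one.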

Hey @RunRiot ,

I am glad that you like it. Although I am pretty sure that this is not an option currently, let me ping our product folks and see what they think about this. It’s an interesting idea.

cc @Manos_Saratsis @Thanos_Karachalios @dimi

@RunRiot thank you for the feedback. I added it as a comment to an existing bug we have about the correctness of the unreachable state: "Incorrect reachability status" (Issue #3, netdata/netdata-cloud on GitHub).

I don’t know if anyone else has this issue too, but ever since the reachability issues, I’ve had orphaned alarms on the cloud platform: i.e. alerts which are not active on the local agent but appear in the cloud dashboard with several hundred hours active time.

@OdysLam @Manos_Saratsis

Thanks. The alerts are currently far too sensitive; I had numerous false alarms over the weekend, to the extent that I’m considering disabling them entirely.

Please please allow for a wait / delay setting. All other ‘down time’ checkers out there have this feature, and it’s essential.

Is it something likely to be added in the near future?

Hey @Luis_Johnstone,

This is a known bug that we are working towards fixing. We understand that it’s confusing, so we have set a high priority!

Thanks @RunRiot for that feedback. The @netdata-product team is aware and we take your comments very seriously.

They will use this thread to communicate further. Thank you for your patience and please keep the feedback coming. Not only are we listening, but your feedback is essential for us.

Best,
Odysseas

Thanks for the feedback. I can only add that the issue is occurring very often. We have seen several additional cases after the one cited in this thread. It’s probably the most painful aspect of Netdata so far. The only workaround we have found so far is claiming the node again.

@OdysLam I am getting the email when the server is unreachable, but not the Slack and Telegram notifications. They work for CPU and RAM.

Is this a feature yet to be implemented for Slack or Telegram, or are there any settings I have to configure?

Is there a way to disable the reachability alarms? The only thing I see that seems to be related is in fping.conf. I have hosts that are periodically off the network and don’t need alarms for them.

Hey @sandy and welcome to our community. I am so sorry to respond so late. We are still working on more options for the centralized alarms feature. For now, only email alarms are available.

You can still configure your agent to send alarms with the existing functionality (e.g. Telegram), but you will have to do this on a per-agent basis.

@lordpengwin welcome to our community too :slight_smile: At this moment, this option does not allow for any customizability. We are working towards giving more alarms and options.

That’s an interesting use case though, I think, @netdata-product: disabling the reachability alarms for some nodes, as they are expected to go offline for a period of time. This is also highly relevant to the domain of IoT.

Unreachable agent alarms are triggered when an agent’s connection to the cloud is lost. I think there are two feature requests that can come from this idea, in addition to adding Slack and Telegram notifications to the cloud:

  • Trigger an alarm on the agent, when its connection to the cloud is lost.
  • Trigger an alarm on a parent agent, when a child node is disconnected from it.

If you think either of these is a good idea, you could create relevant feature requests in the forum.

The difficulties with both of these are:

  • The agents currently only create alarms based on metrics, not other conditions.
  • The triggering logic may suffer from the same issues we have on the cloud. When is an agent disconnection a normal situation based on a specific infrastructure? An option to completely disable reachability alarms is obviously one solution to this.

Overall, as we are currently fully refactoring how alarms work on the cloud, in order to make them more reliable, it will take a couple of months until we add new features. Some workarounds that you may consider, based on your use case, are the following:

  • Stream your metrics to a parent that is the only node running health checks. This way, you’d only have a single place where you could disable or modify a particularly problematic alarm. This doesn’t help with the reachability alarms, though.
  • Create some email filters, so that e.g. reachability alarms get filtered out, if they aren’t useful to you.
  • Disable cloud alarm notifications and instead configure notifications on each agent (you lose reachability alarms).
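
As a rough illustration of the email-filter workaround listed above, the following Python sketch separates reachability notifications from other alarm emails by looking at the subject line. The keywords are an assumption, not the documented subject format of Netdata Cloud emails, so check a real notification before relying on them; the function names are hypothetical as well.

```python
from email.message import EmailMessage

# Assumption: these keywords are guesses at what a reachability email's subject contains.
REACHABILITY_KEYWORDS = ("unreachable", "reachability")


def is_reachability_alarm(msg: EmailMessage, keywords=REACHABILITY_KEYWORDS) -> bool:
    """Return True if the message subject contains any keyword (case-insensitive)."""
    subject = (msg.get("Subject") or "").lower()
    return any(keyword in subject for keyword in keywords)


def split_inbox(messages):
    """Split messages into (kept, filtered_out) lists, preserving their order."""
    kept, filtered_out = [], []
    for msg in messages:
        if is_reachability_alarm(msg):
            filtered_out.append(msg)
        else:
            kept.append(msg)
    return kept, filtered_out
```

Most mail clients can express the same rule without code, as a subject-contains filter that moves or deletes matching messages.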

I’m having the exact opposite issue as most folks, where my reachability emails are not happening at all. Is there really no place to adjust settings for these emails? Getting notified that infrastructure is completely offline is far more important than knowing that /dev/sdc is 5% more active than its median!

I specifically just spun up a Linux VM, installed Netdata, claimed the node, waited for graphs to appear, then turned the VM off. It’s been 20 minutes with no email. I’m assuming 20 minutes is long enough that the “reachability email” should have been fired off?

I DO have email notifications turned ON in Manage Space → Notifications.
In my profile I have alarms turned on for “All alarms and unreachability”