I’ve used Netdata for a while, but am just starting with cloud. I’m not receiving any alerts for nodes that are down/unreachable. I’ve tried shutting down the instances to test, but Netdata never alerts me that the node is unreachable. I’ve received other alerts, so I believe things are properly configured. I’ve read that this feature is now generally available, but I’m not sure what I’m doing wrong. Thanks
Hey,
We know that it’s a bit convoluted, but as it is shown in the tooltip, you can only select context, not family.
Context is the group of charts (e.g net.net), while the family is the particular drive or network card.
I am pinging our product team @netdata-product, since this is tangentially a feature request.
Hi @OdysLam
Thanks for the reply. I’m not sure I understand what I should be doing here—if the node were unreachable, no chart would be available to trigger an alarm, right?
I have this same issue. I have All alerts and unreachable configured for my user profile in Netdata cloud and I am receiving other warning/critical alerts just fine, but when a node enters an “unreachable” state I get no notification at all.
Node that I am running netdata on (and turning off to make unreachable) is running netdata v1.31.0-222-nightly.
@Mbrantley reported in https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/17 he has the same issue. We’re investigating and will reply here.
Hello @spiffytheseal could you please share with me at gparaskakis@netdata.cloud
- Your email address in the cloud.
- Your nodeId.
so we can further investigate the issue.
The issue I described in my comment stopped shortly after I made my post 3 months ago. I have been successfully getting unreachable emails for months, so I don’t think I have anything to submit to y’all. I will keep an eye out for this and submit any evidence should it resurface.
The cause of this, as far as we can tell, is that we occasionally don’t receive an important message from VerneMQ (our MQTT server) when an agent disconnects. We call that message the Last Will and Testament (LWT), and it’s the signal to flag the agent and the nodes on that agent as offline. It’s also what triggers the notification. We’ve been working hard in the past few months to eliminate lost messages completely, but there are still some edge cases where it may happen. We have a cron job that reconciles the states, but it doesn’t currently trigger the notifications. Making it do so is a quick addition we’ll make, to ensure that the notification does go out eventually, even if it takes some minutes (until the job runs). Of course we won’t stop working to handle the edge cases as well.
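To make the planned fix concrete, here is a minimal sketch of a reconciliation pass that also sends notifications. It is not Netdata Cloud's actual code: the `Node` class, the `recorded_online` field, the `reconcile` function, and the `notify` callback are all hypothetical names chosen for illustration. It assumes the job can obtain the set of node IDs the broker currently sees as connected, and it notifies whenever the recorded state disagrees with that set (for example, because an LWT message was lost).

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    recorded_online: bool  # hypothetical: the state the cloud currently has on record


def reconcile(nodes, connected_ids, notify):
    """Align each node's recorded state with the broker's view.

    Calls notify(node_id, state) for every node whose state changed,
    so a lost LWT message still results in a (delayed) notification.
    Returns the IDs of the nodes that were corrected.
    """
    changed = []
    for node in nodes:
        actually_online = node.node_id in connected_ids
        if node.recorded_online != actually_online:
            node.recorded_online = actually_online
            notify(node.node_id, "reachable" if actually_online else "unreachable")
            changed.append(node.node_id)
    return changed


# Example: node-b disconnected, but its LWT message never arrived,
# so the cloud still has it recorded as online.
sent = []
nodes = [Node("node-a", True), Node("node-b", True)]
changed = reconcile(nodes, {"node-a"}, lambda nid, state: sent.append((nid, state)))

# A second pass finds nothing left to fix, so no duplicate notification is sent.
changed_again = reconcile(nodes, {"node-a"}, lambda nid, state: sent.append((nid, state)))
```

Because the job only notifies on a state change, running it repeatedly does not resend the same notification. The trade-off is latency: the notification goes out only when the next pass runs.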
If you face this issue, then you will very rarely miss a reachability notification. For example, it can’t be happening for the same node all the time. You will also be receiving other alerts just fine. If these two don’t apply to your case, then we need to look at it separately.
Overall, it doesn’t look like we had a significant change in the number of reachability notifications sent, as the chart below shows, so the issue has affected a few unlucky users. But it’s a very important feature that people can learn to rely on, so we won’t stop until it’s quite reliable.
I am also experiencing no reachability alerts from Netdata Cloud. Happy to provide more details if it’s helpful.
Yes, please provide more info. You’re sure you have enabled them, right? That is, in your profile, for the specific space, you have enabled both alerts and reachability notifications?
Ah, thank you for the quick response - that screenshot resolved things for me. I was only looking at the space-level notifications configuration and not my individual profile notifications:
I didn’t see this in the documentation when I was searching earlier, though now I see there is a section in the middle of this page with more information: Alert notifications | Learn Netdata
I think the reason I didn’t find it earlier is that I had mismatched expectations of how alerts are configured in Cloud. My initial instinct was to try to configure global notification settings so that others on our team don’t need to personalize their own alerts or think about the Netdata config; they can just consume the output of the alerts and triage and resolve issues. Basically, a notification stream/channel that can be sent to an email list/Slack channel/PagerDuty rotation. I believe this behavior is possible on a per-node basis, but it seems like for now it’s not possible at the cloud/war room level.
It sounds like the team has some upcoming changes planned around this, so will stay tuned: Unreachable Netdata Cloud alerts to Slack/similar? - #5 by bo-d