I’ve used Netdata for a while, but am just starting with cloud. I’m not receiving any alerts for nodes that are down/unreachable. I’ve tried shutting down the instances to test, but Netdata never alerts me that the node is unreachable. I’ve received other alerts, so I believe things are properly configured. I’ve read that this feature is now generally available, but I’m not sure what I’m doing wrong. Thanks
We know that it’s a bit convoluted, but as it is shown in the tooltip, you can only select context, not family.
Context is the group of charts (e.g. net.net), while the family is the particular drive or network card.
I am pinging our product team @netdata-product, since this is tangentially a feature request.
Thanks for the reply. I’m not sure I understand what I should be doing here—if the node were unreachable, no chart would be available to trigger an alarm, right?
I have this same issue. I have All alerts and unreachable configured for my user profile in Netdata cloud and I am receiving other warning/critical alerts just fine, but when a node enters an “unreachable” state I get no notification at all.
The node that I am running Netdata on (and turning off to make unreachable) is running netdata v1.31.0-222-nightly.
@Mbrantley reported in Revisit: "Unreachable hosts" alarm - Netdata Cloud - #17 by Mbrantley that he has the same issue. We’re investigating and will reply here.
Hello @spiffytheseal, could you please share the following with me at firstname.lastname@example.org:
- Your email address in the cloud.
- Your nodeId.
so we can further investigate the issue.
The issue I described in my comment stopped shortly after I made my post 3 months ago. I have been successfully getting unreachable emails for months, so I don’t think I have anything to submit to y’all. I will keep an eye out for this and submit any evidence should it resurface.
The cause of this, as far as we can tell, is that we occasionally don’t receive an important message from VerneMQ (our MQTT server) when an agent disconnects. We call that message the Last Will and Testament (LWT), and it’s the signal to flag the agent and the nodes on that agent as offline. It’s also what triggers the notification. We’ve been working hard in the past few months to eliminate lost messages completely, but there are still some edge cases where it may happen. We have a cron job that reconciles the states, but it doesn’t currently trigger the notifications. That is a quick addition we’ll make, to ensure that the notification does go out eventually, even if it takes some minutes (until the job runs). Of course, we won’t stop working to handle the edge cases as well.
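To make the two paths concrete, here is a minimal Python sketch of the idea, not our actual code. Every name in it (`Node`, `on_lwt`, `reconcile`, the `timeout` parameter, the `notify` callback) is hypothetical, and it assumes the reconciliation job decides a node is offline by comparing a last-seen timestamp against a timeout, which is only one possible way to do it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a node as the cloud side might track it.
    node_id: str
    last_seen: float       # time (in seconds) of the last message from the agent
    state: str = "online"  # "online" or "offline"


def on_lwt(node, notify):
    """Normal path: the LWT arrives, so flag the node and notify right away."""
    if node.state != "offline":
        node.state = "offline"
        notify(node.node_id)


def reconcile(nodes, now, timeout, notify):
    """Fallback path: periodic job that catches nodes whose LWT was lost.

    Assumption: a node counts as offline once it has been silent for longer
    than `timeout` seconds. Only nodes still marked "online" are touched, so
    a node already handled by on_lwt() is not notified a second time.
    """
    changed = []
    for node in nodes:
        if node.state == "online" and now - node.last_seen > timeout:
            node.state = "offline"
            notify(node.node_id)
            changed.append(node.node_id)
    return changed
```

Because the job only acts on nodes that are still marked online, running it repeatedly should not produce duplicate notifications; the cost is that a notification for a lost LWT is delayed until the next run.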
If this is the issue you are facing, you will miss a reachability notification only very rarely. For example, it can’t be happening for the same node all the time. You will also be receiving other alerts just fine. If these two don’t apply to your case, then we need to look at it separately.
Overall, it doesn’t look like we had a significant change in the number of reachability notifications sent, as the chart below shows, so the issue has affected a few unlucky users. But it’s a very important feature that people can learn to rely on, so we won’t stop until it’s quite reliable.