This morning I started getting alerts for every node on my network saying that they are unreachable, and then a minute or two later that they are reachable again. Since it affects every node, I would assume that my internet connection is having problems, but I see no sign of this. I can go to any website, including netdata.cloud, I can ping sites, internet speed tests look good, and so on.
I tried restarting Netdata on all my nodes and noticed that it had been running for a month, almost to the hour. I don’t know if this is related.
Is there any way to debug why Netdata is reporting unreachability, or to tune the parameters that it uses to decide?
Yes, sorry, I use Netdata Cloud. I have a variety of nodes registered: a couple of Raspberry Pis, an Intel Linux VM, and a Docker container running on my NAS. All of them started reporting reachability problems around 7am EDT this morning. I get the alerts via email.
I am so sorry for this flood of emails. We are having intermittent issues with our Netdata Cloud backend, which loses connection with the Netdata Agents and thus believes that they are out of reach.
We have https://status.netdata.cloud, but it’s not logged there because it’s not an incident. It’s a known bug that we are working to root-cause and fix.
In fact, we are reworking a lot of our backend, so we expect overall stability and polish to improve in the coming months.
Finally, I just got a notice that we did some restarts of our backend services, so this might be related to that too.
For anyone reading this, we have integrated our status page into this very forum, so in case of an incident, a popup will appear that will inform you about it.
So, no worries: by visiting our community, you can also be sure that there is no ongoing incident that you are not aware of.