A lot of old Alerts without any Value

Hi there,

for us the Alert Overview of the cloud interface is filled with alerts that happened a long time ago and don’t report any data. Is this intentional?

hey @Slind14,

Thanks for reporting this. I believe this is a bug, as there are a couple of GH issues that seem relevant. I will ping the Netdata Cloud team to verify and share the appropriate bug issue so you can track the progress (if it is indeed a bug).

Welcome back to the family :slight_smile:

Cheers!

Hey @Slind14, thanks for reaching out to us.

This is related to a known issue: Incorrect active alarms in Cloud · Issue #4 · netdata/netdata-cloud · GitHub. We are working on a solution; you can follow the issue on GitHub for updates.

A workaround for now to clean up the old/stale alarms is to restart your agent.

@Leonidas_Vrachnis well, I’ve been experiencing this issue for a few months now and have also commented on the GitHub issue. The problem is that your suggested workaround unfortunately doesn’t help. Restarting the agent still keeps the alerts visible in the cloud, but they disappear from the agent API if I, e.g., query it with curl http://pcweb1:19999/api/v1/alarms
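The curl check above can also be scripted, which makes it easier to repeat after each agent restart. This is only a minimal sketch: it assumes the /api/v1/alarms response is a JSON object with an "alarms" map keyed by alarm name, where each entry carries a "status" field such as WARNING, CRITICAL or CLEAR. The function names are made up for illustration and are not part of Netdata.

```python
import json
from urllib.request import urlopen


def active_alarm_names(payload):
    """Return the sorted names of alarms whose status is WARNING or CRITICAL.

    Assumes `payload` looks like {"alarms": {name: {"status": ...}, ...}}.
    """
    alarms = payload.get("alarms", {})
    return sorted(
        name
        for name, alarm in alarms.items()
        if alarm.get("status") in ("WARNING", "CRITICAL")
    )


def fetch_alarms(base_url="http://localhost:19999", timeout=5):
    """Fetch the agent's alarm list from /api/v1/alarms and decode the JSON."""
    with urlopen(f"{base_url}/api/v1/alarms", timeout=timeout) as response:
        return json.load(response)
```

For the host in this post, active_alarm_names(fetch_alarms("http://pcweb1:19999")) would list what the agent itself still considers raised. An empty list while the cloud still shows alerts matches the behaviour described here.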

This has been ongoing for us for many months as well.

This doesn’t seem to work.
Old alerts even persist after re-installation of the agent.

I found this while trying to find a way to clear defunct alarms. I see there hasn’t been any activity in quite a while, but it appears that I am having the same issue. As far as the agents are concerned, the alarms don’t exist. Trying to click and follow them in the cloud interface leads to a page stating that the chart/metric doesn’t exist (although the metrics do exist and are continuing to update).

In my case, I have suddenly (over the past couple of weeks) been getting constant floods of anomaly alarms that will almost immediately clear. They vary as to the metric they are alarming on. Usually ip, ram, hardirq etc.

The problem is that at one point while there were 60 or so alarms in the cloud, I was performing maintenance on all of the children (15 or so) as well as the parent. During this time, all the agents were updated and the servers were restarted.

The restarts were for resource increases and I think that briefly there was high RAM usage and Netdata may have been killed…unfortunately this may have also happened while it was temporarily unable to write to the disk as well. The reboots should’ve been clean and it appeared that way. Netdata seemed to start normally on the parent and children but ever since they came back up the cloud interface is still persisting the alerts (albeit with no actual values listed for the metric in the alerts tab list) while the children and parent nodes show no alarms and seem to be otherwise running normally…

Is there any news on this or way to clear the alarms out of the cloud interface?

Hi @DevNull , sorry for the late reply and welcome!

We’re really interested in this case. It’s an ongoing effort to get the alerts to be consistent in agent & cloud.

A situation where the agent could not write to the disk could lead to such cases. Do you remember if during that time you were running a version lower than 1.35?

So, you have a setup of a parent and ~15 children? Is only the parent claimed to the cloud? Can you please share the output of http://localhost:19999/api/v1/info of the parent to manolis at netdata.cloud?

We will check the current state and clear the alerts that no longer are active.

Thanks a lot!

Hi Manolis!

It was fairly recent. Looking at the dpkg logs it appears the upgrade from 1.34 → 1.35 was on 06/14. I’ll look and see if I can find/remember exactly when I noticed the alerts not clearing. Also, not sure if it is relevant or not but shortly before this issue started there were a few days where there were intermittent dbengine.flush alerts (that cleared almost immediately).

They are all claimed to the cloud.

Sent.

Thank you!

-Ronan

Hi @DevNull !

Can you check please what the current state of alerts is on the cloud? Thanks!

Hey Manolis!

Sorry it took so long to get back. Got lost in the email pile lol. It looks like whatever you did cleared the stuck ones. Thank you!!!

Now it should be easier for me to get in there and see if I can figure out what the cause is of all of these anomalies that were stuck in there :slight_smile:

Thanks!

-Ronan

Hey Manolis!

Looks like it’s doing it again. Lots of anomaly alarms being triggered and then being cleared locally but still showing in cloud. I can see in the logs where the “CLEAR” is sent. The parent seems to clear them fine and the alerts are removed from the alerts tab as expected.

They list a “Triggered Value”, are completely blank for “Latest Updated” and “-” for “Latest Value”.

Thanks,
-Ronan
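The symptom described above (alerts cleared on the agent but still listed in the cloud) can be narrowed down with a simple set difference. This is a rough sketch under stated assumptions: the cloud alert names are copied by hand from the alerts tab (no cloud API is assumed here), they are assumed to match the keys of the agent's "alarms" map, and stuck_in_cloud is a hypothetical helper, not a Netdata function.

```python
def stuck_in_cloud(cloud_alert_names, agent_payload):
    """Return cloud-listed alert names that the agent no longer reports as raised.

    `agent_payload` is assumed to be the decoded /api/v1/alarms response:
    {"alarms": {name: {"status": ...}, ...}}.
    """
    agent_active = {
        name
        for name, alarm in agent_payload.get("alarms", {}).items()
        if alarm.get("status") in ("WARNING", "CRITICAL")
    }
    # Anything the cloud shows that the agent does not consider raised is a candidate.
    return sorted(set(cloud_alert_names) - agent_active)
```

Anything this returns is only a candidate for a stuck alert. It is worth confirming in the agent log that the CLEAR was actually sent before reporting it.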

Hi @DevNull ,

Sorry to hear that this is happening again. We are currently doing some refactoring work in this area that we hope will solve this for good. You can follow this bug, which we will close once that work is completed: [BUG]: agent and cloud have out of sync alarm log · Issue #330 · netdata/netdata-cloud · GitHub

In the meantime, if you provide us with your spaceID, we will be able to force the syncing. Please share it with hugo@netdata.cloud

Hi Ronan! Sorry it took a while to get back.

One thing to check: Can you please send a screenshot from the alert page, then another one when you click on one of those alerts, and go down to the part “Instance Values - Node Instances” ?

What I’d like to check is what node these alerts refer to and what node raised them. Since both your parent and children are claimed, these alerts can come from either the parent or a child, so I’d like to check that.

What would give more info is your “netdata-meta.db” under the /var/cache/netdata directory, but I’d need it from close to when this problem occurs, since data in there will be rotated… In any case, is it possible to share it (along with any netdata-meta.db-shm & netdata-meta.db-wal if they exist) with manolis@netdata.cloud? If it’s been just a few days there might be something useful there.

Thanks a lot!
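Collecting the three files named above can be sketched as follows. It simply archives whichever of netdata-meta.db, netdata-meta.db-shm and netdata-meta.db-wal exist in a given directory. The helper name bundle_meta_db and the archive name are assumptions for illustration. Copying a database while the agent is writing to it may give an inconsistent snapshot, so stopping the agent first is the safer assumption.

```python
import tarfile
from pathlib import Path

# File names as given in the post; -shm and -wal may or may not exist.
DB_NAMES = ("netdata-meta.db", "netdata-meta.db-shm", "netdata-meta.db-wal")


def bundle_meta_db(cache_dir, archive_path):
    """Archive the metadata DB files found in cache_dir; return their names."""
    cache_dir = Path(cache_dir)
    found = [cache_dir / name for name in DB_NAMES if (cache_dir / name).is_file()]
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in found:
            # Store only the file name so the archive has no absolute paths.
            archive.add(path, arcname=path.name)
    return [path.name for path in found]
```

For example, bundle_meta_db("/var/cache/netdata", "netdata-meta.tar.gz") would produce a single file to attach to the email, assuming the process has read access to that directory.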

Hi! We’ve discovered a case where we could get into this situation. It involves alerts that are cleared/raised right before the agent stops, so the agent has no time to send them to the cloud. I’m not sure this is the case you’re experiencing, but we should have a fix for this in the next couple of days and can see if it helps. We will clear your nodes again and see from there. Of course, the work @hugo mentioned will be done as well.

Thanks!

Manolis,

Since you guys cleared the alerts the other day, none of them have stuck in there again.

As far as what the alerts were, 100% of them were anomalies and they would be almost immediately cleared.

The children were sending the alert and the CLEAR to both the parent and cloud. The parent was also sending the alert and CLEAR to the cloud. I would see them in the logs, but the CLEAR usually didn’t seem to actually clear on the cloud side.

How often is the data rotated in netdata-meta.db*?
Do you need copies from all the children and the parent?

Maybe there is still some useful info in there…Or possibly I could retrieve them for you from a previous filesystem snapshot.

-Ronan

Hi Ronan!

Sorry, I didn’t get it, so there are no stuck alarms in the cloud now?

We’re in the process of issuing the fix that I described above. Then, if you still have stuck alerts, we’ll clear them and continue monitoring how it goes. The fix should handle such cases where alerts rapidly change status.

I’ll let you know about the db files. Since the children and the parent all report alerts to the cloud, it could be tricky to find something. The db files are rotated approximately every 2000 entries.

I’ll let you know when we have the fix in production (it is on the cloud side), and we’ll keep an eye on it. Thanks again!!

Hi Ronan!

We’ve cleared your list, and implemented a fix for the case I mentioned a few posts above.

Do let us know how it goes, will also let you know when we implement the big change in protocol for alerts.

Thanks!

Manolis,

Thanks for the update! So far the alerts have been operating as expected since you guys cleared them a little over a week ago.

I’ll let you know if there are any issues.

Thanks again guys!

-Ronan

Hi guys,

Not sure if you all are still having to manually clear out those alerts or not. (From time to time they do get a little crazy but seem to clear up)

Just wanted to see if anything that was done or is being done to that space in cloud would cause this new, really weird issue I’m having: in some areas of the cloud, characters appear to be missing/omitted from the node name and chartid (in the alarms tab), as well as from the node name in the individual node tabs.

I know it’s not a lot of info and doesn’t make much sense. I tried to explain it better here…

https://community.netdata.cloud/t/missing-letters-in-node-and-chart-id-in-alerts-tab-as-well-as-in-the-tab-titles-when-title-is-a-node-in-netdata-cloud/3670

Thanks,

-Ronan