I have an entropy alarm that is over 850 hrs old. I fixed this ages ago. How long does it take until that alarm goes away and I never have to worry about it again?
Is there no way in Netdata Cloud to acknowledge an alarm and clear it?
Hey @Greg_Munro and welcome to our forums!
This is a known bug and we are working towards fixing it. TBH we are currently reworking a lot of functionality on the backend, so this is why it has taken time.
Incident management (silencing alarms and acknowledging incidents) is on our roadmap, but I can't share any timeline for its implementation. Perhaps the @netdata-product team can share more details.
If you have any other questions, please feel free to share. We are here to help!
@Greg_Munro Thanks for using Netdata. We are aware of this issue and we will address it. There is an open bug on GitHub.
@Manos_Saratsis Is there any update? I currently have an alarm stuck for 25 days.
@Grboy: Do you still see this issue? Can you share more details, such as your Agent version?
Yes, I still have this issue. The alarm has been stuck for 28 days. My Netdata Agent version is v1.33.1.
Thank you for the information. Can you confirm whether you have protobuf enabled on your Agent and have migrated to the new architecture?
Can you please send us the output of this command:
/usr/sbin/netdatacli aclk-state
Can you confirm if you have protobuf enabled on your agent and have migrated to the new architecture?
How can I check that?
Btw, I have sent the command output to your DM.
You are indeed on the new architecture, and this is most likely a case of old alerts getting stuck during the migration.
I will ask my team to look at this. Can you also confirm whether the same alert appears on your Agent at localhost:19999 (or at whichever address you have configured for it)?
We are fixing these stuck-alert issues in various forms, and this related issue should be fixed soon: [BUG] Stuck alerts in Netdata Cloud for agents that are no longer live. · Issue #217 · netdata/netdata-cloud · GitHub
Thanks for your feedback!
These alerts are not visible on my Agent at localhost:19999.
Thanks again for the confirmation. This issue is being worked on and should be solved soon.
Sorry to chime in on a closed issue, but I found this while looking for a way to clear defunct alarms. I see there hasn't been any activity in quite a while, but it appears that I am having the same issue. As far as the Agents are concerned, the alarms don't exist. Clicking through to them in the Cloud interface leads to a page stating that the chart/metric doesn't exist (although the metrics do exist and are still updating).
In my case, over the past couple of weeks I have suddenly been getting constant floods of anomaly alarms that clear almost immediately. The metric they alarm on varies, usually ip, ram, hardirq, etc.
The problem is that at one point, while there were 60 or so alarms in the Cloud, I was performing maintenance on all of the children (15 or so) as well as the parent. During this time, all the Agents were updated and the servers were restarted.
The restarts were for resource increases, and I think RAM usage was briefly high and Netdata may have been killed… unfortunately, this may also have happened while it was temporarily unable to write to disk. The reboots should have been clean, and they appeared to be. Netdata seemed to start normally on the parent and children, but ever since they came back up, the Cloud interface has kept showing the alerts (albeit with no actual values listed for the metric in the Alerts tab list), while the child and parent nodes show no alarms and otherwise seem to be running normally…
Is there any news on this, or a way to clear the alarms out of the Cloud interface? Restarting the Agents doesn't seem to do anything except change the "Triggered" time to when the Agent was restarted.