I’m trying to coordinate silencing/disabling specific alerts during maintenance. Following the documentation (Health API Calls | Learn Netdata), the silencing/disabling part seems to be working fine; however, I don’t see a way to reset a specific selector, only all alerts with the RESET command. Is it not possible to RESET only a specific alert/selector?
Relevant docs you followed/actions you took to solve the issue
I am starting multiple maintenance jobs across my infrastructure, each of which silences its own http/port checks. The goal is that once a specific maintenance job has completed, it removes its own silencing/disabling selector, but NOT any other silencing/disabling selector that would impact other maintenance jobs.
I am able to silence alarms using the SILENCE command. However, it needs to be done by chart name. I could not do the same based on alarm name.
Here is a snippet from my Python code: subprocess.run(["/usr/bin/curl", "-s", f"http://localhost:19999/api/v1/manage/health?cmd=SILENCE&chart={chart_name}&hosts={nodemac}", "-H", auth])
To unsilence it: subprocess.run(["/usr/bin/curl", "-s", f"http://localhost:19999/api/v1/manage/health?cmd=RESET&chart={chart_name}&hosts={nodemac}", "-H", auth])
The hosts={nodemac} portion is optional; you may or may not need it.
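For anyone who would rather not shell out to curl, here is a minimal sketch of the same two calls using only the Python standard library. The helper names health_url and call_health_api are hypothetical, and the headers dict is an assumption: it stands in for whatever the auth variable holds in the snippets above, so pass the same header name and token you already use.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the snippets above; adjust host/port for your agent.
BASE_URL = "http://localhost:19999/api/v1/manage/health"


def health_url(cmd, chart=None, hosts=None, base_url=BASE_URL):
    """Build a health management URL; chart and hosts are optional selectors."""
    params = {"cmd": cmd}
    if chart:
        params["chart"] = chart
    if hosts:
        params["hosts"] = hosts
    return f"{base_url}?{urlencode(params)}"


def call_health_api(url, headers, timeout=5):
    """Send the request and return the response body as text.

    `headers` is a dict supplied by the caller (assumption: it carries the
    same authorization header that `auth` carries in the curl version).
    """
    req = Request(url, headers=headers)
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()
```

Usage would look like call_health_api(health_url("SILENCE", chart=chart_name, hosts=nodemac), headers) to silence, and the same call with "RESET" to unsilence. Building the query with urlencode also takes care of escaping any special characters in chart or host names.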
That did help, as it got me looking closer at my selectors, and I think I now have something that works for the most part, but I’m still running into the issue of vestigial selectors remaining. Specifically, after running cmd=RESET&chart=some_http_or_port_check&hosts=some_hostname and then cmd=LIST, I’ve found that the type is now None, but the silencer remains. In my testing, if any other job then adds a silencer, this vestigial selector gets picked up again, and if that service goes down there will be no notification. I think that’s why performing the reset produces the warning message WARNING: Added alarm selector to silence/disable alarms without a SILENCE or DISABLE command.
Honestly, I think this is a bug; unless I’m missing something?
Further testing shows that if two jobs start at the same time and one runs longer than the other, then as soon as the shorter job finishes, its reset clears the silencing for everybody, triggering alerts that were intended to be silenced.
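One possible client-side workaround, sketched below, is to track every active selector yourself and, whenever a job finishes, issue a full RESET followed by a re-SILENCE of whatever is still active. This is only a sketch under assumptions: the SilenceRegistry class and its send callable are hypothetical, it assumes all maintenance jobs go through one shared coordinator process, and there is a brief window between the RESET and the re-applied selectors during which an alert could fire.

```python
import threading


class SilenceRegistry:
    """Tracks active silencers so one job finishing does not unsilence others.

    `send` is a caller-supplied function that takes a query string such as
    "cmd=SILENCE&chart=foo" and performs the actual API call (hypothetical).
    """

    def __init__(self, send):
        self._send = send
        self._active = {}  # job_id -> (chart, hosts)
        self._lock = threading.Lock()

    def _apply(self, chart, hosts):
        # Same query shape as the curl examples earlier in the thread.
        query = f"cmd=SILENCE&chart={chart}"
        if hosts:
            query += f"&hosts={hosts}"
        self._send(query)

    def start(self, job_id, chart, hosts=None):
        """Register a job's selector and silence it."""
        with self._lock:
            self._active[job_id] = (chart, hosts)
            self._apply(chart, hosts)

    def finish(self, job_id):
        """Drop a job's selector: RESET everything, then re-add the rest."""
        with self._lock:
            self._active.pop(job_id, None)
            self._send("cmd=RESET")
            for chart, hosts in self._active.values():
                self._apply(chart, hosts)
```

With this in place, the shorter job calling finish() still triggers a global RESET, but the longer job's selector is put back immediately afterwards. If the jobs run on different machines, the active set would need to live in some shared store instead of in process memory.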
Upon further reflection, I don’t think it’s a bug, but rather a feature request. Is there a way to move this post to a better place, or should I re-post it somewhere else?