health api - reset specific silenced/disabled alerts

Problem/Question

I’m trying to coordinate silencing/disabling specific alerts during maintenance. Following the documentation (Health API Calls | Learn Netdata), the silencing/disabling part seems to work fine; however, I don’t see a way to reset a specific selector, only all alerts at once with the RESET cmd. Is it not possible to RESET only a specific alert/selector?

Relevant docs you followed/actions you took to solve the issue

Environment/Browser/Agent’s version etc

  • Netdata 1.46.3
  • (single) Parent

What I expected to happen

I start multiple maintenance jobs across my infrastructure, each of which silences its own http/port checks. Once a specific maintenance job has completed, it should remove its own silencing/disabling selector, but NOT any other silencing/disabling selector that other maintenance jobs still depend on.

I am able to silence alarms using the SILENCE command. However, it needs to be done by chart name. I could not do the same based on alarm name.

Here is a snippet from my Python code:

    subprocess.run(["/usr/bin/curl", "-s", f"http://localhost:19999/api/v1/manage/health?cmd=SILENCE&chart={chart_name}&hosts={nodemac}", "-H", auth])

To unsilence it:

    subprocess.run(["/usr/bin/curl", "-s", f"http://localhost:19999/api/v1/manage/health?cmd=RESET&chart={chart_name}&hosts={nodemac}", "-H", auth])

The hosts={nodemac} portion is optional; you may or may not need it.

Oh and you also need to set the “auth” variable:

    auth = subprocess.run(["/bin/cat", "/var/lib/netdata/netdata.api.key"], stdout=subprocess.PIPE)
    auth = auth.stdout.decode('utf-8').strip()
    auth = f"X-Auth-Token: {auth}"

The location for the api key is defined in /etc/netdata/netdata.conf
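For anyone who would rather not shell out to curl and cat, the same calls can be sketched with the Python standard library alone. This is a minimal sketch, not a tested replacement: the endpoint, the cmd/chart/hosts parameters, the X-Auth-Token header and the key path all come from the snippets above, while the function names (read_auth_token, build_health_url, send_health_command) and the 10 second timeout are made up for illustration.

```python
import urllib.parse
import urllib.request

API_KEY_PATH = "/var/lib/netdata/netdata.api.key"  # path taken from the post above
BASE_URL = "http://localhost:19999/api/v1/manage/health"


def read_auth_token(path=API_KEY_PATH):
    """Return the API key with surrounding whitespace removed."""
    with open(path, encoding="utf-8") as key_file:
        return key_file.read().strip()


def build_health_url(cmd, chart=None, hosts=None, base_url=BASE_URL):
    """Build the management URL; selectors that are None are left out."""
    params = {"cmd": cmd}
    if chart is not None:
        params["chart"] = chart
    if hosts is not None:
        params["hosts"] = hosts
    # urlencode percent-encodes any characters that are not URL-safe.
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def send_health_command(cmd, token, chart=None, hosts=None):
    """Send one command to the agent and return the response body as text."""
    request = urllib.request.Request(
        build_health_url(cmd, chart=chart, hosts=hosts),
        headers={"X-Auth-Token": token},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Usage would look something like send_health_command("SILENCE", read_auth_token(), chart=chart_name, hosts=nodemac), with the same call using "RESET" to undo it.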

Hope this helps.

Thanks for the reply :)

That did help, as it got me looking closer at my selectors, and I think I now have something that works for the most part. However, I’m still running into an issue with vestigial selectors remaining. Specifically, after I’ve done a cmd=RESET&chart=some_http_or_port_check&hosts=some_hostname and then a cmd=LIST, I’ve found that the type is now None, but the silencer remains. In my testing, if any other job then adds a silencer, this vestigial selector gets picked up again, and if that service goes down there will be no notification. I think that’s why performing the reset returns this warning message:

    WARNING: Added alarm selector to silence/disable alarms without a SILENCE or DISABLE command.
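To make the leftover state easier to spot, the output of cmd=LIST could be checked programmatically. The sketch below assumes the LIST response is a JSON object with "all", "type" and "silencers" keys, which is how I read the documented output; the selector field names in the sample and the function name find_vestigial_selectors are hypothetical, so adjust them to whatever your agent actually returns.

```python
import json


def find_vestigial_selectors(list_response_text):
    """Return selectors that are still registered while the global type is "None"."""
    state = json.loads(list_response_text)
    silencers = state.get("silencers", [])
    # Selectors with type "None" do nothing by themselves, but they become
    # active again as soon as another job issues SILENCE or DISABLE.
    if state.get("type") == "None" and not state.get("all", False):
        return silencers
    return []


# Hand-written sample in the assumed LIST format, not captured agent output.
sample = (
    '{"all": false, "type": "None", "silencers": '
    '[{"chart": "some_http_or_port_check", "hosts": "some_hostname"}]}'
)
leftovers = find_vestigial_selectors(sample)
```

A non-empty result after a job has finished would indicate exactly the situation described above: nothing is silenced right now, but the selector is waiting to be reactivated.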

Honestly, I think this is a bug; unless I’m missing something?

Further testing shows that if two jobs start at the same time and one runs longer than the other, then as soon as the shorter job finishes it resets alerts for everybody, triggering alerts that were intended to stay silenced.
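Until a per-selector reset exists, one possible client-side workaround is to track which job owns which selector and, when a job finishes, issue a bare RESET and then immediately re-add the selectors of the jobs that are still running. The sketch below is only an illustration under assumptions: SilenceCoordinator and its send callback are hypothetical names, the chart names are placeholders, and it only coordinates jobs inside a single Python process. There is also a short window between the RESET and the re-added SILENCE calls in which an alert could still fire.

```python
import threading


class SilenceCoordinator:
    """Tracks which job owns which selector and rebuilds the silencer list on release."""

    def __init__(self, send):
        # send(cmd, selector) is a caller-supplied function that performs the
        # HTTP call; selector is a dict such as {"chart": "..."} or None.
        self._send = send
        self._active = {}  # job id -> selector dict
        self._lock = threading.Lock()

    def acquire(self, job_id, selector):
        """Register the job's selector and silence it."""
        with self._lock:
            self._active[job_id] = dict(selector)
            self._send("SILENCE", selector)

    def release(self, job_id):
        """Drop the job's selector without unsilencing the other jobs."""
        with self._lock:
            self._active.pop(job_id, None)
            # A bare RESET clears every selector, so re-add those still in use.
            self._send("RESET", None)
            for selector in self._active.values():
                self._send("SILENCE", selector)


# Demonstration with a fake send function that only records the calls.
calls = []
coordinator = SilenceCoordinator(lambda cmd, selector: calls.append((cmd, selector)))
coordinator.acquire("short-job", {"chart": "httpcheck_a.status"})
coordinator.acquire("long-job", {"chart": "httpcheck_b.status"})
coordinator.release("short-job")
```

In the demonstration, releasing the short job sends a RESET followed by a fresh SILENCE for the long job's chart, so the long job stays covered. Jobs running on different machines would need shared state (a file lock, a small database, etc.) instead of the in-memory dict.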

Upon further reflection, I don’t think it’s a bug, but rather a feature request. Is there a way to move this post to a better place, or should I re-post it somewhere else?

I created a new feature request on GitHub: [Feat]: health api - reset specific silenced/disabled alerts (Issue #18341, netdata/netdata)