Alert status escalation

Richard_Barrantes · November 27, 2024, 5:27pm

Hello Team,

We would like to enhance an alert for monitoring PostgreSQL replication slots by raising a warning if the condition persists for 6 hours and escalating to critical if it persists for 24 hours, based on the following alert configuration:

 template: Alerts_Inactive_Replication_Slots
       on: Alerts.Inactive_Replication_Slots
   lookup: average -5m unaligned of Inactive
    units: slots
    every: 60s
     warn: $this >= (1)
    delay: up 6h
  summary: Inactive Replication Slots
     info: Inactive Replication Slots
       to: dba

Can you help us to find a solution?

Regards,
Richard Barrantes

car12o · December 3, 2024, 4:41pm

Hello @Richard_Barrantes,

The configuration you shared is pretty close to what you want; it just needs small changes.

If you want the alert to check a time window, you need to change the query (the lookup field). Changing the delay field would only delay the notification; the alert would still trigger on the average of the last 5 minutes.
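To illustrate the difference between the two windows, here is a minimal Python sketch (not Netdata code). It assumes a made-up Inactive series with one sample per minute, and the helper window_average is invented for this example; it only mimics the arithmetic of an "average -Nm" style lookup.

```python
def window_average(samples, window):
    """Average of the last `window` samples (fewer if the series is shorter)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)

# Hypothetical series, one sample per minute: 6 healthy hours (0 inactive
# slots), then one slot goes inactive and stays inactive.
healthy = [0] * 360
after_10_min = healthy + [1] * 10     # slot inactive for 10 minutes
after_6_hours = healthy + [1] * 360   # slot inactive for 6 hours

short_avg = window_average(after_10_min, 5)         # like "average -5m"
long_avg_early = window_average(after_10_min, 360)  # like "average -6h"
long_avg_late = window_average(after_6_hours, 360)  # like "average -6h"

print(short_avg >= 1)       # True: the 5-minute check already fires
print(long_avg_early >= 1)  # False: 10/360 is far below 1
print(long_avg_late >= 1)   # True: fires once the slot was inactive for 6h
```

Because this is an average, the 6h value can also reach 1 when several slots are inactive for a shorter time (for example, 6 slots for 1 hour), so it is worth checking whether that matters for your setup.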

Querying a large time window is heavy: health queries bypass tiering and request high-resolution data from tier 0 (per second), not tier 1 (per minute) or tier 2 (per hour). For that reason we may not want to check the alert as often as every 60s. Let's run the alert check every 15 minutes, for example, by changing the every field.

Related pull request: "Run health queries from tier 0" (github.com/netdata/netdata, netdata:master ← MrZammler:query-min-lookup, opened by MrZammler on 22 Sep 2023, +1 -1).

Summary: This PR forces the use of tier 0 when doing health queries. In some cases, e.g. when the lookup period is big, the query planner might pick another tier to execute the query. This could lead to unexpected results. Of course this change can slightly affect CPU usage, but at the same levels as before tiering was introduced.

Test Plan: Testing of general alert results. Will also be tested on a specific use case (large update_every and big lookup).
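As a rough back-of-the-envelope check of why a longer every interval helps, the following Python sketch counts how many tier 0 points a 6h lookup would read per day. It assumes one point per second in tier 0 (as stated above) and that cost grows with the number of points read; the helper daily_points_read is hypothetical and the real query engine may behave differently.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_points_read(window_seconds, every_seconds, seconds_per_point=1):
    """Rough number of points read per day by one alert on one dimension."""
    points_per_evaluation = window_seconds // seconds_per_point
    evaluations_per_day = SECONDS_PER_DAY // every_seconds
    return points_per_evaluation * evaluations_per_day

six_hours = 6 * 60 * 60

every_60s = daily_points_read(six_hours, 60)       # 21600 points x 1440 runs
every_15m = daily_points_read(six_hours, 15 * 60)  # 21600 points x 96 runs

print(every_60s)               # 31104000
print(every_15m)               # 2073600
print(every_60s // every_15m)  # 15
```

Under these assumptions, moving from every 60s to every 15m cuts the work for this alert by a factor of 15.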

So the end result should be something like this:

 template: Alerts_Inactive_Replication_Slots_6h
       on: Alerts.Inactive_Replication_Slots
   lookup: average -6h unaligned of Inactive
    units: slots
    every: 15m
     warn: $this >= 1
  summary: Inactive Replication Slots 6h
     info: Inactive Replication Slots 6h
       to: dba

Unfortunately, an alert has a single query (the lookup line), so warn and crit cannot use different time windows within the same alert. For the 24h check we will need to create another alert.

 template: Alerts_Inactive_Replication_Slots_24h
       on: Alerts.Inactive_Replication_Slots
   lookup: average -24h unaligned of Inactive
    units: slots
    every: 30m
     crit: $this >= 1
  summary: Inactive Replication Slots 24h
     info: Inactive Replication Slots 24h
       to: dba

Also, I see that the chart name (the on field) and the dimension name (Inactive, in the lookup line) contain upper-case letters. Please confirm those are correct; usually they are all lower case.

I hope this helps.

Richard_Barrantes · December 3, 2024, 5:15pm

Hello @car12o

Thank you for the information! I will proceed with setting up a separate alert for the 24-hour check. This was very helpful.

Regards,
