We would like to enhance an alert for monitoring PostgreSQL replication slots by raising a warning if the condition persists for 6 hours and escalating to critical if it persists for 24 hours, based on the following alert configuration:
template: Alerts_Inactive_Replication_Slots
on: Alerts.Inactive_Replication_Slots
lookup: average -5m unaligned of Inactive
units: slots
every: 60s
warn: $this >= (1)
delay: up 6h
summary: Inactive Replication Slots
info: Inactive Replication Slots
to: dba
The configuration you shared is close to what you want; it just needs a few small changes.
If you want the alert to evaluate a longer time window, you need to change the query (the lookup field). Changing the delay field only delays the notification; the alert would still trigger on the average of the last 5 minutes.
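To illustrate why the lookup window matters more than the delay, here is a minimal Python sketch. It is not Netdata code: the `window_average` helper and the per-minute sample data are hypothetical, chosen only to show how the same `>= 1` threshold behaves over a 5-minute window versus a 6-hour window.

```python
def window_average(samples, window):
    """Average of the most recent `window` samples (hypothetical helper)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)

# Assumed data: one sample per minute over 6 hours (360 samples).
# Case 1: a slot has been inactive only for the last 10 minutes.
brief = [0] * 350 + [1] * 10
# Case 2: a slot has been inactive for the whole 6 hours.
persistent = [1] * 360

# The 5-minute average already reaches the threshold in the brief case...
brief_5m_triggers = window_average(brief, 5) >= 1
# ...while the 6-hour average stays far below it.
brief_6h_triggers = window_average(brief, 360) >= 1
# Only a condition lasting the whole window reaches 1 over 6 hours.
persistent_6h_triggers = window_average(persistent, 360) >= 1

print(brief_5m_triggers, brief_6h_triggers, persistent_6h_triggers)
```

Note that with an average, the value reaches 1 only if at least one slot is inactive, on average, across the whole window, which is what "persists for 6 hours" asks for here.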
Querying a large time window is expensive: the lookup bypasses tiering and requests high-resolution data from tier 0 (per second) rather than tier 1 (per minute) or tier 2 (per hour). We therefore may not want to check the alert as often as every 60s. Let’s run the check every 15 minutes, for example, by changing the every field.
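As a rough back-of-the-envelope estimate of that cost, the Python sketch below counts how many per-second points would be read per day for a given window and check interval. The `points_per_day` function is a hypothetical model, assuming one point per second per query; the real cost depends on Netdata internals.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def points_per_day(window_s, every_s, resolution_s=1):
    """Points read per day: (checks per day) * (points per check).

    Hypothetical cost model, assuming each check reads the full
    window at `resolution_s` seconds per point.
    """
    checks = SECONDS_PER_DAY // every_s
    points_per_check = window_s // resolution_s
    return checks * points_per_check

original = points_per_day(window_s=5 * 60, every_s=60)           # 5m window, every 60s
six_h_every_60s = points_per_day(window_s=6 * 3600, every_s=60)  # 6h window, every 60s
six_h_every_15m = points_per_day(window_s=6 * 3600, every_s=15 * 60)

print(original)         # 432000
print(six_h_every_60s)  # 31104000
print(six_h_every_15m)  # 2073600
```

Under these assumptions, keeping the 60s interval with a 6-hour window would read about 72 times more points per day than the original alert, while checking every 15 minutes brings that down to roughly 5 times.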
So the end result should be something like this:
template: Alerts_Inactive_Replication_Slots_6h
on: Alerts.Inactive_Replication_Slots
lookup: average -6h unaligned of Inactive
units: slots
every: 15m
warn: $this >= 1
summary: Inactive Replication Slots 6h
info: Inactive Replication Slots 6h
to: dba
Unfortunately, a single alert cannot use a different time window (lookup) for its warning and critical conditions, so for the 24h check we need to create another alert:
template: Alerts_Inactive_Replication_Slots_24h
on: Alerts.Inactive_Replication_Slots
lookup: average -24h unaligned of Inactive
units: slots
every: 30m
crit: $this >= 1
summary: Inactive Replication Slots 24h
info: Inactive Replication Slots 24h
to: dba
Also, I see that the chart (the on field) and the dimension (Inactive in lookup: average ... Inactive) start with an upper-case letter. Please confirm those are correct; usually they are all lower case.