I am using the alarm below to catch an app (Spectre) running at over 290% CPU for the last 10 minutes. The email alert works fine when a job runs at over 290% CPU constantly for 10 minutes or longer. However, I also received an alert for another job that ran at over 1200% CPU for only a little over 2.5 minutes. Mathematically both alerts are correct, since 1200% sustained for 2.5 minutes of a 10-minute window already averages to about 300%, which is above the 290% threshold, but the second alert is not the one I expect.
How do I configure the alarm to capture the app running at over 290% for the full 10 minutes? In other words, it should trigger only when every captured sample for Spectre is greater than or equal to 290%. Using “average” in the lookup does not work for me when spikes occur (I sketch a min-based variant after the config below).
Thanks!
 alarm: apps_cpu_Spectre
    on: apps.cpu
    os: linux
 hosts: *
lookup: average -10m unaligned of Spectre    # 10-minute moving average of the Spectre dimension
 units: %
 every: 1m
  warn: $this > (($status >= $WARNING)  ? (280) : (290))    # raises above 290, clears only below 280 (hysteresis)
  crit: $this > (($status == $CRITICAL) ? (290) : (390))    # raises above 390, clears only below 290 (hysteresis)
 delay: down 15m multiplier 1.5 max 1h
    to: sysadmin
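
The only way I can think of to express "every sample must be at or above the threshold" is to check the minimum instead of the average: if the lowest sample in the window is above 290%, then every sample in the window was. Below is a minimal, untested sketch of that variant using Netdata's min lookup method (the alarm name apps_cpu_Spectre_sustained is just a placeholder):

 alarm: apps_cpu_Spectre_sustained
    on: apps.cpu
    os: linux
 hosts: *
lookup: min -10m unaligned of Spectre    # lowest sample in the last 10 minutes
 units: %
 every: 1m
  warn: $this > 290    # true only if every sample in the window exceeded 290%
 delay: down 15m multiplier 1.5 max 1h
    to: sysadmin

With min, a short spike can never trigger the alarm on its own; the trade-off is that a single sample dipping below 290% during an otherwise sustained run also suppresses it. Is this the right approach, or is there a better way?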