Alert Configuration Question



  • I am using the alarm below to catch an app (Spectre) whose CPU usage has stayed above 290% for the last 10 minutes. The email alert works fine when a job runs at over 290% CPU constantly for 10 minutes or longer. However, I also received an email alert for another job that ran at over 1200% CPU for only a little over 2.5 minutes. Mathematically, both alerts are correct, but the latter is not one I want.

    How do I configure the alarm to fire only when the app runs above 290% for the full 10 minutes? In other words, it should trigger only when every metric captured for Spectre in that window is equal to or greater than 290%. Using “average” in the lookup does not work for me when spikes occur.

    Thanks!

    alarm: apps_cpu_Spectre
    on: apps.cpu
    os: linux
    hosts: *
    lookup: average -10m unaligned of Spectre
    unit: %
    every: 1m
    warn: $this > (($status >= $WARNING) ? (280) : (290))
    crit: $this > (($status == $CRITICAL) ? (290) : (390))
    delay: down 15m multiplier 1.5 max 1h
    to: sysadmin



  • You may have to use a more complex expression.

    For example, instead of average, you might want to use min.



  • Thank you for the reply. But how do I use “min”? AFAIK, if I have a line like the following:

    lookup: min -10m unaligned of Spectre

    a single value will be returned in the $this variable. Can I query the dataset for the last 10 minutes from the health configuration file and then post-process it?

    Thanks.



  • Update:

    I just replaced “average” with “min” in the above example and the alert worked as expected.

    Thanks!
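
    For anyone finding this later, here is a sketch of the alarm from the first post with that one change applied. It assumes every other line stays exactly as originally posted; only the lookup method differs. With "min", $this holds the lowest value seen in the 10-minute window, so the thresholds are crossed only if every sample in the window is above them. With "average", a burst of roughly 1200% for 2.5 minutes works out to about 1200 × 2.5 / 10 = 300% over the window (assuming the app was otherwise near idle), which is why the short spike triggered the original alarm.

    ```
    alarm: apps_cpu_Spectre
    on: apps.cpu
    os: linux
    hosts: *
    # min instead of average: $this is the lowest value in the last 10 minutes
    lookup: min -10m unaligned of Spectre
    unit: %
    every: 1m
    warn: $this > (($status >= $WARNING) ? (280) : (290))
    crit: $this > (($status == $CRITICAL) ? (290) : (390))
    delay: down 15m multiplier 1.5 max 1h
    to: sysadmin
    ```

    Note that the comparisons are strict (>), so a window whose minimum is exactly 290% would not raise the warning.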


  • Staff

    Cool, thank you for letting us know!

