How to "soften" slow web service alarms

On some of our high-performance web hosts we get too many web_service_slow alarms, and they often turn critical even at an average response time of 90 ms or less. The health template is configured like this:

template: web_service_slow
families: *
      on: httpcheck.responsetime
  lookup: average -5m unaligned of time
   units: ms
   every: 10s
    warn: ($this > ($1h_web_service_response_time * 4) )
    crit: ($this > ($1h_web_service_response_time * 6) )
    info: average response time over the last 5 minutes, compared to the average over the last hour
   delay: down 5m multiplier 1.5 max 1h
      to: webmaster

We analysed the problem, and it seems we have regular response times between 0.01 and 0.05 seconds (10-50 ms) most of the time. Every now and then a single request takes a couple of hundred milliseconds to respond, and that immediately “ruins” the average value and triggers the alarm.
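To illustrate the effect (a plain-Python sketch with made-up numbers, nothing Netdata-specific): a single slow outlier in an otherwise fast window drags the mean far above the typical value, while the median barely moves.

```python
import statistics

# Hypothetical 10-sample window: mostly 10-50 ms, plus one 800 ms outlier.
window = [12, 18, 25, 31, 44, 22, 15, 38, 27, 800]

mean = statistics.mean(window)      # 103.2 ms -- dominated by the outlier
median = statistics.median(window)  # 26.0 ms  -- reflects the typical request

print(f"mean={mean:.1f} ms, median={median:.1f} ms")
```

With a warn condition of "4x the hourly baseline" and a baseline around 25 ms, the mean here would already trip the alarm, while the median would not.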

Not sure if this is the proper way to think about it, but we came up with the idea of ignoring the lowest and highest values when calculating the average; that should already provide a much more realistic picture in this context.

Is that the right way to think about it, and if so, can this already be done in Netdata? Or are there better ways to approach this?

Hi @jurgenhaas

“ignore the lowest and highest value”

Sounds like the median. Unfortunately there is no such lookup method (available: average, min, max, sum).

Your use case is clear; I agree that a median method would be a nice addition :thinking:


@dim08 @Stelios_Fragkakis take a look


Percentile charts would probably be more appropriate for latencies. Then we could just have normal alerts on top of those, instead of the current ones based on the average response time. We have recently dealt with histograms that we convert to such charts - what do you think about adding some of those here, @ilyam8?
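As a rough illustration of why (plain Python with synthetic numbers, not Netdata's histogram code): for a service where 5% of requests are slow, the median hides the tail completely, the mean blurs it, and a high percentile exposes it.

```python
import statistics

# Synthetic latencies: 95 fast requests at 20 ms, 5 slow ones at 900 ms.
latencies = [20] * 95 + [900] * 5

mean = statistics.mean(latencies)                 # 64 ms   -- tail blurred
median = statistics.median(latencies)             # 20.0 ms -- tail invisible
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile, near 900 ms

print(f"mean={mean}, median={median}, p95={p95}")
```

An alert on the p95 value would flag the slow tail directly, without reacting to a single one-off spike.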

That is true, I agree. I like the median lookup method idea, because I've dealt with our alarms recently and had exactly the same thought: average is prone to false positives, especially when we look up 10+ minutes. Any spike ruins the picture.

And not only latencies would benefit from changing average to median. I guess it wasn't implemented because average is much cheaper in terms of CPU usage.

OK, sounds like this is something for the product. My question is: what can we do to get this onto the roadmap?

Hey @jurgenhaas,

The most immediate thing you can do is to contribute this to the Netdata Agent. Although this would require some C knowledge, we are more than eager to help you! cc @vlvkobal

The other route would be to create a topic on #feature-requests:agent-fr and wait for the product team to pick it up and prioritize it accordingly, depending on our internal roadmap and the community. Of course, if we see large interest from the community, it will be prioritized higher.

P.S. Have a great week!

Thanks @OdysLam for your reply. Unfortunately we have no C knowledge whatsoever in our organisation, so we will go for option 2 and do some promotion too, so that people hopefully show their interest and you can prioritize it accordingly.

Have a great week too, thanks.


Actually, it seems there now is one - see Median | Learn Netdata.
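For anyone landing on this thread later: with a median lookup method available, the template above would only need its lookup line changed (a sketch assuming the same syntax as the other lookup methods - please verify against the linked docs):

   lookup: median -5m unaligned of time

$this then holds the 5-minute median instead of the mean, so the warn/crit expressions can stay as they are.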