How to "soften" slow web service alarms

On some of our high-performance web hosts we get too many web_service_slow alarms, and they often turn critical even at an average response time of 90 ms or less. The health template is configured like this:

template: web_service_slow
families: *
      on: httpcheck.responsetime
  lookup: average -5m unaligned of time
   units: ms
   every: 10s
    warn: ($this > ($1h_web_service_response_time * 4) )
    crit: ($this > ($1h_web_service_response_time * 6) )
    info: average response time over the last 5 minutes, compared to the average over the last hour
   delay: down 5m multiplier 1.5 max 1h
      to: webmaster

We analysed the problem, and it seems we have regular response times between 0.01 and 0.05 seconds (10-50 ms) most of the time. Every now and then a single request takes a couple of hundred milliseconds to respond, and that immediately “ruins” the average value and triggers the alarm.
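To illustrate the effect (a plain-Python sketch with made-up numbers, nothing Netdata-specific): a single slow outlier in an otherwise fast window drags the mean far above the typical value, while the median barely moves.

```python
import statistics

# Hypothetical 10-sample window: mostly 10-50 ms, plus one 800 ms outlier.
window = [12, 18, 25, 31, 44, 22, 15, 38, 27, 800]

mean = statistics.mean(window)      # 103.2 ms -- dominated by the outlier
median = statistics.median(window)  # 26.0 ms  -- reflects the typical request

print(f"mean={mean:.1f} ms, median={median:.1f} ms")
```

With a warn condition of "4x the hourly baseline" and a baseline around 25 ms, the mean here would already trip the alarm, while the median would not.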

Not sure if this is the proper way to think about it, but we came up with the idea of ignoring the lowest and highest values when calculating the average; that should already provide a much more realistic picture in this context.

Is that the right way to think about it, and if so, can this already be done in Netdata? Or are there better ways to approach this?

Hi @jurgenhaas

“ignore the lowest and highest value”

Sounds like the median. Unfortunately there is no such lookup method (available: average, min, max, sum).

Your use case is clear; I agree that a median method would be a nice addition :thinking:


@dim08 @Stelios_Fragkakis take a look


Percentile charts would probably be more appropriate for latencies. Then we could just have normal alerts on top of those, instead of the current ones based on the average response time. We have recently dealt with histograms that we convert to such charts - what do you think about adding some of those here, @ilyam8?
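As a rough illustration of why (plain Python with synthetic numbers, not Netdata's histogram code): for a service where 5% of requests are slow, the median hides the tail completely, the mean blurs it, and a high percentile exposes it.

```python
import statistics

# Synthetic latencies: 95 fast requests at 20 ms, 5 slow ones at 900 ms.
latencies = [20] * 95 + [900] * 5

mean = statistics.mean(latencies)                 # 64 ms   -- tail blurred
median = statistics.median(latencies)             # 20.0 ms -- tail invisible
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile, near 900 ms

print(f"mean={mean}, median={median}, p95={p95}")
```

An alert on the p95 value would flag the slow tail directly, without reacting to a single one-off spike.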

That is true, I agree. I like the median lookup method idea, because I've dealt with our alarms recently and had exactly the same thought: average is prone to false positives, especially when we look up 10+ minutes. Any spike ruins the picture.

And not only latencies would benefit from changing average to median. I guess it wasn't implemented because average is much cheaper in terms of CPU usage.

OK, sounds like this is something for the product. My question is: what can we do to get this onto the roadmap?

Hey @jurgenhaas,

The most immediate thing you can do is to contribute this to the Netdata Agent. Although this would require some C knowledge, we are more than eager to help you! cc @vlvkobal

The other route would be to create a topic on #feature-requests:agent-fr and wait for the product team to pick it up and prioritize it accordingly, depending on our internal roadmap and the community. Of course, if we see large interest from the community, it will be prioritized higher.

P.S. Have a great week!

Thanks @OdysLam for your reply. Unfortunately we have no C knowledge whatsoever in our organisation, so we will go for option 2 and do some promotion too, so that people hopefully show their interest and you can prioritize it accordingly.

Have a great week too, thanks.


Actually, it seems there now is one - see Median | Learn Netdata.
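For anyone landing on this thread later: with a median lookup method available, the template above would only need its lookup line changed (a sketch assuming the same syntax as the other lookup methods - please verify against the linked docs):

   lookup: median -5m unaligned of time

$this then holds the 5-minute median instead of the mean, so the warn/crit expressions can stay as they are.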