On some of our high-performance web hosts we get too many web_service_slow alarms, and they often turn critical even when the average response time is 90ms or less. The health template is configured like this:
template: web_service_slow
families: *
on: httpcheck.responsetime
lookup: average -5m unaligned of time
units: ms
every: 10s
warn: ($this > ($1h_web_service_response_time * 4) )
crit: ($this > ($1h_web_service_response_time * 6) )
info: average response time over the last 5 minutes, compared to the average over the last hour
delay: down 5m multiplier 1.5 max 1h
to: webmaster
We analysed the problem and it seems that we have regular response times between 0.01 and 0.05 most of the time. Every now and then, there is a single request that takes a couple of hundred milliseconds to respond, but that immediately “ruins” the average value and triggers the alarm.
Not sure if this is the proper thinking, but we came up with the idea of ignoring the lowest and highest values when calculating the average. That alone should already give a much more realistic picture in this context.
Is that the right way to think about it, and if so, can this already be done in Netdata? Or are there better ways to approach this?
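To illustrate the idea of dropping the lowest and highest value, here is a minimal Python sketch. It is not Netdata code and not an existing Netdata lookup method; the helper name trimmed_mean and the sample values (in ms) are made up for illustration only.

```python
from statistics import mean, median


def trimmed_mean(samples):
    """Average after dropping one lowest and one highest value.

    Falls back to the plain mean when there are fewer than three
    samples, because nothing would be left after trimming.
    """
    if len(samples) < 3:
        return mean(samples)
    ordered = sorted(samples)
    return mean(ordered[1:-1])


# Hypothetical window: 29 fast responses and a single slow one (ms).
samples = [20.0] * 29 + [300.0]

plain = mean(samples)            # pulled upwards by the single outlier
trimmed = trimmed_mean(samples)  # outlier and one low value removed
middle = median(samples)         # not affected by a single outlier

print(f"mean={plain:.2f} trimmed={trimmed:.2f} median={middle:.2f}")
```

With this made-up data the plain average moves noticeably because of one slow request, while the trimmed average and the median stay at the typical value. Note that trimming only one value on each side stops helping as soon as two slow requests land in the same window.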
Percentile charts would probably be more appropriate for latencies. Then we could just have normal alerts on top of those, instead of the ones based on the average response time. We have recently dealt with histograms that we convert to such charts. What do you think about adding some of those here, @ilyam8?
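As a rough sketch of why percentiles behave better for latencies, the Python snippet below computes nearest-rank percentiles over a window of samples. This is only an illustration, not how Netdata builds its charts; the percentile helper and the sample values (in ms) are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical window: 29 fast responses and a single slow one (ms).
samples = [20.0] * 29 + [300.0]

p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)

print(f"p50={p50} p95={p95} p99={p99}")
```

In this made-up window the single slow request only shows up in the very highest percentile, so an alert on p50 or p95 would not react to it, while an alert on p99 still would.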
That is true, I agree. I like the idea of a median lookup method, because I've dealt with our alarms recently and had exactly the same thought: the average is prone to false positives, especially when we look up 10+ minutes. Any spike ruins the picture.
And not only latencies would benefit from changing the average to a median. I guess it wasn't implemented because using the average is much cheaper in terms of CPU usage.
The most immediate thing you can do is to contribute this to the Netdata Agent. Although this would require some C knowledge, we are more than eager to help you! cc @vlvkobal
The other route would be to create a topic on #feature-requests:agent-fr and wait for the product team to pick it up and prioritize it accordingly depending on our internal roadmap and the community. Of course, if we see a large interest by the community, it will be prioritized higher.
Thanks @OdysLam for your reply. Unfortunately we have no C knowledge whatsoever in our organisation, so we have to go for option 2. We will also do some promotion so that people hopefully show their interest, which should help you prioritize it.