On some of our high-performance web hosts we get too many web_service_slow alarms, and they often turn critical even for an average response time of 90 ms or less. The health template is configured like this:
```
template: web_service_slow
families: *
      on: httpcheck.responsetime
  lookup: average -5m unaligned of time
   units: ms
   every: 10s
    warn: ($this > ($1h_web_service_response_time * 4))
    crit: ($this > ($1h_web_service_response_time * 6))
    info: average response time over the last 5 minutes, compared to the average over the last hour
   delay: down 5m multiplier 1.5 max 1h
      to: webmaster
```
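To illustrate why 90 ms is already enough to go critical: assuming a hypothetical 1-hour baseline of 15 ms (a made-up figure, but consistent with our typical response times below), the multipliers in the template work out to:

```python
# Hypothetical 1-hour baseline average in ms (an assumed figure, not measured).
baseline_1h_ms = 15.0

warn_above = baseline_1h_ms * 4  # warn: $this > baseline * 4 -> 60 ms
crit_above = baseline_1h_ms * 6  # crit: $this > baseline * 6 -> 90 ms
print(f"warn above {warn_above:.0f} ms, crit above {crit_above:.0f} ms")
```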
We analysed the problem, and it seems we have regular response times between 0.01 and 0.05 seconds (10–50 ms) most of the time. Every now and then a single request takes a couple of hundred milliseconds to respond, but that immediately "ruins" the 5-minute average and triggers the alarm.
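Here is a quick sketch (with made-up numbers) of what we mean, in Python:

```python
# Made-up sample of one 5-minute window: mostly fast requests plus a single outlier.
samples_ms = [12, 15, 11, 14, 13, 480]

average = sum(samples_ms) / len(samples_ms)
print(f"plain average: {average:.1f} ms")  # ~90.8 ms, far above the typical 11-15 ms
```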
We are not sure if this is the proper way to think about it, but we came up with the idea of ignoring the lowest and highest value when calculating the average, which should already give a much more realistic picture in this context.
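A minimal Python sketch of that idea (a trimmed mean), applied to the same made-up window as above:

```python
def trimmed_mean(samples):
    """Average after discarding the single lowest and highest value."""
    if len(samples) <= 2:
        return sum(samples) / len(samples)
    trimmed = sorted(samples)[1:-1]
    return sum(trimmed) / len(trimmed)

samples_ms = [12, 15, 11, 14, 13, 480]
print(f"trimmed average: {trimmed_mean(samples_ms):.1f} ms")  # ~13.5 ms, close to typical
```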
Is that the right way to think about it, and if so, can this already be done in Netdata? Or are there better ways to approach this?