I get flooded with false alarms. I spend a lot of time adjusting the thresholds, but I never get the perfect setup, so I am looking for a way to reduce the number of alarms.
Using the threshold criteria alone is not enough, and is sometimes even misleading!
For example, a 10min_disk_backlog of 5000ms is perfectly expected on a server doing the work it was designed for, but 200ms on the same server when idle would indicate a hardware disk issue. Setting the threshold at 6000ms would eliminate the false alerts under normal workload, but I would then miss the 200ms alert when idle. Setting the threshold at 100ms would flood me with false alarms during normal work.
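To make the dilemma concrete, here is a small Python sketch using the two readings from the example above. The `Sample` type and the `is_real_problem` labels are my own illustration (the operator's knowledge of the machine), not anything netdata provides.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    backlog_ms: int        # 10min_disk_backlog reading
    server_busy: bool      # is the server doing its designed work?
    is_real_problem: bool  # ground truth, known only to the operator


# The two situations described above.
samples = [
    Sample(5000, server_busy=True, is_real_problem=False),
    Sample(200, server_busy=False, is_real_problem=True),
]


def evaluate(threshold_ms, samples):
    """Return (false_alarms, missed_problems) for a plain threshold check."""
    false_alarms = sum(
        1 for s in samples
        if s.backlog_ms > threshold_ms and not s.is_real_problem
    )
    missed = sum(
        1 for s in samples
        if s.backlog_ms <= threshold_ms and s.is_real_problem
    )
    return false_alarms, missed


print(evaluate(6000, samples))  # (0, 1): quiet, but the idle 200ms is missed
print(evaluate(100, samples))   # (1, 0): nothing missed, but a false alarm
```

No single threshold gets both cases right, because the deciding factor (what the server is doing) is not part of the threshold check at all.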
Therefore I propose an enhancement to the alarm system: allow the user to run a customised supplemental verification. For the above example, I imagine it would work like this:
The user adds this line to the 10min_disk_backlog template:
veto script: /usr/local/bin/10min_disk_utilization.sh
netdata would then execute this script, passing it arguments such as $this, $warn, $crit and the block device /dev/ path, and let it run for a limited time (10 seconds?). It would then use the script's output or exit code to determine whether the alarm is in fact a false alarm. If the script reports a false alarm, netdata does not treat it as an alarm: it sends no user notification and only writes an entry to the log file. If the script does not report within the allowed time, netdata reports this as an additional alarm and assumes the script did not detect a false alarm.
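A rough Python sketch of the behaviour I have in mind on the netdata side. This is not how netdata works today; the function name `run_veto`, the argument order, and the exit code convention (0 means "false alarm, suppress it", anything else means the alarm stands) are all assumptions made for illustration.

```python
import subprocess

VETO_TIMEOUT_SECONDS = 10  # the "limited time" suggested above


def run_veto(command, this, warn, crit, device, timeout=VETO_TIMEOUT_SECONDS):
    """Run a veto script and return (alarm_stands, timed_out).

    Assumed convention (hypothetical): exit code 0 means the script
    recognised a false alarm; any other exit code means the alarm stands.
    """
    args = list(command) + [str(this), str(warn), str(crit), device]
    try:
        result = subprocess.run(
            args,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # No answer in time: the alarm stands, and the caller should
        # raise an additional alarm about the veto script itself.
        return True, True
    except OSError:
        # Script missing or not executable: it cannot veto anything.
        return True, False
    return result.returncode != 0, False
```

A caller would notify the user only when `alarm_stands` is true, write a log entry otherwise, and raise the extra alarm when `timed_out` is true.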
In the script, we know we are called only by netdata and only on a certain condition, so we know precisely what to look for based on our previous knowledge of the machine's workload. We look for known processes that would cause the condition and examine their impact on the system. For example, I know from previous observations that this alert is raised every time my iPhone connects to wifi and uploads camera media to the cloud, after which my virtual macOS KVM machine downloads them from the cloud and does some video processing on them. I then look for disk activity from the qemu process, make my own threshold decision on whether this is caused by qemu or by something else, and exit with the corresponding exit code.
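As a sketch of what such a script's core logic could look like, here is a Python version of the "is qemu responsible?" decision. It assumes the script samples the Linux file /proc/<pid>/io for the qemu process twice, a few seconds apart, and compares the `read_bytes` and `write_bytes` counters. Finding the pid, the sampling interval, the rate threshold, and the exit code convention (0 means false alarm) are all my own assumptions.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def disk_bytes_per_second(before, after, interval_seconds):
    """Disk read + write rate of one process between two samples."""
    moved = (after["read_bytes"] - before["read_bytes"]) + (
        after["write_bytes"] - before["write_bytes"]
    )
    return moved / interval_seconds


def veto_exit_code(rate, min_rate):
    """Assumed convention: 0 = expected workload (false alarm), 1 = alarm stands."""
    return 0 if rate >= min_rate else 1


# Example with two made-up samples taken 2 seconds apart.
before = parse_proc_io("read_bytes: 1000\nwrite_bytes: 2000\n")
after = parse_proc_io("read_bytes: 6000\nwrite_bytes: 7000\n")
rate = disk_bytes_per_second(before, after, 2)
print(rate, veto_exit_code(rate, min_rate=4000))  # 5000.0 0
```

In a real script the two samples would come from reading /proc/<pid>/io (which normally requires running as the same user or as root), and the script would end with `sys.exit(veto_exit_code(...))`.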
It would also be nice to be able to provide, instead of a script name, the names of processes that are top I/O generators. netdata should already know which processes are the top I/O generators (I can already see them in a graph in the web interface), so for the above example I could simply say:
veto process: qemu-kvm smbd zpool zfs rsync
If netdata finds that qemu-kvm, smbd, zpool, zfs or rsync is the top I/O process, it would ignore the alarm.
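The matching itself could be as simple as the following Python sketch. The `veto process:` syntax is the one proposed above; the function name and the idea that netdata hands over a single "top I/O process" name are assumptions on my part.

```python
def vetoed_by_process(veto_line, top_io_process):
    """Return True if the top I/O process is listed in a 'veto process:' line."""
    keyword, sep, names = veto_line.partition(":")
    if not sep or keyword.strip() != "veto process":
        return False
    # Process names are separated by whitespace, as in the proposed syntax.
    return top_io_process in names.split()


line = "veto process: qemu-kvm smbd zpool zfs rsync"
print(vetoed_by_process(line, "qemu-kvm"))  # True: alarm is ignored
print(vetoed_by_process(line, "dd"))        # False: alarm stands
```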
This example is only for the 10min_disk_backlog template, but I imagine this can be done for virtually all types of alarm templates. For example, for network issues I could use the netblocks that are doing I/O and have something like
veto subnets: A.B.C.D/X
in addition to veto process and veto script. For CPU usage, you could have
veto kvm: list of running libvirt guest names
I don’t think veto kvm, for example, should be restricted to the CPU alarm. All veto keywords should be allowed in all types of alarms, even if it wouldn’t make sense to have veto kvm affect the squid last collected alarm, for example. The user should be able to make any combination he desires.
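Here is a Python sketch of how a subnet veto and a free combination of vetoes might be evaluated. The addresses are from the documentation range 192.0.2.0/24 and are placeholders only. That netdata knows a "top talker" address, and that any single matching veto suppresses the alarm, are both assumptions of mine.

```python
import ipaddress


def vetoed_by_subnet(subnets, top_talker_ip):
    """Return True if the busiest peer address falls inside a vetoed subnet."""
    address = ipaddress.ip_address(top_talker_ip)
    return any(address in ipaddress.ip_network(subnet) for subnet in subnets)


def alarm_is_vetoed(checks):
    """Combine any mix of veto checks; assumed rule: one match is enough."""
    return any(check() for check in checks)


checks = [
    lambda: vetoed_by_subnet(["192.0.2.0/24"], "192.0.2.17"),
    lambda: "rsync" in ["qemu-kvm", "smbd"],  # stand-in for a veto process check
]
print(alarm_is_vetoed(checks))  # True: the subnet veto matched
```

Keeping every check as an independent yes/no answer is what makes arbitrary combinations cheap to support, even the ones that make little sense for a given alarm.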
I can think of many other veto keywords/criteria if this idea is embraced by the team, but I’ll stop for now and wait for your feedback.