False alarm detection

I get flooded with false alarms. I spend a lot of time adjusting the thresholds, but I never get the perfect setup, so I am thinking about how to reduce the number of false alarms.
Using the threshold criteria alone is not enough, and sometimes it is even misleading!

For example, a 10min_disk_backlog of 5000ms is perfectly expected on a server doing the work it was designed for, but 200ms on the same server when it is idle would indicate a hardware disk issue. Setting the threshold at 6000ms would eliminate the false alerts under normal workload, but I would miss the 200ms alert when the server is idle. Setting the threshold at 100ms would flood me with false alarms during normal work.

Therefore I propose an enhancement to the alarm system: allow the user to run a customised supplemental verification. For the above example, I imagine it would work like this:

The user adds this line to the 10min_disk_backlog template:
veto script: /usr/local/bin/10min_disk_utilization.sh

Netdata would then execute this script, passing it some arguments such as $this, $warn, $crit and the block device /dev/ path, let it run for a limited time (10 seconds?), then use its output or exit code to decide whether the alarm is in fact a false alarm. If the script reports a false alarm, do not treat it as an alarm: send no user notification and only log an entry in the log file. If the script does not report back within the allowed time, raise that as an additional alarm and assume the script did not detect a false alarm.

Inside the script, we know we are being called by netdata only on a certain condition, so we know precisely what to look for based on our prior knowledge of the machine's workload. We look for the known processes that would cause the condition and examine the impact they have on the system. For example, I know from previous observations that this alert rises every time my iPhone connects to WiFi and uploads camera media to the cloud, at which point my virtual macOS KVM machine starts downloading the files from the cloud and does some video processing on them. So I would look for disk activity from the qemu process, make my own threshold decision about whether the backlog is caused by qemu or by something else, and exit with the corresponding exit code.
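Just to make the shape of this concrete, here is a minimal sketch of what such a veto script could look like. The argument order is only my assumption from the proposal above, and the exit-code convention (0 = false alarm, suppress; non-zero = real alarm) is likewise just a suggestion:

#!/bin/sh
# Hypothetical veto script for the 10min_disk_backlog template.
# Assumed arguments (this interface does not exist in netdata yet):
#   $1 = current value ($this), $2 = $warn, $3 = $crit, $4 = block device path
VALUE="$1"
DEVICE="$4"

# If my known heavy I/O workload (the macOS guest running under qemu) is
# active, the backlog is expected: report a false alarm.
if pgrep -f qemu >/dev/null 2>&1; then
    exit 0
fi

# No known workload found: this looks like a real alarm.
exit 1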

It would also be nice to be able to provide, instead of a script name, the name of a process that is a top I/O generator. Netdata should already know which processes are the top I/O generators (I can already see them in a graph in the web interface), so for the above example I could simply say:

veto process: qemu-kvm smbd zpool zfs rsync

If netdata finds that qemu-kvm, smbd, zpool or zfs is the top I/O process, it would ignore the alarm.
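Until something like that exists, a veto script could approximate the "top I/O process" check on its own. A rough sketch, assuming the sysstat package (pidstat) is installed and using the same exit-code convention as above:

#!/bin/sh
# Sample per-process disk I/O for 5 seconds and veto the alarm if the
# top I/O generator is one of the known/allowed processes.
VETO_LIST="qemu-kvm smbd zpool zfs rsync"

# pidstat -d reports kB_rd/s and kB_wr/s per process; pick the command
# with the largest combined rate from the "Average:" summary lines.
TOP=$(pidstat -d 5 1 | awk '$1 == "Average:" && $3 ~ /^[0-9]+$/ {
        io = $4 + $5
        if (io > max) { max = io; cmd = $NF }
      } END { print cmd }')

for p in $VETO_LIST; do
    [ "$TOP" = "$p" ] && exit 0   # known workload: false alarm
done
exit 1                            # unknown top I/O generator: keep the alarm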

This example is only for the 10min_disk_backlog template, but I imagine it could be done for virtually all types of alarm templates. For network issues, for example, I could use the netblocks that are doing I/O and have something like veto subnets: A.B.C.D/X in addition to veto process and veto script. For CPU usage, you could have veto kvm: followed by the names of the libvirt guests that are running. I don't think veto kvm, for example, should be restricted to the CPU alarm; all veto keywords should be allowed in all types of alarms, even if it wouldn't make sense to have veto kvm affect the squid last collected alarm, say. The user should be able to make any combination he desires.
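To make the combination idea concrete, a template could hypothetically carry several veto lines at once. None of these keywords exist in netdata today; the syntax and the guest name are invented purely to illustrate the idea:

template: 10min_disk_backlog
      on: disk.backlog
  lookup: average -10m unaligned
    warn: $this > $green
    crit: $this > $red
# hypothetical veto lines: if any of them matches, the alarm is treated
# as a false positive and no notification is sent
veto process: qemu-kvm smbd zpool zfs rsync
 veto script: /usr/local/bin/10min_disk_backlog_veto.sh
    veto kvm: my-macos-guest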

I can think of many other veto keywords/criteria if this idea is embraced by the team, but I'll stop here for now and wait for your feedback.

Hi, @ml35.

spend a lot of time adjusting the thresholds but I never get the perfect setup.

Yes, I completely agree with you on this :handshake:


I get flooded with false alarms.

What alarms are you referring to? 10min_disk_backlog has to: silent by default (no notifications); did you change it?

I honestly don't remember changing it. I usually tend to silence even more alarms, not to activate disabled ones. Can I view the defaults with some command? I will make a list of my false alarms, though keep in mind they are false only in my particular case and could be real alarms for other users! In any case, I believe allowing a script to override any alarm would be useful, regardless of which alarm it is, because the user could customise the check to the particular workload instead of silencing the alarm.

Here is a disks.conf from a backup dated 2017-03-03 that I am pretty sure I did not modify:

# raise an alarm if the disk backlog
# is above 1000ms (1s) per second
# for 10 minutes
# (i.e. the disk cannot catch up)

template: 10min_disk_backlog
      on: disk.backlog
families: *
  lookup: average -10m unaligned
   units: ms
   every: 1m
   green: 2000
     red: 5000
    warn: $this > $green * (($status >= $WARNING)  ? (0.7) : (1))
    crit: $this > $red   * (($status == $CRITICAL) ? (0.7) : (1))
   delay: down 15m multiplier 1.2 max 1h
    info: average of the kernel estimated disk backlog, for the last 10 minutes
      to: sysadmin

# chroot /backup; netdata -v
netdata 1.5.1_rolling

It's identical to the file provided by the tarball:

# diff /usr/local/src/netdata/conf.d/health.d/disks.conf /etc/netdata/health.d/disks.conf | wc -l
0

@ml35 That alarm turned out to be a false positive in a lot of cases, so we changed it to to: silent. Another problem with it: it is not really the disk backlog size; the metric name (and hence the alarm) is misleading/wrong.

Hold on a minute. The latest stable is v1.31.0. Consider updating.

@ml35 Having good stock alarms is important for us; we analyse the stock alarms from time to time and tune them accordingly. One of the goals is to reduce false positives. So my advice is to update Netdata regularly.

Keep in mind that user configuration (/etc/netdata/health.d/) takes precedence over stock configuration.

The stock ones should be under /usr/local/src/netdata/conf.d/health.d/. The latest versions are at https://github.com/netdata/netdata/tree/master/health/health.d.

I am already using the latest, or as new as possible, everywhere. That was just an example supporting my theory that it wasn't me who changed the to: sysadmin setting.

That is nice, but netdata is not one size fits all and you will never get satisfactory results for everyone. That is why I got this idea of (1) having a simple call to a user-provided script to double-check that this is a real alarm, and (2) trying to be smart and detect a false alarm through conditional checks (a user-provided list of system states, already known to netdata, whose presence makes it a false alarm).

I think (1) could be quite easy to implement as already suggested, while (2) needs further discussion.

You changed it for new installs, but how do you handle existing installs? What if a user ran edit-config health.d/disks.conf before the change and only altered some threshold but not the to: sysadmin setting; did the global change ever take effect for them? Because I'm running the latest or close to the latest everywhere, yet I am flooded from everywhere with 10min_disk_backlog, 10min_disk_utilization, inbound packets dropped, outbound packets dropped, etc.

@ml35 hello - this is all great feedback. I'd love to implement something ML-related here to be able to re-prioritize "false alarms" (however we define them) in some way. A few ideas we have and plan to explore are (1) a solid anomaly score that could be incorporated into alarm configs, so that an alarm might only trigger once the alarm conditions are met and the anomaly score is elevated, giving additional evidence that something looks unusual, and (2) building CTR models for the alarms, so that if we see an alarm with a CTR of say 0.5% versus one with a model CTR probability of around 5%, we can make it clear to the user that the 5% one is 10x more likely to get a click or to be followed by a troubleshooting session in the telemetry data.
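For illustration only, idea (1) might eventually look something like the snippet below; the $anomaly_rate variable is invented for this sketch and is not something the health engine exposes today:

# hypothetical: fire only when the threshold is breached AND netdata's
# anomaly score for the chart is elevated
template: 10min_disk_backlog
      on: disk.backlog
  lookup: average -10m unaligned
   units: ms
   every: 1m
    warn: ($this > $green) && ($anomaly_rate > 10)
    crit: ($this > $red)   && ($anomaly_rate > 25)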

There are a few other ideas we want to explore too, like using all the health measures as a sort of fingerprint of the node, letting users label certain fingerprints as uninteresting, and then, the next time we see something similar enough to a fingerprint you already told netdata you don't care about, routing it differently.

These are all still very early ideas, but we are hoping to make progress on them in the next few months, so I just wanted to chime in, as the feedback in this thread is very useful for how we think about and prioritize which features to try first.

Actually, your idea (2) here is kind of similar to my "fingerprint" idea: if we could use the health check values as a sort of fingerprint, that could be the context for each alarm. Users could then thumbs-up or thumbs-down alarms, and the list of thumbed-down "fingerprints" would be the basis for a sort of conditional check: is this alarm and its context (fingerprint) sufficiently close to one of the previous ones you told us you don't care about? If so, maybe it gets routed a little differently, so that over time what remains in the main default notification channel is more and more the genuinely novel alarms.

So basically I'd love to try some ML-based approaches so that users would not have to go as far as writing a sort of post-alarm validation script; instead we could try to learn some version of that over time.

That would be interesting, but I always prefer to have manual overrides available.

I hope you will also consider making an easy way for users to submit data to netdata to train the ML for subsequent released versions as well as for the local offline instance. However, I would definitely miss the freedom to customise the alerts with a manual check, which I think would be very quick to implement, as opposed to the long development I would expect for an ML engine.

I am thinking of going the complementary way: consider all alerts valid, and have the additional check invalidate them.

From my reading of the above, you seem to want to treat all alarms as false and then make additional checks to validate them, sort of like trying to detect anomalies across a set of data. This would be awesome if the detection rate were 100%; otherwise you can miss important alerts. I would always rather get an extra false alert than miss a real one, and I very much doubt an automated system will ever have a 100% accuracy rate.

Nonetheless, as I said above, I believe the two ideas are complementary. Even if the anomaly detection is tuned to be more tolerant, and therefore more prone to producing false alerts (with the goal of not missing a real one), I still think having an invalidating check would prove very useful.

Yep - I agree that sometimes (a lot of the time) starting with ML is not the solution, when simpler and more interpretable business rules or logic can work just as well, or even better, or at least well enough. So I think we might be talking about slightly different things.

I like your idea as a sort of post-alert or post-trigger check - it could be a nice concept to have.

I think it’s a very valid idea and if more users find it a useful addition, we should implement it. The health engine would only call the script when it detects a state change anyway, so it shouldn’t be too expensive.

I do have a cheap workaround for you though, one that doesn't require changes to the health engine. If you don't mind the alarm showing up as raised in the UI, you can put the extra check in a custom alarm notification function, so the agent doesn't send you notifications for these false positives. It's a simple bash script and it receives as arguments everything you need to make the decision, except for the top processes you mentioned.
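Something along these lines, assuming the standard custom notification hook in health_alarm_notify.conf and reusing the qemu example from earlier in the thread (treat it as a sketch, not a drop-in config):

# /etc/netdata/health_alarm_notify.conf (edit it with edit-config)
SEND_CUSTOM="YES"
DEFAULT_RECIPIENT_CUSTOM="sysadmin"

custom_sender() {
    # Second-guess only the disk backlog alarm; everything else falls
    # through to the normal notification below.
    if [ "${name}" = "10min_disk_backlog" ] && [ "${status}" != "CLEAR" ]; then
        if pgrep -f qemu >/dev/null 2>&1; then
            # known workload is running: treat as a false positive and
            # skip the notification entirely
            return 0
        fi
    fi

    # ... send the real notification here (mail, curl to a webhook, etc.) ...
}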