Hi,
I want to trigger an alarm when a particular chart on my netdata instance never reaches zero over the last 24 hours. (For more context, the chart is generated by this plugin)
I have configured /etc/netdata/health.d/apt.conf in the following way:
alarm: apt_upgradable
on: apt.upgradable
lookup: min -1d of upgradable
every: 60s
warn: $this > 0
units: packages
info: packages with available upgrades
to: sysadmin
However, this alarm stays active, even though the apt.upgradable chart value has been 0 for several hours. My goal is for this alarm to be active only when the chart value has constantly been > 0 for the last day.
You could perhaps use sum. When the sum over the last day is 0, the alert will be clear; if there is a package to be upgraded, it will raise a warning. However, clearing after a warning would take a whole day without any packages to update, so I’m not sure you’d like that.
I would also suggest shortening the window from a day to maybe an hour. I’ll try to think of a better way.
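For illustration, that alternative would look roughly like this (the 1-hour window is only an example, adjust to taste):
lookup: sum -1h of upgradable
every: 60s
warn: $this > 0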
I cannot use sum in this case, because I only expect the alarm to be raised if the chart value is constantly > 0 over the last 24 hours.
It is normal and expected to have a value > 0 for a brief period (up to 23h59m): unattended-upgrades runs once a day and automatically upgrades the packages that need it. Even during the automatic upgrade process itself, there is a brief period when the chart value is > 0 (package lists have been updated but the package upgrade process is not yet finished). So using sum would trigger an alarm (and not clear it for 24 hours) even when no manual action is needed (the situation resolves itself within the accepted delay).
However, if the number of upgradable packages is constantly > 0 for the last 24 hours, that may indicate a problem that needs to be looked into (a specific package fails to upgrade, auto-upgrades from a specific third-party repository are not enabled and the package must be upgraded manually, etc.).
In theory, lookup: min -1d of upgradable should describe this condition exactly, correct? But as you can see, it does not work in practice.
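To spell out the expectation with made-up sample values (one sample per collector interval):
last 24h samples: 3, 3, 0, 0, 1, 2  ->  min = 0  ->  alarm should be clear
last 24h samples: 3, 3, 1, 1, 1, 2  ->  min = 1  ->  alarm should be raised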
Hi, I just had the same situation: one package was not automatically upgraded (only for a few hours) and the alarm was raised. I manually upgraded it at 15:00~15:10 GMT+2, but the alarm is still active right now (16:15 GMT+2). I will check when the alarm clears and let you know.
From the looks of these charts, it appears the collector isn’t collecting data at all times? There are quite a few gaps, which could interfere with the alert’s calculation. Could we check that first?
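If it helps, one way to eyeball the raw samples over the last day is the agent’s REST API (assuming the default local host and port); gaps show up as missing or empty values:
$ curl -s 'http://localhost:19999/api/v1/data?chart=apt.upgradable&after=-86400&format=csv&options=unaligned'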
Here is the current configuration for the collector:
$ sudo cat /etc/netdata/python.d/apt.conf
# netdata python.d.plugin configuration for apt
#
# This file is in YaML format. Generally the format is:
#
# name: value
#
# There are 2 sections:
# - global variables
# - one or more JOBS
#
# JOBS allow you to collect values from multiple sources.
# Each source will have its own set of charts.
#
# JOB parameters have to be indented (using spaces only, example below).
# ----------------------------------------------------------------------
# Global Variables
# These variables set the defaults for all JOBs, however each JOB
# may define its own, overriding the defaults.
# update_every sets the default data collection frequency.
# If unset, the python.d.plugin default is used.
update_every: 600
# priority controls the order of charts at the netdata dashboard.
# Lower numbers move the charts towards the top of the page.
# If unset, the default of 90000 is used.
# priority: 90000
# penalty indicates whether to apply penalty to update_every in case of failures.
# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
# penalty: yes
# autodetection_retry sets the job re-check interval in seconds.
# The job is not deleted if check fails.
# Attempts to start the job are made once every autodetection_retry.
# This feature is disabled by default.
# autodetection_retry: 0
# ----------------------------------------------------------------------
# JOBS (data collection sources)
#
# The default JOBS share the same *name*. JOBS with the same name
# are mutually exclusive. Only one of them will be allowed running at
# any time. This allows autodetection to try several alternatives and
# pick the one that works.
#
# Any number of jobs is supported.
#
# All python.d.plugin JOBS (for all its modules) support a set of
# predefined parameters. These are:
#
# job_name:
# name: myname # the JOB's name as it will appear at the
# # dashboard (by default is the job_name)
# # JOBs sharing a name are mutually exclusive
# update_every: 1 # the JOB's data collection frequency
# priority: 60000 # the JOB's order on the dashboard
# penalty: yes # the JOB's penalty
# autodetection_retry: 0 # the JOB's re-check interval in seconds
#
# This module does not provide any additional option.
And the health configuration:
$ sudo cat /etc/netdata/health.d/apt.conf
alarm: apt_upgradable
on: apt.upgradable
lookup: min -1d of upgradable
every: 60s
warn: $this > 0
units: packages
info: packages with available upgrades
to: sysadmin
class: Errors
alarm: apt_distribution_version
on: apt.distribution_version
calc: $distribution_version
every: 60s
warn: $this < 12
crit: $this < 11
units: distribution version
info: distribution upgrade available
to: sysadmin
class: Errors
alarm: distribution_version_error
on: apt.distribution_version_error
calc: $distribution_version_error
every: 60s
warn: $this > 0
units: apt failed check
info: state file was unreadable
to: sysadmin
class: Errors
Another possibility that we think might be the cause is that the query reads its values from a tier other than the first one (because of the -1d), and is therefore unable to detect a 0 value (since the values in the other tiers are averaged).
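As a made-up illustration of how averaging can hide the zero (assuming, say, six raw samples collapsed into one higher-tier point):
tier 0 samples: 3, 3, 0, 0, 0, 3               ->  min = 0    ->  alarm would clear
tier 1 point:   (3+3+0+0+0+3)/6 = 1.5 average  ->  min = 1.5  ->  alarm stays raised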
A possible solution would be to force health to run alerts from values in tier 0 only. I’ll try to check and update.
Edit: Could you check if the alert works better with latest nightlies? Thanks!
I’ve updated the lookup expression on all my hosts and will keep an eye on the results.
Another thought, about the collector’s update_every of 600: could you perhaps test with e.g. 60?
I’d rather avoid running these costly checks too often (a maximum 10-minute lag is acceptable), but I have switched to 60s on a few hosts and will keep an eye on the results.
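For reference, the only change on those hosts is this one line in /etc/netdata/python.d/apt.conf:
update_every: 60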
Could you check if the alert works better with latest nightlies?
I deploy from the Debian packages published at netdata/netdata on packagecloud, so I will have to wait for the next release (which will give me some time to evaluate the changes I already made). I will keep you updated.
The problem now seems to be resolved by adding unaligned to the lookup expression (lookup: min -1d unaligned of upgradable) on netdata 1.43.0. The collector’s update_every has been kept at 600 and the alarm’s every at 60s.
The alarm now clears immediately when the collector updates the chart and the new minimum value becomes 0.
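For reference, the full alarm definition now reads (identical to the original, with only unaligned added):
alarm: apt_upgradable
on: apt.upgradable
lookup: min -1d unaligned of upgradable
every: 60s
warn: $this > 0
units: packages
info: packages with available upgrades
to: sysadmin
class: Errors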