Trigger an alarm when a chart never reaches zero over the last 24 hours

Problem/Question

Hi,
I want to trigger an alarm when a particular chart on my netdata instance never reaches zero over the last 24 hours. (For more context, the chart is generated by this plugin)

I have configured /etc/netdata/health.d/apt.conf in the following way:

 alarm: apt_upgradable
    on: apt.upgradable
lookup: min -1d of upgradable
 every: 60s
  warn: $this > 0
 units: packages
  info: packages with available upgrades
    to: sysadmin

However, this alarm stays active, even though the apt.upgradable chart value has been 0 for several hours. My goal is for this alarm to be active only when the chart value has constantly been > 0 for the last day.

Relevant docs you followed/actions you took to solve the issue

  • Double-checked that my lookup expression was correct on Configure alerts | Learn Netdata (it doesn’t mention the min operator)
  • Tried to find other example uses of the min operator in stock netdata health configuration
  • Tried changing the health configuration file in a few ways (warn: $this != nan AND $this > 0…), without results

Here is the chart over the last 26 hours.

Environment/Browser/Agent’s version etc

  • Debian 11/12
  • Netdata agent 1.40.1

What I expected to happen

In these conditions I don’t expect the alarm to be active.

Hi @nodiscc

You could perhaps use sum. When the sum is 0 over the last day, the alert will be clear. If there is a package to be upgraded, it will be raised to warning. However, clearing after a warning would take a whole day without any packages to update, so I’m not sure you’d like that.

I would also suggest shortening the lookup window from a day to perhaps an hour. I’ll try to think of a better way.

Hi @Manolis_Vasilakis , thanks for the reply.

I cannot use sum in this case, because I only expect the alarm to be raised if the chart value is constantly > 0 over the last 24 hours.

It is normal and expected to have a value > 0 for a brief period (up to 23h59): unattended-upgrades runs once a day and automatically upgrades the packages that need upgrading. Even during the automatic upgrade process itself, there is a brief period when the chart value is > 0 (package lists have been updated but the package upgrade process is not completely finished). So using sum would trigger an alarm (and not clear it for 24 hours) even if no manual action is needed (the situation will resolve itself within the accepted delay).

However, if the number of upgradable packages is constantly > 0 for the last 24 hours, that may indicate a problem that needs to be looked into (a specific package fails to upgrade, or auto-upgrades from a specific third-party repository are not enabled and the package must be upgraded manually, etc).

I think lookup: min -1d of upgradable should perfectly describe this condition in theory, correct? But as you can see, it does not work in practice.
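As a rough illustration of the difference between the two conditions, here is a minimal Python sketch. It is plain Python, not Netdata code: the function names and the sample series are made up for the example, and it only models the intended semantics of sum and min over a 24-hour window.

```python
# Hypothetical model of the two alert conditions over one 24-hour window.
# Each list holds the chart values seen in the window (values are invented).

def sum_alert(samples):
    """Fires if any package was upgradable at any point in the window."""
    return sum(samples) > 0

def min_alert(samples):
    """Fires only if the value never dropped to 0 in the window."""
    return min(samples) > 0

# Upgrades were pending for a while, then applied automatically.
brief_pending = [0] * 100 + [3] * 10 + [0] * 34

# One package stayed upgradable for the whole window.
stuck = [1] * 144

print("sum, brief:", sum_alert(brief_pending))
print("min, brief:", min_alert(brief_pending))
print("min, stuck:", min_alert(stuck))
```

With sum, the brief pending period alone is enough to raise the alarm; with min, only the stuck series does, which is the behavior I am after.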

Yes, in this respect it makes sense. Indeed the min in this case sounds reasonable. I’ll try to reproduce a similar alert and will let you know.

Hey, @nodiscc. And what value do you see in Alarms->All->your alarm?

Hi @nodiscc

Can you try lookup: min -1d unaligned of upgradable ?

Hi @ilyam8

I had another occurrence of this problem today, which allowed me to investigate again:

  • a package had an upgrade pending for several days → chart value constantly > 0 → alarm raised → good!
  • changed lookup: to min -1d unaligned of upgradable, ran sudo netdatacli reload-health
  • ran apt upgrade, package successfully upgraded, verified that no more upgrades are available with apt list --upgradable → OK

  • Waited 10 minutes, expecting the alarm to clear → alarm not cleared:

what value do you see in Alarms->All->your alarm?

As you can see the value stays at 1

Hi @nodiscc

Did the alarm eventually clear at some point? I will check this again…

Hi, I just had the same situation: one package was not automatically upgraded (only for a few hours) and the alarm was raised. I manually upgraded it at 15:00~15:10 GMT+2, but the alarm is still active right now (16:15 GMT+2). I will check when the alarm clears and let you know.

I checked the event log for this alarm (it is still active, ~5 hours after):

Here is the chart to which this alarm is attached (last 6 hours):

I will provide an update when the alarm clears.

It cleared at 02:52:45, for no apparent reason (exactly 14 hours after it was first raised):

Hi @nodiscc

From the looks of these charts, it appears the collector isn’t collecting data at all times? There are quite a few gaps. That could interfere with the alert’s calculation. Could we check that first?

Are you using a custom collector to gather these?

Hi,

it appears the collector isn’t collecting data at all times? There are quite some gaps?

Correct, these gaps are caused by the VM/host not being on all the time.

Are you using a custom collector to gather these?

Yes, as mentioned in the first post, this chart is generated by nodiscc/netdata-apt on GitHub (a netdata plugin that checks/graphs the number of upgradable packages), which I maintain.

Currently you can see the alarm being active:

However upgrades were applied a few hours ago (so the alarm should now be inactive):

Here is the current configuration for the collector:

$ sudo cat /etc/netdata/python.d/apt.conf 
# netdata python.d.plugin configuration for apt
#
# This file is in YaML format. Generally the format is:
#
# name: value
#
# There are 2 sections:
#  - global variables
#  - one or more JOBS
#
# JOBS allow you to collect values from multiple sources.
# Each source will have its own set of charts.
#
# JOB parameters have to be indented (using spaces only, example below).

# ----------------------------------------------------------------------
# Global Variables
# These variables set the defaults for all JOBs, however each JOB
# may define its own, overriding the defaults.

# update_every sets the default data collection frequency.
# If unset, the python.d.plugin default is used.
update_every: 600

# priority controls the order of charts at the netdata dashboard.
# Lower numbers move the charts towards the top of the page.
# If unset, the default of 90000 is used.
# priority: 90000

# penalty indicates whether to apply penalty to update_every in case of failures.
# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
# penalty: yes

# autodetection_retry sets the job re-check interval in seconds.
# The job is not deleted if check fails.
# Attempts to start the job are made once every autodetection_retry.
# This feature is disabled by default.
# autodetection_retry: 0

# ----------------------------------------------------------------------
# JOBS (data collection sources)
#
# The default JOBS share the same *name*. JOBS with the same name
# are mutually exclusive. Only one of them will be allowed running at
# any time. This allows autodetection to try several alternatives and
# pick the one that works.
#
# Any number of jobs is supported.
#
# All python.d.plugin JOBS (for all its modules) support a set of
# predefined parameters. These are:
#
# job_name:
#     name: myname            # the JOB's name as it will appear at the
#                             # dashboard (by default is the job_name)
#                             # JOBs sharing a name are mutually exclusive
#     update_every: 1         # the JOB's data collection frequency
#     priority: 60000         # the JOB's order on the dashboard
#     penalty: yes            # the JOB's penalty
#     autodetection_retry: 0  # the JOB's re-check interval in seconds
#
# This module does not provide any additional option.

And the health configuration:

$ sudo cat /etc/netdata/health.d/apt.conf
 alarm: apt_upgradable
    on: apt.upgradable
lookup: min -1d of upgradable
 every: 60s
  warn: $this > 0
 units: packages
  info: packages with available upgrades
    to: sysadmin
 class: Errors

 alarm: apt_distribution_version
    on: apt.distribution_version
  calc: $distribution_version
 every: 60s
  warn: $this < 12
  crit: $this < 11
 units: distribution version
  info: distribution upgrade available
    to: sysadmin
 class: Errors

 alarm: distribution_version_error
    on: apt.distribution_version_error
  calc: $distribution_version_error
 every: 60s
  warn: $this > 0
 units: apt failed check
  info: state file was unreadable
    to: sysadmin
 class: Errors

Hi @nodiscc

Thanks. I’ll set them up on a Debian test node and try to debug.

So, I’ve been testing this for a couple of days, and indeed there is an issue.

unaligned in lookup seems to give better results.

One guess is that the problem lies with the collector’s update_every of 600. Could you perhaps test with, e.g., 60?

Another possibility that we think might be the cause is that the lookup gets its values from a tier other than the first one (because of the -1d), so it cannot actually detect a 0 value (since the values in the other tiers are averaged).

A possible solution would be to force health to run alerts from values in tier 0 only. I’ll try to check and update.

Edit: Could you check if the alert works better with latest nightlies? Thanks!
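To show why averaged tiers could hide a zero, here is a small Python sketch. It is only a model of the hypothesis above, not Netdata’s actual storage code: the helper downsample_by_average, the factor of 3, and the sample values are all invented for illustration.

```python
# Hypothetical model: a higher tier stores the average of several
# tier-0 points. FACTOR and the values below are made up.

FACTOR = 3

def downsample_by_average(points, factor):
    """Replace each group of `factor` points with its average."""
    groups = [points[i:i + factor] for i in range(0, len(points), factor)]
    return [sum(group) / len(group) for group in groups]

# Tier 0 keeps every collected value, including the zeros.
tier0 = [2, 2, 0, 0, 2, 2]

# The averaged tier mixes each zero with its non-zero neighbours.
tier1 = downsample_by_average(tier0, FACTOR)

print("tier 0 minimum:", min(tier0))
print("averaged tier minimum is above zero:", min(tier1) > 0)
```

In this model the minimum over tier 0 is 0, so warn: $this > 0 would clear, while the minimum over the averaged tier stays above 0 and the alert would remain raised.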

unaligned in lookup seems to give better results.

I’ve updated the lookup expression on all my hosts and will keep an eye on the results.

the 600 update every of the collector. Could you perhaps test with e.g. 60 ?

I’d rather avoid running these costly checks too often (a maximum lag of 10 minutes is acceptable), but I have switched to 60s on a few hosts and will keep an eye on the results.

Another possibility that we think might be the cause is that the lookup gets its values from a tier other than the first one (because of the -1d), so it cannot actually detect a 0 value (since the values in the other tiers are averaged).
Could you check if the alert works better with latest nightlies?

I deploy from the Debian packages in the netdata/netdata repository on packagecloud, so I will have to wait for the next release (which will give me some time to evaluate the changes I already made). I will keep you updated.

Thanks again for your help

Hi,

The problem now seems to be resolved by adding unaligned to the lookup expression (lookup: min -1d unaligned of upgradable) on netdata 1.43.0. The collector’s update_every has been kept at 600, and the alarm’s every has been kept at 60s.

The alarm is cleared immediately when the collector updates the chart and the new minimum value becomes 0.

Thanks @Manolis_Vasilakis @ilyam8 for your help.

As a last request, where can I read more about the unaligned operator? I could not find anything on Configure alerts | Learn Netdata

Thanks again!

Hi @nodiscc glad to see it worked.

For unaligned, there is some more information here → Database queries/lookup | Learn Netdata
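As a rough mental model of what unaligned changes, here is a Python sketch. It rests on an assumption that is a simplification and has not been checked against Netdata’s source: an aligned query snaps the end of its window to a multiple of the window duration, while an unaligned query covers exactly the requested period ending now. The function names, timestamps, and values are invented for the example.

```python
# Simplified, hypothetical model of aligned vs unaligned lookup windows.
# Assumption: an aligned window ends on a multiple of its duration.

DAY = 86400

def unaligned_window(now, duration=DAY):
    """Window covering exactly the last `duration` seconds."""
    return (now - duration, now)

def aligned_window(now, duration=DAY):
    """Window whose end is snapped down to a multiple of `duration`."""
    end = now - (now % duration)
    return (end - duration, end)

def window_min(samples, window):
    """Minimum value among (timestamp, value) samples inside the window."""
    start, end = window
    values = [value for ts, value in samples if start < ts <= end]
    return min(values) if values else None

# "now" is 15:00 on some day; one sample every 600 seconds.
now = 10 * DAY + 15 * 3600

# The chart reads 1 until 30 minutes ago, then 0 (upgrade applied).
samples = [(ts, 0 if ts > now - 1800 else 1)
           for ts in range(now - 2 * DAY, now + 1, 600)]

print("aligned min:", window_min(samples, aligned_window(now)))
print("unaligned min:", window_min(samples, unaligned_window(now)))
```

Under this assumption the aligned window ends at the last day boundary, so it does not yet contain the recent zeros and the minimum stays at 1, while the unaligned window reaches up to the present and its minimum drops to 0 as soon as a zero is collected.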
