Trigger an alarm when a chart never reaches zero over the last 24 hours

nodiscc · July 16, 2023, 12:45pm

Problem/Question

Hi,
I want to trigger an alarm when a particular chart on my netdata instance never reaches zero over the last 24 hours. (For more context, the chart is generated by the netdata-apt plugin.)

I have configured /etc/netdata/health.d/apt.conf in the following way:

 alarm: apt_upgradable
    on: apt.upgradable
lookup: min -1d of upgradable
 every: 60s
  warn: $this > 0
 units: packages
  info: packages with available upgrades
    to: sysadmin

However, this alarm stays active, even though the apt.upgradable chart value has been 0 for several hours. My goal is for this alarm to be active only when the chart value has constantly been > 0 for the last day.

Relevant docs you followed/actions you took to solve the issue

Double-checked that my lookup expression was correct on Configure Health Alerts | Learn Netdata (it doesn’t mention the min operator)
Tried to find other example uses of the min operator in stock netdata health configuration
Tried changing the health configuration file in a few ways (warn: $this != nan AND $this > 0…), without results

Here is the chart over the last 26 hours.

Environment/Browser/Agent’s version etc

Debian 11/12
Netdata agent 1.40.1

What I expected to happen

In these conditions I don’t expect the alarm to be active.

Manolis_Vasilakis · July 17, 2023, 4:23pm

Hi @nodiscc

You could perhaps use sum. When the sum is 0 over the last day, the alert will be clear. If there is a package to be upgraded it will raise to warning. However, clearing after a warning would take a whole day without any packages to update, so not sure if you’d like that.

I would also suggest shortening the lookup window from a day to perhaps an hour. I’ll try to think of a better way.

nodiscc · July 17, 2023, 4:35pm

Hi @Manolis_Vasilakis , thanks for the reply.

I cannot use sum in this case, because I only expect the alarm to be raised if the chart value is constantly > 0 over the last 24 hours.

It is normal and expected to have a value > 0 for a brief period (up to 23h59): unattended-upgrades runs once a day and automatically upgrades the packages that need it. Even during the automatic upgrade process itself, there is a brief period when the chart value is > 0 (package lists have been updated but the package upgrade process is not completely finished). So using sum would trigger an alarm (and not clear it for 24 hours) even if no manual action is needed, because the situation resolves itself within the accepted delay.

However, if the number of upgradable packages is constantly > 0 for the last 24 hours, that may indicate a problem that needs to be looked into (a specific package fails to upgrade, or auto-upgrades from a specific third-party repository are not enabled and the package must be upgraded manually, etc).

I think lookup: min -1d of upgradable should perfectly describe this condition in theory, correct? But as you can see, it does not work in practice.
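To make the difference between the two reducers concrete, here is a minimal sketch in Python. It is not Netdata code: `reduce_window`, `warns` and the sample data are hypothetical, and the windows assume one sample every 600 s over 24 hours. It only illustrates why `min` describes "constantly > 0" while `sum` does not.

```python
def reduce_window(samples, method):
    """Reduce one lookup window to a single value; None marks a gap."""
    values = [v for v in samples if v is not None]
    if not values:
        return None
    if method == "min":
        return min(values)
    if method == "sum":
        return sum(values)
    raise ValueError(f"unsupported method: {method}")


def warns(value):
    """Mirror of the `warn: $this > 0` expression (assumed semantics)."""
    return value is not None and value > 0


# Hypothetical 24 h windows, one sample every 600 s (144 samples each).
self_healing = [0] * 100 + [3] * 4 + [0] * 40  # upgrades pending ~40 min, then applied
stuck = [1] * 144                              # one package never gets upgraded

for name, window in (("self_healing", self_healing), ("stuck", stuck)):
    for method in ("sum", "min"):
        print(name, method, warns(reduce_window(window, method)))
```

With `sum`, both windows raise the warning. With `min`, only the `stuck` window does, which is the behaviour wanted here.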

Manolis_Vasilakis · July 17, 2023, 4:39pm

Yes, in this respect it makes sense. Indeed the min in this case sounds reasonable. I’ll try to reproduce a similar alert and will let you know.

ilyam8 · July 17, 2023, 8:34pm

Hey, @nodiscc. And what value do you see in Alarms->All->your alarm?

Manolis_Vasilakis · July 18, 2023, 9:12am

Hi @nodiscc

Can you try lookup: min -1d unaligned of upgradable ?

nodiscc · July 25, 2023, 9:54am

Hi @ilyam8

I had another occurrence of this problem today, which allowed me to investigate again:

a package had an upgrade pending for several days → chart value constantly > 0 → alarm raised → good!
changed lookup: to min -1d unaligned of upgradable, ran sudo netdatacli reload-health
ran apt upgrade, package successfully upgraded, verified that no more upgrades are available with apt list --upgradable → OK

Waited 10 minutes, expecting the alarm to clear → alarm not cleared:

what value do you see in Alarms->All->your alarm?

As you can see the value stays at 1

Manolis_Vasilakis · August 1, 2023, 1:02pm

Hi @nodiscc

Did it eventually clear at some point? I’ll check this again…

nodiscc · August 1, 2023, 2:16pm

Hi, I just had the same situation: 1 package was not automatically upgraded (only for a few hours) and the alarm was raised. I manually upgraded it at 15:00~15:10 GMT+2, and the alarm is still active right now (16:15 GMT+2). I will check when the alarm clears and let you know.

nodiscc · August 1, 2023, 3:57pm

I checked the event log for this alarm (it is still active, ~5 hours after):

Here is the chart to which this alarm is attached (last 6 hours):

I will provide an update when the alarm clears.

nodiscc · August 2, 2023, 9:00am

It cleared at 02:52:45, for no apparent reason (exactly 14 hours after it was first raised):

Manolis_Vasilakis · September 18, 2023, 7:57am

Hi @nodiscc

From the looks of these charts, it appears the collector isn’t collecting data at all times? There are quite some gaps? That could interfere with the alert’s calculation. Could we check that first?

Are you using a custom collector to gather these?

nodiscc · September 18, 2023, 3:34pm

Hi,

it appears the collector isn’t collecting data at all times? There are quite some gaps?

Correct, these gaps are caused by the VM/host not being on all the time.

Are you using a custom collector to gather these?

Yes, as mentioned in the first post, this chart is generated by GitHub - nodiscc/netdata-apt: [mirror] Check/graph number of upgradable packages - netdata plugin, which I maintain.

Currently you can see the alarm being active:

However, upgrades were applied a few hours ago (so the alarm should now be inactive):

Here is the current configuration for the collector:

$ sudo cat /etc/netdata/python.d/apt.conf 
# netdata python.d.plugin configuration for apt
#
# This file is in YaML format. Generally the format is:
#
# name: value
#
# There are 2 sections:
#  - global variables
#  - one or more JOBS
#
# JOBS allow you to collect values from multiple sources.
# Each source will have its own set of charts.
#
# JOB parameters have to be indented (using spaces only, example below).

# ----------------------------------------------------------------------
# Global Variables
# These variables set the defaults for all JOBs, however each JOB
# may define its own, overriding the defaults.

# update_every sets the default data collection frequency.
# If unset, the python.d.plugin default is used.
update_every: 600

# priority controls the order of charts at the netdata dashboard.
# Lower numbers move the charts towards the top of the page.
# If unset, the default of 90000 is used.
# priority: 90000

# penalty indicates whether to apply penalty to update_every in case of failures.
# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
# penalty: yes

# autodetection_retry sets the job re-check interval in seconds.
# The job is not deleted if check fails.
# Attempts to start the job are made once every autodetection_retry.
# This feature is disabled by default.
# autodetection_retry: 0

# ----------------------------------------------------------------------
# JOBS (data collection sources)
#
# The default JOBS share the same *name*. JOBS with the same name
# are mutually exclusive. Only one of them will be allowed running at
# any time. This allows autodetection to try several alternatives and
# pick the one that works.
#
# Any number of jobs is supported.
#
# All python.d.plugin JOBS (for all its modules) support a set of
# predefined parameters. These are:
#
# job_name:
#     name: myname            # the JOB's name as it will appear at the
#                             # dashboard (by default is the job_name)
#                             # JOBs sharing a name are mutually exclusive
#     update_every: 1         # the JOB's data collection frequency
#     priority: 60000         # the JOB's order on the dashboard
#     penalty: yes            # the JOB's penalty
#     autodetection_retry: 0  # the JOB's re-check interval in seconds
#
# This module does not provide any additional option.

And the health configuration:

$ sudo cat /etc/netdata/health.d/apt.conf
 alarm: apt_upgradable
    on: apt.upgradable
lookup: min -1d of upgradable
 every: 60s
  warn: $this > 0
 units: packages
  info: packages with available upgrades
    to: sysadmin
 class: Errors

 alarm: apt_distribution_version
    on: apt.distribution_version
  calc: $distribution_version
 every: 60s
  warn: $this < 12
  crit: $this < 11
 units: distribution version
  info: distribution upgrade available
    to: sysadmin
 class: Errors

 alarm: distribution_version_error
    on: apt.distribution_version_error
  calc: $distribution_version_error
 every: 60s
  warn: $this > 0
 units: apt failed check
  info: state file was unreadable
    to: sysadmin
 class: Errors

Manolis_Vasilakis · September 19, 2023, 12:40pm

Hi @nodiscc

Thanks. I’ll set them up on a Debian test node and will try to debug.

Manolis_Vasilakis · September 21, 2023, 1:07pm

So, I’ve been testing this for a couple of days, and indeed there is an issue.

unaligned in lookup seems to give better results.

One guess perhaps is that the problem lies with the 600 update every of the collector. Could you perhaps test with e.g. 60 ?

Manolis_Vasilakis · September 21, 2023, 3:20pm

Another possibility that we think might be the cause, is that it gets the values from another tier than the first one (because of the -1d), leading to not being actually able to detect a 0 value (since the values from other tiers are averaged).

A possible solution would be to force health to run alerts from values in tier 0 only. I’ll try to check and update.
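The tier hypothesis can be illustrated with a small Python sketch. This is only a model of the guess above, not a description of how Netdata actually stores or queries higher tiers: `downsample_average` and the sample values are hypothetical.

```python
def downsample_average(points, factor):
    """Average consecutive groups of `factor` points (hypothetical coarser tier)."""
    groups = [points[i:i + factor] for i in range(0, len(points), factor)]
    return [sum(group) / len(group) for group in groups]


# Hypothetical tier 0 samples: the chart touches 0 exactly once.
tier0 = [2, 2, 2, 0, 2, 2, 2, 2]
coarse = downsample_average(tier0, 4)

print(min(tier0))   # the zero is visible in the raw samples
print(min(coarse))  # the zero is averaged away in the coarser series
```

If the lookup were served from such an averaged series, `min` would stay above 0 even though the chart did reach 0, and `warn: $this > 0` would remain true.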

Edit: Could you check if the alert works better with latest nightlies? Thanks!

nodiscc · October 11, 2023, 1:37pm

unaligned in lookup seems to give better results.

I’ve updated the lookup expression on all my hosts and will keep an eye on the results.

the 600 update every of the collector. Could you perhaps test with e.g. 60 ?

I’d rather avoid running these costly checks too often (a maximum 10 minute lag is acceptable), but I have switched to 60s on a few hosts and will keep an eye on the results.

Another possibility that we think might be the cause, is that it gets the values from another tier than the first one (because of the -1d), leading to not being actually able to detect a 0 value (since the values from other tiers are averaged).
Could you check if the alert works better with latest nightlies?

I deploy from Debian packages from netdata/netdata - Packages · packagecloud, so I will have to wait for the next release (which will give me some time to evaluate the changes I already made). I will keep you updated.

Thanks again for your help

nodiscc · October 25, 2023, 8:40pm

Hi,

The problem now seems to be resolved by adding unaligned to the lookup expression (lookup: min -1d unaligned of upgradable) on netdata 1.43.0. The collector’s update_every has been kept at 600, and the alarm’s every has been kept at 60s.

The alarm is cleared immediately when the collector updates the chart and the new minimum value becomes 0.

Thanks @Manolis_Vasilakis @ilyam8 for your help.

As a last request, where can I read more about the unaligned operator? I could not find anything on Configure alerts | Learn Netdata

Thanks again!

Manolis_Vasilakis · October 31, 2023, 1:12pm

Hi @nodiscc, glad to see it worked.

For unaligned, there is some more information here → Database queries/lookup | Learn Netdata
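As a rough illustration of what alignment could mean for a -1d lookup, here is a simplified Python model. It assumes that an aligned query snaps the end of the window down to a multiple of the query duration, while an unaligned query ends at "now". That assumption, the `query_window` helper and the timestamps are all hypothetical, so please check the linked documentation for the actual behaviour.

```python
DAY = 86400  # seconds


def query_window(now, duration, aligned=True):
    """Return (start, end) of a lookup window under the simplified model above."""
    end = now - (now % duration) if aligned else now
    return end - duration, end


def in_window(timestamp, window):
    start, end = window
    return start <= timestamp <= end


# Hypothetical clock: 15:00 on some day, with a 0 sample collected 10 minutes earlier.
now = 3 * DAY + 15 * 3600
zero_sample_time = now - 600

aligned = query_window(now, DAY, aligned=True)
unaligned = query_window(now, DAY, aligned=False)

print(aligned, in_window(zero_sample_time, aligned))
print(unaligned, in_window(zero_sample_time, unaligned))
```

Under this model the aligned window ends at the previous day boundary, so a recent 0 sample is not seen until the boundary moves. The unaligned window always includes the most recent samples.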
