Monitoring edit backlog help

luison · October 28, 2023, 7:48am

With the exception of the alarm editing options, I am a very happy Netdata user. I understand that our environment (containers running on Proxmox VE) needs some special configuration, particularly as we use LVM and create many snapshots for backup purposes, which tend to trigger too many alerts.

I am trying to reduce, rather than silence, most of the many alerts I receive, and I am stuck on this one. I know there is documentation on alarm editing, but from my point of view there is a lot of room to improve it.

I keep getting alerts like this:

10min_disk_backlog

on xxxxx
124132.79 ms

Details: average of the kernel estimated disk backlog, for the last 15 minutes

Chart: disk_backlog.dm-16-0l080Zl6JNEpMp03Bz3zT6HRhbRSYH8t53LqzJ7r8q3kSQk3JU9KIzvP9kgYO33G-tpool
Context: disk.backlog
Family:
Raised to critical, for 0 seconds

I edited the alert config with:

template: 10min_disk_backlog
      on: disk.backlog
      os: linux
   hosts: *
families: !pve-vm--* !pve-thin-t* !dm-*
  lookup: average -20m unaligned
   units: ms
   every: 10m
   green: 6000
     red: 20000
    warn: $this > $green * (($status >= $WARNING)  ? (0.7) : (1))
    crit: $this > $red   * (($status == $CRITICAL) ? (0.7) : (1))
   delay: down 60m multiplier 1.2 max 2h
    info: average of the kernel estimated disk backlog, for the last 60 minutes
      to: sysadmin

But obviously, as the alert has no “family” on it, I understand the filter is not being applied, so I am not sure how to remove all the dm-?? alerts.

I also have the impression that this alert started with a recent update. Currently running v1.43.1.

Alert editing, or at least an assistant for it, would be for me the most needed improvement to Netdata.

ilyam8 · October 28, 2023, 11:57am

Hi, @luison. There are 2 options:

Disable this alarm. I personally don’t really like it because of its ambiguity in general.
Exclude specific devices using the chart labels filter.

See chart labels usage example. You will need the same but use device instead of mount_point. The device label values can be found:
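
As a rough sketch of the second option, the template posted earlier in the thread could swap its families line for a chart labels filter. The dm-* pattern is an assumption: it presumes the device label of the unwanted charts starts with dm-, which should be checked against the actual label values before use. The info text is also adjusted here to match the 20-minute lookup.

```
# Sketch only. Assumed pattern: skip charts whose device label starts with "dm-",
# then match every remaining device with the trailing "*".
template: 10min_disk_backlog
      on: disk.backlog
      os: linux
   hosts: *
chart labels: device=!dm-* *
  lookup: average -20m unaligned
   units: ms
   every: 10m
   green: 6000
     red: 20000
    warn: $this > $green * (($status >= $WARNING)  ? (0.7) : (1))
    crit: $this > $red   * (($status == $CRITICAL) ? (0.7) : (1))
   delay: down 60m multiplier 1.2 max 2h
    info: average of the kernel estimated disk backlog, for the last 20 minutes
      to: sysadmin
```

The trailing * matters: Netdata simple patterns are evaluated left to right and the first match wins, so without a final positive pattern nothing would match at all. Chart labels such as device are normally attached by the collector itself, so they should not need to be set by hand for devices that come and go.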

luison · October 28, 2023, 12:17pm

Thanks. Yes, as you say, I think too many disk-related alerts are enabled by default and duplicate information.
In this case I understand the dm-XX units are LVM thin pool snapshots. Other alerts I get affect LVM-thin metadata too.

I will likely end up disabling it completely, but this kind of alert should perhaps be defined the other way round: just specify which physical or logical units to monitor. In our case all physical disks are in a RAID and then under LVM, so monitoring partitions like /sdb2 is nonsense, as it is likely monitoring ALL the LVM logical partitions.

As for your approach, I am not sure it works for me. I don’t think dm-16-0l080Zl6JNEpMp03Bz3zT6HRhbRSYH8t53LqzJ7r8q3kSQk3JU9KIzvP9kgYO33G-tpool is “mounted” at all, so how am I supposed to label it if this unit's name gets created and deleted?

luison · November 7, 2023, 1:38pm

I ended up having to disable all these alerts. It is a shame that the alert adjustment process is so complicated in Netdata.
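
For reference, one way to switch this alert off entirely is sketched below. It assumes the installed agent version still honours the enabled alarms setting in the [health] section of netdata.conf, which is worth confirming before relying on it.

```
[health]
    # Assumed simple pattern: disable 10min_disk_backlog, keep every other alarm.
    enabled alarms = !10min_disk_backlog *
```

A less drastic variant is to keep the template and set its to: line to silent, which stops the notifications while leaving the alert visible on the dashboard.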

I could not find a way to disable it by families or labels either.

If anyone can help, I will try again.

The alert being raised is described like this (same with backlog):
average of all values of chart disk_util.dm-12-0l080Zl6JNEpMp03Bz3zT6HRhbRSYH8tcOdVfWcq3R5iBcckaUUYF3hi69ijMyO2, starting 20 minutes ago and up to now, with options unaligned
