I would like to find out which application is triggering the alert, but I could not find a way to print the dimension in the alert notification. The following is my cpu.conf:
template: 1min_appcpu_usage
on: apps.cpu
os: linux
hosts: *
lookup: average -1m unaligned of *
units: %
every: 10s
warn: $this > (($status >= $WARNING) ? (190) : (400))
crit: $this > (($status == $CRITICAL) ? (401) : (3200))
delay: down 15m multiplier 1.5 max 1h
info: cpu utilization for the last minute
to: sysadmin
Regarding the error you are having, I remembered today that when multihost support was merged, we started having problems with foreach alarms. 8 days ago the PR "Fixed issue with missing alarms" by stelfrag (Pull Request #9712, netdata/netdata on GitHub) was merged and the problem was fixed. If you install the nightly Netdata, either from the RPMs or by compiling it with kickstart, your alarms will work as expected.
We would like to apologize to you and all our users for this problem.
I checked the alarm via localhost:19999/api/v1/alarms?all and got:
{
"hostname": "XXXXXXXXXX",
"latest_alarm_log_unique_id": 1597170798,
"status": true,
"now": 1597798827,
"alarms": {
}
}
No data is returned for the alarm. Again, my conf file is the following:
more APPS_CPU.conf
alarm: appscpu1min
on: apps.cpu
os: linux
hosts: *
#lookup: average -10m percentage foreach Spectre
#lookup: average -1m percentage foreach Spectre
lookup: average -1m percentage foreach *
unit: %
every: 1m
every: 10s
warn: $this > (($status >= $WARNING) ? (180) : (190))
crit: $this > (($status == $CRITICAL) ? (190) : (290))
delay: down 15m multiplier 1.5 max 1h
info: App CPU Usage above 190 as warning and 290 as critical for the last 10 minutes
to: silent
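The check against /api/v1/alarms?all described above can be scripted so that an empty "alarms" object is easy to spot after each configuration change. The following is only a minimal sketch: the helper names `alarm_names` and `fetch_alarms` are made up for illustration, and the base URL assumes an agent listening on localhost:19999 as in the post.

```python
import json
from urllib.request import urlopen


def alarm_names(payload: dict) -> list:
    """Return the sorted keys of the "alarms" object of an alarms response."""
    return sorted(payload.get("alarms", {}))


def fetch_alarms(base_url: str = "http://localhost:19999") -> dict:
    """Fetch all alarms (raised or not) from a running agent.

    base_url is an assumption; adjust it to your own agent.
    """
    with urlopen(f"{base_url}/api/v1/alarms?all") as response:
        return json.load(response)


# Response copied from the post above: the "alarms" object is empty,
# which means no alarm was created from the configuration.
empty_response = {
    "hostname": "XXXXXXXXXX",
    "latest_alarm_log_unique_id": 1597170798,
    "status": True,
    "now": 1597798827,
    "alarms": {},
}

print(alarm_names(empty_response))  # prints []
```

Against a live agent, `alarm_names(fetch_alarms())` would list whatever alarms the agent actually loaded.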
lookup: average -30s percentage foreach nfs, email
After executing the request, I observed that I also had the expected dimensions.
Finally I extended the test for 4 dimensions:
lookup: average -30s percentage foreach nfs, email, ssh, kernel
And again I got all alarms.
I would like to point out that if a dimension has not been created, the alarm cannot be created either.
Considering the tests that I ran on my Netdata, I have three questions for you:
1 - Are Spectre,Simvision the exact names that you defined inside your apps_groups.conf?
2 - When you request http://localhost:19999/api/v1/data?chart=apps.cpu, do you see these dimensions?
3 - How did you install your Netdata?
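Question 2 can be answered mechanically by comparing the dimension names used in the foreach clause with the ones the chart actually exposes. This is a sketch under an assumption: that the JSON returned by /api/v1/data?chart=apps.cpu carries a "labels" list whose first entry is "time" and whose remaining entries are dimension names. The helper `missing_dimensions` and the sample response are hypothetical.

```python
def missing_dimensions(data_response: dict, wanted: list) -> list:
    """Return the wanted dimension names that the chart does not expose."""
    labels = data_response.get("labels", [])
    available = set(labels[1:])  # assumption: labels[0] is the "time" column
    return [name for name in wanted if name not in available]


# Hypothetical response, trimmed to the part this check needs.
sample = {"labels": ["time", "nfs", "email", "ssh", "kernel"], "data": []}

print(missing_dimensions(sample, ["Spectre", "Simvision", "nfs"]))
# prints ['Spectre', 'Simvision']
```

Any name reported as missing would not get an alarm, since an alarm cannot be created for a dimension that does not exist.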
#not working. alarm has dimension names but the status showed uninit and removed and no alarm
lookup: average -30s percentage foreach Spectre,Simvision
This one should be working. In fact, “foreach” is the only use case where it makes sense to have the specific dimension that triggered the alarm inside the notification. No other alarm configuration has that ambiguity.
So getting it uninitialized sounds like a bug, and we also need to add something new so that the dimension that triggers the alarm appears in the notifications. See Custom | Learn Netdata for the variables that are currently sent to the notifications script; I suggest we add the dimension name to ${info} and/or ${alarm}.
That’s correct. I don’t see the app name in either the email or the alarm in the web UI. For the time being, I have manually configured an alarm for each of the apps defined in apps_groups.conf. That works, but it is troublesome.
So basically the problem is that you get the alarm, but you don’t know which app triggers it because the app name doesn’t show up? I thought it would pop up under db lookup in the alarm, but I will have to double check it’s working as expected.
The dynamic alarm setting just does not work. See my configuration below:
alarm: app_cpu
on: apps.cpu
#works but it does not provide dimension name in the alarm
lookup: average -30s percentage of S*
#not working. alarm has the dimension name but the status showed uninit and removed and no alarm
lookup: average -30s percentage foreach S*
#not working. alarm has dimension names but the status showed uninit and removed and no alarm
lookup: average -30s percentage foreach Spectre,Simvision
#works but it does not provide dimension name in the alarm
lookup: average -30s percentage Spectre,Simvision
lookup: average -30s percentage foreach *
unit: %
every: 3s
warn: $this > 1
crit: $this > 110
to: sysadmin
Goal: I expect the dimension name to appear somewhere in the alarm. It takes too much time to look into this dynamic alarm. For the time being, I just need to configure an alarm for each app that is defined in apps_groups.conf.
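The per-app workaround can be generated instead of typed by hand. The sketch below assumes apps_groups.conf uses `group: pattern ...` lines with `#` comments, and it reuses the thresholds from the configuration above; the helper names, the `app_cpu_<group>` naming scheme, and the sample process patterns are all made up for illustration. Each stanza names its group in both `alarm:` and `info:`, so the notification shows which app fired.

```python
def group_names(apps_groups_text: str) -> list:
    """Extract group names from apps_groups.conf-style text, in file order."""
    names = []
    for line in apps_groups_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name = line.split(":", 1)[0].strip()
        if name:
            names.append(name)
    return list(dict.fromkeys(names))  # de-duplicate, keep order


def alarm_stanza(group: str) -> str:
    """Build one static alarm for a single apps.cpu dimension."""
    return "\n".join([
        f"alarm: app_cpu_{group}",  # hypothetical naming scheme
        "on: apps.cpu",
        f"lookup: average -30s percentage of {group}",
        "units: %",
        "every: 3s",
        "warn: $this > 1",
        "crit: $this > 110",
        f"info: CPU usage of app group {group}",
        "to: sysadmin",
    ])


# Hypothetical excerpt; the process patterns are placeholders.
sample_groups = """
# EDA tools
Spectre: spectre*
Simvision: simvision*
"""

for group in group_names(sample_groups):
    print(alarm_stanza(group))
    print()
```

The output would still need to be reviewed and saved into a health configuration file by hand; group names containing characters that are not valid in alarm names are not handled here.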
We monitor unhealthy target groups in our AWS Application Load Balancers. Target groups are dynamic, they come and go, they are not a static list. Each target group status is a dimension in a chart, see below.
We have set an alert rule using foreach * that iterates over the dynamic list of target groups to check whether any group has any unhealthy targets. See below.
template: alb_unhealthy_targets
on: alb.state_unhealthy
class: Utilization
type: System
component: AWS
os: linux
hosts: *
lookup: min -10s foreach *
every: 10s
warn: $this > 0
crit: $this > 0
summary: ALB unhealthy targets
info: ALB unhealthy targets
to: sysadmin
The alert is configured to send a Slack notification. It works; however, the message received on Slack does not contain the name of the dimension that triggered the alert, see below.