Feedback or suggestion on: Export metrics to Prometheus

comete-geek · June 30, 2021, 3:29pm

Hi,
I think there’s an error in the rules example for alerting about high cpu usage. The documentation says:

alert: node_high_cpu_usage_70
expr: avg(rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m])) by (job) > 70
for: 1m
annotations:
  description: '{{ $labels.job }} on ''{{ $labels.job }}'' CPU usage is at {{ humanize $value }}%.'
  summary: CPU alert for container node '{{ $labels.job }}'

but I don't think this expression returns a correct value. Are you sure about it? The result seems very low for a CPU usage percentage. I can't work out the correct expression myself, so I'd appreciate your help with it.

Thanks!

Comete

ilyam8 · July 1, 2021, 8:14am

Hi, @comete-geek. Thanks for joining our community, hope you will enjoy being here!

> I think there’s an error in the rules example for alerting about high cpu usage.

Could you provide a link to the document you are referring to?

comete-geek · July 1, 2021, 8:28am

Hi,
yes sorry…
The rules example is in both:

https://learn.netdata.cloud/docs/agent/backends/prometheus#installing-prometheus

Section “Install nodes.yml”

ilyam8 · July 1, 2021, 10:03am

Thanks, I think you are right.

Do you use the as-collected source?

      # sources: as-collected | raw | average | sum | volume
      # default is: average
      #source: [as-collected]

I see it is not used in the “Install prometheus.yml” section, but that CPU rule expression expects it (it uses the rate function). Apart from that, the selector is supposed to be dimension!="idle".

comete-geek · July 1, 2021, 10:15am

Thanks for your answer.
I use “average”.

ilyam8 · July 1, 2021, 10:26am

@comete-geek

I'm not an expert, so re-check it, ~~but I think you need something like this~~ (wrong):

sum(avg_over_time(netdata_cpu_cpu_percentage_average{dimension!="idle"}[10m])) by (job)

comete-geek · July 1, 2021, 10:32am

Thanks, but now some values are above 100% (120 or 130).

ilyam8 · July 1, 2021, 10:39am

Yeah, my bad, netdata_cpu_cpu_percentage_average is per core.

~~That is the 10min_cpu_usage alarm~~ (wrong):

sum(avg_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job)

I hope I’ve got it right this time.

edit: No, I haven’t

comete-geek · July 1, 2021, 12:23pm

Yes, it seems good! Thanks a lot!

ilyam8 · July 1, 2021, 12:45pm

Nice! Then we need to update our docs.

comete-geek · July 1, 2021, 12:52pm

Maybe I spoke too soon.
I’ve received an alert with CPU usage at 108.2%.

ilyam8 · July 1, 2021, 12:55pm

Then I need to give it another try. Let me think.

ilyam8 · July 1, 2021, 3:00pm

@comete-geek, my next attempt is:

sum(sum_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job) / sum(count_over_time(netdata_system_cpu_percentage_average{dimension="idle"}[10m])) by (job)
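
For reference, here is a rough sketch of how this expression might slot into the alert rule quoted in the first post. It assumes the rule name, the 70 threshold, the `for` duration, and the annotations stay as in that example; the wording that actually landed in the docs may differ, so treat it as an illustration only.

```yaml
# Sketch only: rule name, threshold, and annotations are copied from the
# example quoted at the top of this thread; the merged docs fix may differ.
- alert: node_high_cpu_usage_70
  # Mean busy CPU percentage per job over the last 10 minutes:
  # the total of all busy-dimension samples divided by the number of
  # samples, counted on the "idle" dimension (assumed to yield one sample
  # per scrape per instance).
  expr: >-
    sum(sum_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job)
    /
    sum(count_over_time(netdata_system_cpu_percentage_average{dimension="idle"}[10m])) by (job)
    > 70
  for: 1m
  annotations:
    description: '{{ $labels.job }} on ''{{ $labels.job }}'' CPU usage is at {{ humanize $value }}%.'
    summary: CPU alert for container node '{{ $labels.job }}'
```

Because the numerator and denominator are both aggregated by job, the division normalizes by the number of samples collected, so the result is an average over time (and over instances, if a job has several) rather than a running total.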

comete-geek · July 1, 2021, 8:30pm

It seems good this time. Thanks a lot for your help!

ilyam8 · July 2, 2021, 10:57am

Fixed in pull request #11309 on netdata/netdata: “[docs] fix prometheus node cpu alert rule” by ilyam8.
