Feedback or suggestion on: Export metrics to Prometheus

Hi,
I think there’s an error in the rules example for alerting on high CPU usage. The documentation says:

  - alert: node_high_cpu_usage_70
    expr: avg(rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m])) by (job) > 70
    for: 1m
    annotations:
      description: '{{ $labels.job }} on ''{{ $labels.job }}'' CPU usage is at {{ humanize $value }}%.'
      summary: CPU alert for container node '{{ $labels.job }}'

but in my opinion this expression doesn’t return a sensible value. Are you sure about it? The result seems very low for a CPU usage percentage, don’t you think? I can’t work out the correct expression myself, so I’d appreciate your help with it.

Thanks !

Comete

Hi, @comete-geek. Thanks for joining our community, hope you will enjoy being here!


I think there’s an error in the rules example for alerting about high cpu usage.

Could you provide a link to the document you are referring to?

Hi,
yes sorry…
The rules example is in both:

https://learn.netdata.cloud/docs/agent/backends/prometheus#installing-prometheus

Section “Install nodes.yml”

Thanks, I think you are right.

Do you use

      # sources: as-collected | raw | average | sum | volume
      # default is: average
      #source: [as-collected]

I see it is not set in the “Install prometheus.yml” example, but that CPU rule expression expects it (it uses the rate function). Apart from that, the selector is supposed to be !="idle".
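
For context, here is a minimal sketch of where the source parameter sits in a scrape job. It assumes the layout of the prometheus.yml example on the linked page; the job name and target address are placeholders, not values from this thread. With source left unset, Netdata exports averaged values (hence the _average suffix on metric names), whereas rate() is meant for counters, which as far as I know is what as-collected exposes.

```yaml
scrape_configs:
  - job_name: 'netdata-scrape'          # placeholder job name
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
      # sources: as-collected | raw | average | sum | volume
      # Omitting this line is equivalent to "average".
      source: [average]
    honor_labels: true
    static_configs:
      # Placeholder address; replace with a real Netdata host.
      - targets: ['localhost:19999']
```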

Thanks for your answer.
I use “average”.

@comete-geek

I’m not an expert, so please re-check it, but I think you need something like this (edit: this is wrong):

sum(avg_over_time(netdata_cpu_cpu_percentage_average{dimension!="idle"}[10m])) by (job)

Thanks, but now some values are above 100% (120 or 130).

Yeah, my bad, netdata_cpu_cpu_percentage_average is per core.


This is what the 10min_cpu_usage alarm does (edit: this is wrong too):

sum(avg_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job)

I hope I’ve made it correct this time :sweat_smile:

edit: No, I haven’t :cry:


Yes, it seems good ! thanks a lot !

Nice! Then we need to update our docs.

Maybe I spoke too fast :stuck_out_tongue:
I’ve received an alert with CPU usage at 108.2%

Then I need to give it another try :grinning_face_with_smiling_eyes: Let me think.

@comete-geek my next attempt is:

sum(sum_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job) / sum(count_over_time(netdata_system_cpu_percentage_average{dimension="idle"}[10m])) by (job)
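
Dropped into the rule from the first post, this expression might look like the sketch below. The group name nodes, the threshold of 70 carried over from the original rule, and the multi-line layout of expr are assumptions made for illustration. The numerator adds up every non-idle sample in the 10-minute window, and the denominator counts the idle samples in the same window, so the result is an average per scrape rather than a sum of per-dimension averages.

```yaml
groups:
  - name: nodes                          # assumed group name
    rules:
      - alert: node_high_cpu_usage_70
        # Folded scalar: the lines below are joined with spaces into one PromQL expression.
        expr: >-
          sum(sum_over_time(netdata_system_cpu_percentage_average{dimension=~"(user|system|softirq|irq|guest)"}[10m])) by (job)
          /
          sum(count_over_time(netdata_system_cpu_percentage_average{dimension="idle"}[10m])) by (job)
          > 70
        for: 1m
        annotations:
          description: '{{ $labels.job }} on ''{{ $labels.job }}'' CPU usage is at {{ humanize $value }}%.'
          summary: CPU alert for container node '{{ $labels.job }}'
```

In PromQL the division binds more tightly than the comparison, so the ratio is computed first and then compared against 70.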

It seems good this time :slight_smile: Thanks a lot for your help !

Fixed in netdata/netdata pull request #11309, “[docs] fix prometheus node cpu alert rule”, by ilyam8.