Health alarm created in a Netdata Docker node is not listed in Netdata Cloud -> Alert Configurations

Hello everyone,

I created an alert to monitor the CPU temperature of a node that has Netdata installed in a Docker container. However, it does not appear in the list shown in the “Alert Configurations” tab in Netdata Cloud.

Additional details:

  • The container has a volume bind-mounted to a directory on the host, i.e. /host/path/to/netdata/config:/etc/netdata
  • I created the alert inside the running container using docker exec, and using the edit-config command
  • I ran the netdatacli reload-health command to enable the alert inside the container.

Alert:

alarm: cpu_temp
on: sensors.cpu_thermal-virtual-_temperature
lookup: average -1m unaligned
units: Celsius
every: 10s
warn: $this > (($status >= $WARNING) ? (40) : (50))
crit: $this > (($status == $CRITICAL) ? (50) : (75))
info: CPU temperature monitoring

Any guidance in the right direction would be appreciated.

Regards,

J.

I’d first make sure that the agent itself accepted the new alert, by requesting localhost:19999/api/v1/alarms?all from within the container (or from a browser, your call). If your alert doesn’t appear in that list, perhaps it was rejected for some reason, which should have been logged in error.log.
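The check described above could be scripted roughly as follows. This is only a sketch: it assumes the agent answers on the default port 19999, and the helper names find_alarm and fetch_alarms are made up for illustration. It relies on the keys of the "alarms" object having the form "<chart id>.<alarm name>", as they do in the alert list posted further down this thread.

```python
import json
from urllib.request import urlopen


def find_alarm(alarms_response: dict, alarm_name: str) -> list:
    """Return the keys of the "alarms" object that belong to alarm_name.

    Keys are expected to look like "<chart id>.<alarm name>".
    """
    suffix = "." + alarm_name
    return [key for key in alarms_response.get("alarms", {}) if key.endswith(suffix)]


def fetch_alarms(base_url: str = "http://localhost:19999") -> dict:
    """Fetch the full alarm list from the agent (hypothetical helper)."""
    # Assumption: the agent is reachable at base_url from where this runs.
    with urlopen(base_url + "/api/v1/alarms?all", timeout=5) as resp:
        return json.load(resp)
```

Inside the container, find_alarm(fetch_alarms(), "cpu_temp") would return an empty list if the agent did not load the alert.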

There’s a chance that health reload isn’t clever enough to send the newly discovered alert to the cloud. Your container should now have saved your alert to the mounted directory, so you should be able to restart the container safely. Try that to see if it appears. If that’s the case, then we’ll need to ensure that a health reload also does this properly. cc @Manolis_Vasilakis

Hi Christopher, thanks for your quick reply 🙌

It was definitely rejected for some reason: it is not listed in localhost:19999/api/v1/alarms?all. However, I don’t see anything in /var/log/netdata/error.log inside the container, or in the docker logs either 🙁

error.log: empty

docker logs:

2022-07-22 19:06:04: netdata INFO  : MAIN : Command Clients = 1
2022-07-22T19:06:04.834536439Z
2022-07-22 19:06:04: netdata INFO  : MAIN : EOF found in command pipe.
2022-07-22 19:06:04: netdata INFO  : MAIN : COMMAND: Reloading HEALTH configuration.
2022-07-22 19:06:05: netdata INFO  : MAIN : COMMAND: Sending reply: "X0"
2022-07-22 19:06:05: netdata INFO  : MAIN : Command Clients = 0
2022-07-22T19:06:05.779725573Z

This is my “upgraded” alert config:

alarm: cpu_temp
on: sensors.cpu_thermal-virtual-_temperature
class: Utilization
type: System
component: CPU
os: linux
hosts: *
lookup: average -1m unaligned foreach user, system
units: Celsius
every: 10s
warn: $this > (($status >= $WARNING) ? (40) : (50))
crit: $this > (($status == $CRITICAL) ? (50) : (75))
info: CPU temperature monitoring
to: sysadmin

Hi @Juanra !

One possible reason for the alert not loading is that the chart named in the on line is not correct. Can you check that the chart is actually present in localhost:19999/api/v1/charts?
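A small script can make this check less error-prone, since a single dropped character in a chart id is easy to miss by eye. The sketch below assumes the /api/v1/charts response keeps its charts under a top-level "charts" key; check_chart is a hypothetical helper name, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib


def check_chart(charts_response: dict, chart_id: str) -> tuple:
    """Return (True, []) if chart_id exists, else (False, similar chart ids)."""
    # Assumption: charts are keyed by id under a top-level "charts" object.
    charts = charts_response.get("charts", {})
    if chart_id in charts:
        return (True, [])
    # Suggest up to three ids that differ only slightly from the one given.
    suggestions = difflib.get_close_matches(chart_id, list(charts), n=3, cutoff=0.8)
    return (False, suggestions)
```

If the id used in the on line is slightly off, the second element of the result lists the chart ids it most likely should have been.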

Hi Manolis, thanks for your message.

The chart is listed in the localhost:19999/api/v1/charts API.

"sensors.cpu_thermal-virtual-0_temperature": {
	"id": "sensors.cpu_thermal-virtual-0_temperature",
	"name": "sensors.cpu_thermal-virtual-0_temperature",
	"type": "sensors",
	"family": "temperature",
	"context": "sensors.temperature",
	"title": "Temperature (sensors.cpu_thermal-virtual-0_temperature)",
	"priority": 60000,
	"plugin": "python.d.plugin",
	"module": "sensors",
	"units": "Celsius",
	"data_url": "/api/v1/data?chart=sensors.cpu_thermal-virtual-0_temperature",
	"chart_type": "line",
	"duration": 142520,
	"first_entry": 1658463974,
	"last_entry": 1658606493,
	"update_every": 1,
	"dimensions": {
		"cpu_thermal-virtual-0_temp1": { "name": "temp1" }
	},
	"chart_variables": {},
	"green": null,
	"red": null,
	"alarms": {},
	"chart_labels": {}
}

Regards,

J.

Okay, nice. Can you then try using on: sensors.cpu_thermal-virtual-0_temperature in the alert?

For some reason, the 0 was trimmed from the information I gave you. To avoid any confusion, though: even with the correct name, the alert does not appear in the localhost:19999/api/v1/alarms?all API.

Editing '/etc/netdata/health.d/cpu_temp.conf' ...

alarm: cpu_temp
on: sensors.cpu_thermal-virtual-0_temperature
class: Utilization
type: System
component: CPU
os: linux
hosts: *
lookup: average -1m unaligned foreach user, system
units: Celsius
every: 10s
warn: $this > (($status >= $WARNING) ? (40) : (50))
crit: $this > (($status == $CRITICAL) ? (50) : (75))
info: CPU temperature monitoring
to: sysadmin

I’ve been checking the logs again but I don’t see anything related to the sensors, except this line:

2022-07-24 22:31:02: charts.d: INFO: sensors: is disabled. Add a line with sensors=force in '/etc/netdata/charts.d.conf' to enable it (or remove the line that disables it).

I don’t know if the alert can be affected by the fact that I am using the python.d.plugin instead of lm-sensors.

Alert list:

{
	"hostname": "my-host-name",
	"latest_alarm_log_unique_id": xxxxxx,
	"status": true,
	"now": xxxxxxxxx,
	"alarms": {
		"ipv4.sockstat_tcp_mem.tcp_memory": {
		},
		"netfilter.conntrack_sockets.netfilter_conntrack_full": {
		},
		"system.ipc_semaphores.semaphores_used": {
		},
		"system.ipc_semaphore_arrays.semaphore_arrays_used": {
		},
		"system.entropy.lowest_entropy": {
		},
		"ipv4.tcpsock.tcp_connections": {
		},
		"ipv4.sockstat_tcp_sockets.tcp_orphans": {
		},
		"system.active_processes.active_processes": {
		},
		"netdata.dbengine_global_errors.10min_dbengine_global_fs_errors": {
		},
		"netdata.dbengine_global_errors.10min_dbengine_global_io_errors": {
		},
		"netdata.dbengine_global_errors.10min_dbengine_global_flushing_warnings": {
		},
		"netdata.dbengine_long_term_page_stats.10min_dbengine_global_flushing_errors": {
		},
		"ipv4.tcphandshake.1m_ipv4_tcp_resets_sent": {
		},
		"ipv4.tcphandshake.10s_ipv4_tcp_resets_sent": {
		},
		"ipv4.tcphandshake.1m_ipv4_tcp_resets_received": {
		},
		"ipv4.tcphandshake.10s_ipv4_tcp_resets_received": {
		},
		"ipv4.udperrors.1m_ipv4_udp_receive_buffer_errors": {
		},
		"ipv4.udperrors.1m_ipv4_udp_send_buffer_errors": {
		},
		"system.ram.ram_in_use": {
		},
		"mem.available.ram_available": {
		},
		"mem.oom_kill.oom_kill": {
		},
		"system.clock_sync_state.system_clock_sync_state": {
		},
		"system.softnet_stat.1min_netdev_backlog_exceeded": {
		},
		"system.softnet_stat.1min_netdev_budget_ran_outs": {
		},
		"system.load.load_cpu_number": {
		},
		"system.load.load_average_15": {
		},
		"system.load.load_average_5": {
		},
		"system.load.load_average_1": {
		},
		"cgroup_watchtower.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_plex.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_traefik.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_gluetun.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_bazarr.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_tautulli.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_searcharr.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_netdata.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_transmission.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_radarr.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_sonarr.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_jackett.cpu_limit.cgroup_10min_cpu_usage": {
		},
		"cgroup_watchtower.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_plex.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_traefik.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_gluetun.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_bazarr.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_tautulli.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_searcharr.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_netdata.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_transmission.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_radarr.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_sonarr.mem_usage.cgroup_ram_in_use": {
		},
		"cgroup_jackett.mem_usage.cgroup_ram_in_use": {
		},
		"disk_util.sda.10min_disk_utilization": {
		},
		"disk_backlog.sda.10min_disk_backlog": {
		},
		"disk_util.mmcblk0.10min_disk_utilization": {
		},
		"disk_backlog.mmcblk0.10min_disk_backlog": {
		},
		"net_packets.veth5dcdd7d.inbound_packets_dropped_ratio": {
		},
		"net_packets.veth5dcdd7d.outbound_packets_dropped_ratio": {
		},
		"net_packets.veth5dcdd7d.1m_received_packets_rate": {
		},
		"net_packets.veth5dcdd7d.10s_received_packets_storm": {
		},
		"netdata.runtime_sensors.python.d_job_last_collected_secs": {
		},
		"net.veth5dcdd7d.interface_speed": {
		},
		"net.veth5dcdd7d.1m_received_traffic_overflow": {
		},
		"net.veth5dcdd7d.1m_sent_traffic_overflow": {
		},
		"net_packets.veth3847448.inbound_packets_dropped_ratio": {
		},
		"net_packets.veth3847448.outbound_packets_dropped_ratio": {
		},
		"net_packets.veth3847448.1m_received_packets_rate": {
		},
		"net_packets.veth3847448.10s_received_packets_storm": {
		},
		"net.veth3847448.interface_speed": {
		},
		"net.veth3847448.1m_received_traffic_overflow": {
		},
		"net.veth3847448.1m_sent_traffic_overflow": {
		},
		"net_packets.vetha0072d0.inbound_packets_dropped_ratio": {
		},
		"net_packets.vetha0072d0.outbound_packets_dropped_ratio": {
		},
		"net_packets.vetha0072d0.1m_received_packets_rate": {
		},
		"net_packets.vetha0072d0.10s_received_packets_storm": {
		},
		"net.vetha0072d0.interface_speed": {
		},
		"net.vetha0072d0.1m_received_traffic_overflow": {
		},
		"net.vetha0072d0.1m_sent_traffic_overflow": {
		},
		"net_packets.veth6bfcc9c.inbound_packets_dropped_ratio": {
		},
		"net_packets.veth6bfcc9c.outbound_packets_dropped_ratio": {
		},
		"net_packets.veth6bfcc9c.1m_received_packets_rate": {
		},
		"net_packets.veth6bfcc9c.10s_received_packets_storm": {
		},
		"net.veth6bfcc9c.interface_speed": {
		},
		"net.veth6bfcc9c.1m_received_traffic_overflow": {
		},
		"net.veth6bfcc9c.1m_sent_traffic_overflow": {
		},
		"net_packets.vetha9b0eed.inbound_packets_dropped_ratio": {
		},
		"net_packets.vetha9b0eed.outbound_packets_dropped_ratio": {
		},
		"net_packets.vetha9b0eed.1m_received_packets_rate": {
		},
		"net_packets.vetha9b0eed.10s_received_packets_storm": {
		},
		"net.vetha9b0eed.interface_speed": {
		},
		"net.vetha9b0eed.1m_received_traffic_overflow": {
		},
		"net.vetha9b0eed.1m_sent_traffic_overflow": {
		},
		"net_packets.vethbe304c3.inbound_packets_dropped_ratio": {
		},
		"net_packets.vethbe304c3.outbound_packets_dropped_ratio": {
		},
		"net_packets.vethbe304c3.1m_received_packets_rate": {
		},
		"net_packets.vethbe304c3.10s_received_packets_storm": {
		},
		"net.vethbe304c3.interface_speed": {
		},
		"net.vethbe304c3.1m_received_traffic_overflow": {
		},
		"net.vethbe304c3.1m_sent_traffic_overflow": {
		},
		"net_packets.br-823d5e3bc379.inbound_packets_dropped_ratio": {
		},
		"net_packets.br-823d5e3bc379.outbound_packets_dropped_ratio": {
		},
		"net_packets.br-823d5e3bc379.1m_received_packets_rate": {
		},
		"net_packets.br-823d5e3bc379.10s_received_packets_storm": {
		},
		"net.br-823d5e3bc379.interface_speed": {
		},
		"net.br-823d5e3bc379.1m_received_traffic_overflow": {
		},
		"net.br-823d5e3bc379.1m_sent_traffic_overflow": {
		},
		"net_packets.vethe1095af.inbound_packets_dropped_ratio": {
		},
		"net_packets.vethe1095af.outbound_packets_dropped_ratio": {
		},
		"net_packets.vethe1095af.1m_received_packets_rate": {
		},
		"net_packets.vethe1095af.10s_received_packets_storm": {
		},
		"disk_inodes._var_log_netdata.disk_inode_usage": {
		},
		"disk_space._var_log_netdata.disk_space_usage": {
		},
		"net.vethe1095af.interface_speed": {
		},
		"net.vethe1095af.1m_received_traffic_overflow": {
		},
		"net.vethe1095af.1m_sent_traffic_overflow": {
		},
		"disk_inodes._var_lib_netdata.disk_inode_usage": {
		},
		"disk_space._var_lib_netdata.disk_space_usage": {
		},
		"net_packets.br-92a323ac2481.inbound_packets_dropped_ratio": {
		},
		"net_packets.br-92a323ac2481.outbound_packets_dropped_ratio": {
		},
		"net_packets.br-92a323ac2481.1m_received_packets_rate": {
		},
		"net_packets.br-92a323ac2481.10s_received_packets_storm": {
		},
		"disk_inodes._var_cache_netdata.disk_inode_usage": {
		},
		"disk_space._var_cache_netdata.disk_space_usage": {
		},
		"disk_inodes._etc_netdata.disk_inode_usage": {
		},
		"net.br-92a323ac2481.interface_speed": {
		},
		"net.br-92a323ac2481.1m_received_traffic_overflow": {
		},
		"net.br-92a323ac2481.1m_sent_traffic_overflow": {
		},
		"disk_space._etc_netdata.disk_space_usage": {
		},
		"net_drops.eth0.inbound_packets_dropped": {
		},
		"net_drops.eth0.outbound_packets_dropped": {
		},
		"net_packets.eth0.inbound_packets_dropped_ratio": {
		},
		"net_packets.eth0.outbound_packets_dropped_ratio": {
		},
		"net_packets.eth0.1m_received_packets_rate": {
		},
		"net_packets.eth0.10s_received_packets_storm": {
		},
		"disk_inodes._.disk_inode_usage": {
		},
		"disk_space._.disk_space_usage": {
		},
		"net.eth0.interface_speed": {
		},
		"net.eth0.1m_received_traffic_overflow": {
		},
		"net.eth0.1m_sent_traffic_overflow": {
		},
		"system.cpu.10min_cpu_usage": {
		},
		"system.cpu.10min_cpu_iowait": {
		},
		"system.cpu.20min_steal_cpu": {
		}
	}
}

Btw, is there a safe way I can share my logs “as-is” with you?

Hi @Juanra !

Yes, you can send the logs to manolis @ netdata.cloud.

However, looking again at your conf, I see that I missed this line:

lookup: average -1m unaligned foreach user, system

user and system are not dimensions of the chart sensors.cpu_thermal-virtual-0_temperature. You will need to use that chart’s own dimensions here instead, i.e. most likely:

lookup: average -1m unaligned of temp1

Can you try that? We’ll also look into this more generally; it seems that we need some stock alerts for sensors…

Thanks, Manolis!

It’s working now.

Would you mind explaining how you got to this? I’m learning, so I want to understand it to avoid future mistakes in my configs.

Also, I didn’t find anything pointing to this in the logs (I will send you the files).

Many thanks for everything! I really appreciate it.

Hi @Juanra !

Glad it worked!

In a nutshell, an alert first needs to match the chart, and then the dimensions of that chart whose values you want to use to calculate the alert.

However, it’s a bit more complicated than that, because besides alarms you can also have templates, i.e. alerts defined to work on all charts created for the same purpose but for multiple, let’s say, “devices” (e.g. you can have the temperature monitored for the CPU, NVMe drives, the motherboard, etc.).
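For the plain alarm case (not templates), the dimension part of that matching can be checked before reloading health. The following is a rough sketch rather than Netdata’s own parser: lookup_dimensions and missing_dimensions are hypothetical helpers, names are treated literally (no pattern support), and the chart argument is assumed to be one chart entry as returned by /api/v1/charts.

```python
def lookup_dimensions(lookup_line: str) -> list:
    """Extract the dimension names that follow 'of' or 'foreach' in a lookup line."""
    value = lookup_line.split(":", 1)[-1]      # drop a leading "lookup:" if present
    tokens = value.replace(",", " ").split()   # commas and spaces both separate names
    dims = []
    collecting = False
    for token in tokens:
        if token in ("of", "foreach"):
            collecting = True                  # everything after the keyword is a name
        elif collecting:
            dims.append(token)
    return dims


def missing_dimensions(lookup_line: str, chart: dict) -> list:
    """Return the dimensions named in lookup_line that the chart does not have."""
    known = set()
    for dim_id, dim in chart.get("dimensions", {}).items():
        known.add(dim_id)                      # e.g. "cpu_thermal-virtual-0_temp1"
        known.add(dim.get("name", dim_id))     # e.g. "temp1"
    return [d for d in lookup_dimensions(lookup_line) if d not in known]
```

With the sensor chart from this thread, the original lookup line reports user and system as missing, while the line using temp1 reports nothing.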

I will point you to our documentation: “Configure health alarms | Learn Netdata” and “Health configuration reference | Learn Netdata”, where there is a complete breakdown of all the available configuration entries.

We would be very interested to hear your perspective, especially if you find our documentation lacking in any way, so we can improve!

Thank you very much!
