Monitoring When a container is down

arjun_vijayan · November 26, 2021, 11:36am

Environment

I am running the Netdata agent on a CentOS Linux (version 8) system. I installed Netdata (netdata v1.31.0-532-nightly) using the installation script. There are multiple containers running on my system. I use Prometheus to store and visualize data from Netdata.

Problem/Question

I connected Netdata to Prometheus (which scrapes its targets every 20 s). I want to generate an alert from Prometheus when a container goes down. The problem is that I cannot find a metric that serves this purpose.
I tried using
netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart="cgroup_name_of_container.mem_usage"}
The problem is that this metric keeps sending (non-zero) values to Prometheus for about 5 more minutes after the container goes down. After those 5 minutes it stops sending data, at which point I can use the absent() function in Prometheus to trigger an alert. But 5 minutes is too long a delay for me. Is there any metric I can monitor to trigger an alert as soon as the container goes down?
I also tried
netdata_cgroup_cpu_percentage_average{chart="cgroup_name_of_container.cpu", family="cpu", dimension="user"}
which also goes absent only 5 minutes after the container is stopped.

What I expected to happen

The metric should go absent as soon as the container goes down.

The cAdvisor way

I was previously using cAdvisor. There I used the metric container_start_time_seconds{name="name_of_container"}, which goes absent on the next Prometheus scrape after the container stops.

ilyam8 · November 29, 2021, 4:15pm

Hi, @arjun_vijayan.

Netdata stops exposing container metrics as soon as the container is stopped. The 5-minute interval you mentioned is the result of the Prometheus --query.lookback-delta configuration option.

A time-series goes “stale” when it has no samples in the last 5 minutes.

Query engine doesn’t return “stale” metrics.

By "a container is down", do you mean exited Docker containers?

ilyam8 · November 29, 2021, 4:29pm

@arjun_vijayan I haven't tried it, but check absent_over_time().
There you can use a range vector with [1m:].
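
For reference, here is a minimal sketch of what such a check could look like, reusing the metric and the placeholder chart name from the first post. One point to be aware of: [1m] is a plain range selector, which only considers samples actually stored in the last minute, whereas [1m:] is subquery syntax, which re-evaluates the inner selector as an instant query at each step and so remains subject to the lookback delta. This has not been tested against the setup described in this thread.

```promql
# Returns 1 when no sample of the series was stored during the last minute.
# chart name is the placeholder used earlier in the thread.
absent_over_time(
  netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart="cgroup_name_of_container.mem_usage"}[1m]
)
```

With a 20 s scrape interval, a 1m range should cover roughly three scrapes, so a single missed scrape alone would not trigger the expression.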

arjun_vijayan · November 30, 2021, 5:53am

Thank you for your reply. Changing the --query.lookback-delta configuration worked. (I am still confused about why the cAdvisor metric container_start_time_seconds is not affected by this setting, though.)
By "container is down" I mean exited Docker containers, yes.

arjun_vijayan · November 30, 2021, 5:55am

I tried absent_over_time(), but I do not think it filters out the stale data. So I think the best solution is to change the --query.lookback-delta option.

ilyam8 · November 30, 2021, 11:56am

@arjun_vijayan I doubt changing --query.lookback-delta is a good idea. At least it is not recommended.

(query.staleness-delta was renamed to query.lookback-delta)

This 5 minutes is controlled by the -query.staleness-delta flag. Changing it is rarely a good idea.

absent_over_time()

My idea was to set a time range, e.g. [1m:]. I will test it locally.

cadvisor metric (container_start_time_seconds)

We need to check how cAdvisor calculates this metric.

ilyam8 · November 30, 2021, 5:22pm

@arjun_vijayan I did some tests, and it seems that combining time() and timestamp() will do. Note that it shows all missing containers (both stopped and deleted).

# containers missing for 60+ seconds
time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"}) > 60

arjun_vijayan · December 1, 2021, 4:52am

This is an excellent solution. However,

time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"}) > 60

will be absent after the staleness time. This means that if a container is stopped and not resumed, I will stop receiving alerts once the staleness time has passed. So I modified the query as follows (unfortunately, it involves hardcoding the name of the container).

(((time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"})) > 60) or (absent(timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~"cgroup_name_of_container.mem_usage"})) >= 0)) >= 0
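
The combined expression can probably be written more compactly. The sketch below assumes the same metric and placeholder chart name; it is a suggested simplification, not something tested in this thread. The chart=~".*" matcher matches every series and can be dropped, absent() already returns 1 so the >= 0 filters do not change the result, and wrapping the selector in timestamp() inside absent() is not needed. With an equality matcher (chart="...") instead of a regex, absent() also carries the chart label into its result, which makes the alert easier to identify.

```promql
# First branch: any container whose last sample is older than 60 s
# (only visible until the series goes stale).
# Second branch: keeps returning 1 for the named container once its series is gone.
(time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram"}) > 60)
or
absent(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart="cgroup_name_of_container.mem_usage"})
```

As in the original query, the second branch still needs one absent() clause per container that should keep alerting after the staleness time.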
