Netdata Community

Monitoring When a container is down

Environment

I am running the Netdata agent on a CentOS Linux (version 8) system. I installed Netdata (netdata v1.31.0-532-nightly) using the installation script. There are multiple containers running on my system. I use Prometheus to store and visualize data from Netdata.

Problem/Question

I connected Netdata to Prometheus (which scrapes its targets every 20 s). I want to generate an alert from Prometheus when a container goes down. The problem is that I cannot find a metric that serves this purpose.
I tried using
netdata_cgroup_mem_usage_MiB_average{dimension="ram",chart="cgroup_name_of_container.mem_usage"}
The problem is that this metric keeps returning non-zero values in Prometheus for about 5 more minutes after the container goes down. After 5 minutes the metric returns no more data, at which point I can use the absent() function in Prometheus to trigger an alert. But 5 minutes is too much of a delay for me. Is there any metric I can monitor to trigger an alert when the container goes down?
I also tried
netdata_cgroup_cpu_percentage_average{chart="cgroup_name_of_container.cpu",family="cpu",dimension="user"}
which also goes absent only 5 minutes after the container is stopped.

What I expected to happen

The metric should become absent as soon as the container goes down.

The cAdvisor way

I was previously using cAdvisor. There I used the metric container_start_time_seconds{name="name_of_container"}, and this particular metric disappears on the next Prometheus scrape.

Hi, @arjun_vijayan.

Netdata stops exposing container metrics as soon as the container is stopped. The 5-minute interval you mentioned is the result of the Prometheus --query.lookback-delta configuration option.

A time-series goes “stale” when it has no samples in the last 5 minutes.

Query engine doesn’t return “stale” metrics.
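To make the 5-minute delay concrete, here is a minimal Python sketch of how an instant query might pick a value under a lookback window. It is a simplified model, not Prometheus code: the function name instant_value and the sample data are hypothetical, the 300 s default mirrors the 5 minutes quoted above, and staleness markers are deliberately ignored.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)

LOOKBACK_DELTA = 300.0  # assumption: the 5-minute default discussed in this thread


def instant_value(samples: List[Sample], eval_time: float,
                  lookback: float = LOOKBACK_DELTA) -> Optional[Sample]:
    """Return the newest sample inside the lookback window, or None if there is none."""
    in_window = [s for s in samples if eval_time - lookback < s[0] <= eval_time]
    return max(in_window, key=lambda s: s[0]) if in_window else None


# Hypothetical series scraped every 20 s; the container stops after t=100.
samples = [(float(t), 42.0) for t in range(0, 101, 20)]

# One minute after the stop, the last sample is still returned.
print(instant_value(samples, 160.0))
# More than five minutes after the last sample, nothing is returned,
# which is the point where absent() would start to fire.
print(instant_value(samples, 401.0))
```

In this model, lowering the lookback value shortens the delay, which matches the effect of changing --query.lookback-delta.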



You mean exited docker containers?

@arjun_vijayan I haven’t tried, but check absent_over_time()
There you can use a range vector with [1m:].

Thank you for your reply. Changing the --query.lookback-delta configuration worked. (I am still confused about why the cAdvisor metric (container_start_time_seconds) is not affected by this configuration, though.)
And yes, by "container is down" I mean exited Docker containers.

I tried absent_over_time(), but I do not think it will filter out the stale data. So I think the best solution is to change the --query.lookback-delta option.

@arjun_vijayan I doubt changing --query.lookback-delta is a good idea. At least it is not recommended.

(query.staleness-delta was renamed to query.lookback-delta)

This 5 minutes is controlled by the -query.staleness-delta flag. Changing it
is rarely a good idea.

absent_over_time()

My idea was to set a time range, e.g. [1m:]. I will test it locally.

cadvisor matrics (container_start_time_seconds)

We need to check how cAdvisor calculates this metric.

@arjun_vijayan I did some tests, and it seems that combining time() and timestamp() will do. But it shows all the missing containers (both stopped and deleted).

# containers missing for 60+ seconds
time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"}) > 60

This is an excellent solution. However,

time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"}) > 60

will be absent after the staleness time. This means that if a container is stopped and not resumed, I will no longer receive alerts once the staleness time has passed. So I modified the query to the following (unfortunately, it involves hardcoding the name of the container).

(((time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"})) > 60) or (absent(timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~"cgroup_name_of_container.mem_usage"})) >= 0)) >= 0
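The reasoning behind combining the two branches can be sketched in Python for a single container. This is only a toy model of the logic, not how Prometheus evaluates the query: the function alert_fires and its parameters are hypothetical, and the 60 s and 300 s values are taken from the threshold and staleness time discussed above.

```python
def alert_fires(now: float, last_sample_ts: float, max_age: float = 60.0,
                lookback: float = 300.0, with_absent: bool = True) -> bool:
    """Toy model of the combined alert condition for one container."""
    age = now - last_sample_ts
    # The series is only returned while its last sample is inside the lookback window.
    visible = age < lookback
    # Branch 1: time() - timestamp(...) > 60, which needs a visible series.
    age_branch = visible and age > max_age
    # Branch 2: absent(...), which is true only once the series is no longer returned.
    absent_branch = with_absent and not visible
    return age_branch or absent_branch


last_seen = 100.0  # hypothetical timestamp of the last scraped sample

print(alert_fires(130.0, last_seen))                      # sample is 30 s old
print(alert_fires(200.0, last_seen))                      # sample is 100 s old
print(alert_fires(1000.0, last_seen, with_absent=False))  # past the staleness time, no absent()
print(alert_fires(1000.0, last_seen))                     # past the staleness time, with absent()
```

Without the absent() branch the alert stops firing once the series falls out of the lookback window; with it, the alert keeps firing for as long as the container stays down.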