I am running the Netdata agent on a CentOS Linux 8 system. I installed Netdata (v1.31.0-532-nightly) using the installation script. There are multiple containers running on the system. I use Prometheus to store and visualize data from Netdata.
Problem/Question
I connected Netdata to Prometheus (which scrapes its targets every 20 s). I want to generate an alert from Prometheus when a container goes down. The problem is that I cannot find a metric that serves this purpose.
I tried using
netdata_cgroup_mem_usage_MiB_average{dimension="ram",chart="cgroup_name_of_container.mem_usage"}
The problem is that this metric keeps sending non-zero values to Prometheus for about 5 more minutes after the container goes down. After those 5 minutes it stops sending data, at which point I can use the absent() function in Prometheus to trigger an alert. But 5 minutes is too long a delay for me. Is there a metric I can monitor to trigger an alert as soon as the container goes down?
I also tried
netdata_cgroup_cpu_percentage_average{chart="cgroup_name_of_container.cpu",family="cpu",dimension="user"}
which also goes absent only about 5 minutes after the container is stopped.
What I expected to happen
The metric should become absent as soon as the container goes down.
The cAdvisor way
I was previously using cAdvisor. There I used the metric container_start_time_seconds{name="name_of_container"}, and this particular metric disappears on the next Prometheus scrape.
Netdata stops exposing container metrics as soon as the container is stopped. The 5-minute interval you mentioned is the result of the Prometheus --query.lookback-delta option, which defaults to 5 minutes.
A time-series goes “stale” when it has no samples in the last 5 minutes.
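To make the lookback behaviour concrete, here is a minimal sketch in Python of how an instant query picks a sample. It is a simplified model, not Prometheus code: it ignores staleness markers, and the function name latest_visible_sample and its parameters are made up for illustration. Only the 5-minute default and the 20 s scrape interval come from the thread.

```python
from bisect import bisect_right
from typing import Optional, Sequence

DEFAULT_LOOKBACK = 300.0  # seconds; default of --query.lookback-delta


def latest_visible_sample(sample_times: Sequence[float], eval_time: float,
                          lookback: float = DEFAULT_LOOKBACK) -> Optional[float]:
    """Return the timestamp of the sample an instant query would pick, or None.

    Simplified model: the newest sample at or before eval_time is used,
    provided it is younger than `lookback`. Staleness markers are ignored.
    sample_times must be sorted in ascending order.
    """
    idx = bisect_right(sample_times, eval_time)
    if idx == 0:
        return None  # no sample exists yet at eval_time
    newest = sample_times[idx - 1]
    if eval_time - newest >= lookback:
        return None  # too old: the series looks absent to the query
    return newest


# Container scraped every 20 s; the last sample is at t=100, then it stops.
samples = [0, 20, 40, 60, 80, 100]
print(latest_visible_sample(samples, 350))               # 100 (still visible)
print(latest_visible_sample(samples, 350, lookback=60))  # None (looks absent)
```

With the default window the old sample at t=100 is still returned 250 s later, which is why absent() stays silent. Shrinking the window makes the series disappear sooner, at the cost of being less tolerant of missed scrapes.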
Thank you for your reply. Changing the --query.lookback-delta option worked. (I am still confused about why the cAdvisor metric container_start_time_seconds is not affected by this option, though.)
By "container is down" I mean exited Docker containers, yes.
I tried absent_over_time(), but I do not think it filters out the stale data. So I think the best solution is to change the --query.lookback-delta option.
will be absent after the staleness time. This means that if a container is stopped and not resumed, I will no longer receive alerts after the staleness time. So I modified the query to the following (unfortunately, it involves hardcoding the name of the container).
(((time() - timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~".*"})) > 60) or (absent(timestamp(netdata_cgroup_mem_usage_MiB_average{dimension="ram", chart=~"cgroup_name_of_container.mem_usage"})) >= 0)) >= 0
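The logic of the two branches in this query can be sketched in Python as follows. This is only a simplified model of the expression, not something Prometheus runs: the function container_down and its parameters are hypothetical, the 60 s threshold is taken from the query, and the 300 s lookback is the Prometheus default.

```python
from typing import Optional


def container_down(now: float, last_sample_ts: Optional[float],
                   lookback: float = 300.0, max_age: float = 60.0) -> bool:
    """Simplified model of the two-branch alert expression.

    last_sample_ts is the timestamp of the newest sample for the container,
    or None if the series never existed.
    """
    # absent(...) branch: no sample inside the lookback window, so the
    # selector returns nothing and absent() yields a value.
    if last_sample_ts is None or now - last_sample_ts >= lookback:
        return True
    # time() - timestamp(...) > max_age branch: the series is still returned
    # by the lookback, but its newest sample is older than the threshold.
    return now - last_sample_ts > max_age


print(container_down(1000, 990))   # False: fresh sample, container is up
print(container_down(1000, 900))   # True: sample is 100 s old
print(container_down(1000, 500))   # True: outside lookback, series absent
```

The first branch covers the window in which the old sample is still returned, and the absent() branch takes over once the series has dropped out, so the alert keeps firing for a container that stays stopped.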