Flaky cgroups graphs and hostname on Balena

Environment

We want to set up a fleet of Balena devices running Netdata.
We currently have two issues.

Problem/Question

Issue 1:

Our deployment has a few containers, including the netdata container. When we deploy to a new device, the cgroups mostly do not show up on the dashboard. After some reboots and restarts of the netdata container they start to appear, but they can disappear again after a reboot or container restart.

Issue 2:

The hostname shown at the top of the netdata dashboard is a random number on Balena. When the container restarts, a new number is generated. We know we can set it with the docker hostname argument, but we want to use Balena’s device name for each device. I tried to set it inside the launch.sh script before netdata starts, but that doesn’t seem to work.
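For illustration, one way such a launch.sh step could look. This is only a sketch under two assumptions that are not confirmed anywhere in this thread: that Balena exposes the device name to the container as the `BALENA_DEVICE_NAME_AT_INIT` environment variable, and that Netdata picks up the `hostname` option of the `[global]` section when it is passed with `-W set`, in the same way the `web` options are passed in our current command line. The helper name `pick_hostname` is made up for this example.

```shell
#!/bin/sh
# Sketch for launch.sh: choose the name shown on the Netdata dashboard.
# ASSUMPTION: Balena provides the device name in BALENA_DEVICE_NAME_AT_INIT.
pick_hostname() {
    # Fall back to the container's own hostname when the variable is unset or empty.
    name="${BALENA_DEVICE_NAME_AT_INIT:-$(hostname)}"
    # Replace every character that is not a letter, digit, dot or dash with '-'.
    printf '%s' "$name" | tr -c 'A-Za-z0-9.-' '-'
}

# Usage (commented out so that sourcing this file has no side effects).
# ASSUMPTION: "-W set global hostname NAME" overrides [global] hostname.
# exec /usr/sbin/netdata -u netdata -D -s /host -p 19999 \
#     -W set global hostname "$(pick_hostname)"
```

Note that a value read at container start would not follow a later rename of the device until the container restarts.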

I created a minimal deployment for testing on GitHub: mrtncls/netdata-on-balena-test

What I expected to happen

The cgroups should always be shown for each container.
The hostname should be the Balena device name.

cgroup discovery can take quite a long time (up to several minutes), especially when the monitored host has a lot of cgroups (counting all of them, not just the ones visible on the dashboard). There is also an interval between discovery runs and a limit on the total number of cgroups that are processed:

[plugin:cgroups]
    check for new cgroups every = 10
    max cgroups to allow = 1000

Try not to reboot the Netdata container but to refresh the dashboard.
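To see whether the total number of cgroups is anywhere near the `max cgroups to allow` limit, a rough count can be taken from inside the Netdata container. This is a minimal sketch, assuming the host's `/sys` is mounted at `/host/sys` (the command line in this thread uses `-s /host`, but the mount itself is an assumption) and that `find` supports `-mindepth`. The function name `count_cgroups` is made up for this example, and on cgroup v1 the number includes the per-controller directories, so treat it as an estimate.

```shell
#!/bin/sh
# Count the directories below a cgroup root; each directory is one cgroup.
# ASSUMPTION: the host cgroup filesystem is visible at /host/sys/fs/cgroup.
count_cgroups() {
    root="${1:-/host/sys/fs/cgroup}"
    # -mindepth 1 leaves the root directory itself out of the count.
    find "$root" -mindepth 1 -type d 2>/dev/null | wc -l | tr -d ' '
}

# Usage (commented out so that sourcing this file has no side effects):
# echo "cgroups on host: $(count_cgroups)"
```

A count in the tens or low hundreds would suggest that neither the limit nor the discovery time explains the missing charts.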

The container has now been running for 25 minutes.
I did a Ctrl+F5 refresh in Chrome, but there is still no data.

When it did work after a reboot, the cgroups were available almost immediately.

The log shows that the plugin started, but nothing else related to cgroups.
Can I activate verbose logging for this plugin?
Which log lines should I look for?

2022-01-27 16:25:23: netdata INFO  : PLUGIN[tc] : thread created with task id 223
2022-01-27 16:25:23: netdata INFO  : PLUGIN[tc] : set name of thread 223 to PLUGIN[tc]
2022-01-27 16:25:23: netdata INFO  : PLUGIN[cgroups] : thread created with task id 222
2022-01-27 16:25:23: netdata INFO  : PLUGIN[cgroups] : set name of thread 222 to PLUGIN[cgroups]
2022-01-27 16:25:23: netdata INFO  : ACLK_Main : thread created with task id 226
2022-01-27 16:25:23: netdata INFO  : ACLK_Main : set name of thread 226 to ACLK_Main
2022-01-27 16:25:23: netdata INFO  : ACLK_Main : Starting ACLK-NG
...
2022-01-27 16:50:05: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 127/130 bytes -2%, prep/sent/total = 0.06/0.07/0.13 ms) 200 '//api/v1/alarms'
2022-01-27 16:50:16: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 126/130 bytes -3%, prep/sent/total = 0.04/0.06/0.10 ms) 200 '//api/v1/alarms'
2022-01-27 16:50:27: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 127/130 bytes -2%, prep/sent/total = 0.10/0.08/0.18 ms) 200 '//api/v1/alarms'
2022-01-27 16:50:28: 82: 229 '[localhost]:51792' 'CONNECTED'
2022-01-27 16:50:28: 82: 229 '[localhost]:51792' 'DISCONNECTED'
2022-01-27 16:50:28: 82: 229 '[localhost]:51792' 'DATA' (sent/all = 4659/4659 bytes -0%, prep/sent/total = 0.30/0.44/0.74 ms) 200 '/api/v1/info'
2022-01-27 16:50:38: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 126/130 bytes -3%, prep/sent/total = 0.06/0.05/0.11 ms) 200 '//api/v1/alarms'
2022-01-27 16:50:49: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 127/130 bytes -2%, prep/sent/total = 0.04/0.06/0.10 ms) 200 '//api/v1/alarms'
2022-01-27 16:51:00: 80: 229 '[52.4.252.97]:56820' 'DATA' (sent/all = 126/130 bytes -3%, prep/sent/total = 0.06/0.06/0.12 ms) 200 '//api/v1/alarms'
2022-01-27 16:51:03: 81: 229 '[52.4.252.97]:56866' 'DISCONNECTED'
2022-01-27 16:51:03: 79: 229 '[52.4.252.97]:56800' 'DISCONNECTED'
2022-01-27 16:51:03: 78: 229 '[52.4.252.97]:56794' 'DISCONNECTED'
2022-01-27 16:51:03: 77: 229 '[52.4.252.97]:56782' 'DISCONNECTED'
2022-01-27 16:51:03: 68: 229 '[52.4.252.97]:59864' 'DISCONNECTED'

Well, there is nothing helpful in the log you sent. Please grep it for cgroup: are there any lines mentioning cgroup-name.sh when the cgroups are not available in the dashboard? Is there any information about the thread being stopped?

The cgroups thread keeps running but logs nothing.
When the cgroups are not available, I don’t see any cgroup-name.sh lines in the log.

root@289f739:~# balena logs netdata_4487554_2052594 2>&1 | grep cgroup
2022-01-28 07:45:32: netdata INFO  : PLUGIN[cgroups] : thread created with task id 220
2022-01-28 07:45:32: netdata INFO  : PLUGIN[cgroups] : set name of thread 220 to PLUGIN[cgroups]
root@289f739:~# balena logs netdata_4487554_2052594 2>&1 --tail 4
2022-01-28 08:03:46: 35: 227 '[52.4.252.97]:53968' 'DATA' (sent/all = 1059/6381 bytes -83%, prep/sent/total = 0.41/4.52/4.93 ms) 200 '/api/v1/data'
2022-01-28 08:03:46: 27: 227 '[52.4.252.97]:34794' 'DATA' (sent/all = 128/130 bytes -2%, prep/sent/total = 0.07/0.95/1.02 ms) 200 '//api/v1/alarms'
2022-01-28 08:03:57: 27: 227 '[52.4.252.97]:34794' 'DATA' (sent/all = 127/130 bytes -2%, prep/sent/total = 0.04/0.06/0.10 ms) 200 '//api/v1/alarms'
2022-01-28 08:04:08: 27: 227 '[52.4.252.97]:34794' 'DATA' (sent/all = 127/130 bytes -2%, prep/sent/total = 0.04/0.05/0.09 ms) 200 '//api/v1/alarms'
root@289f739:~# ps aux | grep /netdata
201         1431  1.3  2.2 112236 45668 pts/0    SNsl+ 07:45   0:16 /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set web web files group root -W set web web files owner root
201         1627  0.0  0.3  30356  7356 pts/0    SNl+ 07:45   0:00 /usr/sbin/netdata --special-spawn-server
201         2021  0.0  0.0   2440  1880 pts/0    SN+  07:45   0:00 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
root        2026  0.2  0.1  54588  3124 pts/0    SN+  07:45   0:03 /usr/libexec/netdata/plugins.d/apps.plugin 1
201         2029  0.0  1.2 731336 24652 pts/0    SNl+ 07:45   0:00 /usr/libexec/netdata/plugins.d/go.d.plugin 1
201         2030  0.1  1.4  33872 29676 pts/0    SNl+ 07:45   0:01 /usr/bin/python3 /usr/libexec/netdata/plugins.d/python.d.plugin 1
root        7392  0.0  0.0   4236  1060 pts/1    S+   08:04   0:00 grep /netdata
root@289f739:~# ps -T -p 1431
    PID    SPID TTY          TIME CMD
   1431    1431 pts/0    00:00:08 netdata
   1431    1626 pts/0    00:00:00 netdata
   1431    1983 pts/0    00:00:00 netdata
   1431    1986 pts/0    00:00:00 GLOBAL_STATS
   1431    1987 pts/0    00:00:01 PLUGIN[proc]
   1431    1988 pts/0    00:00:00 PLUGIN[diskspac
   1431    1989 pts/0    00:00:00 PLUGIN[timex]
   1431    1990 pts/0    00:00:00 PLUGIN[cgroups]
   1431    1991 pts/0    00:00:00 PLUGIN[tc]
   1431    1992 pts/0    00:00:03 PLUGIN[idlejitt
   1431    1993 pts/0    00:00:00 STATSD
   1431    1994 pts/0    00:00:00 ACLK_Main
   1431    1997 pts/0    00:00:00 WEB_SERVER[stat
   1431    1998 pts/0    00:00:00 PLUGINSD
   1431    1999 pts/0    00:00:00 HEALTH
   1431    2000 pts/0    00:00:00 SERVICE
   1431    2001 pts/0    00:00:00 netdata
   1431    2003 pts/0    00:00:01 PLUGINSD[apps]
   1431    2006 pts/0    00:00:00 PLUGINSD[python
   1431    2007 pts/0    00:00:00 PLUGINSD[go.d]
   1431    2018 pts/0    00:00:00 STATSD_COLLECTO
   1431    6833 pts/0    00:00:00 netdata
   1431    6834 pts/0    00:00:00 netdata
   1431    6835 pts/0    00:00:00 netdata
   1431    6836 pts/0    00:00:00 netdata
root@289f739:~# cat /proc/1431/task/1990/cmdline 
/usr/sbin/netdata-unetdata-D-s/host-p19999-Wsetwebweb files grouproot-Wsetwebweb files ownerroot
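The arguments in the last output run together because `/proc/<pid>/cmdline` separates them with NUL bytes, and every thread reports the command line of the whole process. A small sketch for printing it readably (the function name `readable_cmdline` is made up for this example); the thread's own name can be read from the `comm` file next to it.

```shell
#!/bin/sh
# Print a NUL-separated /proc cmdline file with spaces between the arguments.
readable_cmdline() {
    # $1 is a cmdline file, e.g. /proc/1431/task/1990/cmdline
    tr '\0' ' ' < "$1"
    # cmdline has no trailing newline, so add one for the terminal.
    echo
}

# Usage (commented out so that sourcing this file has no side effects):
# readable_cmdline /proc/1431/task/1990/cmdline
# cat /proc/1431/task/1990/comm    # thread name, here PLUGIN[cgroups]
```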