Also get your netdata.conf (just wget/curl it, like it says in the default file) and paste here the following two lines from it:
enable by default cgroups matching = ....
search for cgroups in subpaths matching = ...
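As a sketch of that fetch-and-inspect step (the localhost URL, port and /tmp path are the usual defaults, adjust as needed; Netdata serves its fully-resolved config at /netdata.conf on its listen port):

```shell
# Grab the running daemon's fully-resolved configuration (netdata
# serves it at /netdata.conf on its listen port; host/port assumed).
curl -fsS -o /tmp/netdata.conf http://localhost:19999/netdata.conf \
    || echo "netdata not reachable on localhost:19999"

# Show only the two cgroup pattern settings we care about.
grep -E 'enable by default cgroups matching|search for cgroups in subpaths matching' /tmp/netdata.conf \
    || echo "pattern lines not found"
```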
These are the two lines that control which cgroups netdata will match. I would be extremely surprised to see them being different in the docker image, but I can't think of any other explanation for why the containers wouldn't appear; we'll see if collector.log says anything interesting.
You can also play with these two settings, if you like, to see if you can get them to match the directory structure we see here (though again, the key question is why it would match when running inside docker but not on the host).
The values of the search and enable params are formatted according to the netdata simple patterns syntax. If you uncomment a line, you can make changes and try them out by restarting netdata; commenting it again and restarting will revert to the default settings.
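For context, simple patterns are a space-separated list of globs evaluated left to right: the first pattern that matches a name decides, and a leading `!` negates. So a hypothetical uncommented line like the one below would exclude /init.scope but let everything else through (the pattern is purely illustrative, not a recommendation):

```
[plugin:cgroups]
    search for cgroups in subpaths matching = !/init.scope *
```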
We're still scratching our heads here, but we haven't given up.
Those lines are commented out (we didn't touch them on the hosts that show container metrics without any issues). I previously played with those settings when ilyam8 said that we should look for the cgroups with the names lxc.payload.*
That didn’t work.
Were those explicitly stated (or enabled) in the docker image you guys shared?
Okay. Do you want me to use the netdata.conf that's on the host with the docker container, or do you want me to deploy the image with other parameters? Or do I use the same docker command you posted earlier in this thread?
Assuming you asked me: I didn't mean for you to use the docker image again. I just shared the branch I used to build the docker image, in case you want to install that version with additional logging on the host. From the logs you shared (docker container), I didn't spot any problems.
2023-02-15 10:16:18: netdata INFO : P[cgroups] : (0274@collectors/cgroups.p:cgroups_try_det): cgroups v2 (unified cgroups) is available but are disabled on this system.
2023-02-15 10:16:18: netdata INFO : P[cgroups] : (0314@collectors/cgroups.p:read_cgroup_plu): use unified cgroups false
2023-02-15 10:19:19: netdata INFO : P[cgroups] : (4750@collectors/cgroups.p:cgroup_main_cle): cleaning up...
2023-02-15 10:19:19: netdata INFO : P[cgroups] : (4755@collectors/cgroups.p:cgroup_main_cle): stopping discovery thread worker
Now, I think this could potentially explain why the behavior is different when running in a container, because the code that autodetects the cgroup version seems to just execute some commands wherever netdata is running. Inside the container it's quite possible to get one answer, and a different answer when running on the host.
If I am right, I expect that all that is needed is the following:
[plugin:cgroups]
use unified cgroups = yes
path to unified cgroups = /sys/fs/cgroup
We intentionally disabled cgroup v2 by specifying "systemd.unified_cgroup_hierarchy=0" in the grub file, as LXD had problems with cgroup v2.
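A quick way to confirm which hierarchy the host actually ended up with (a sketch; `stat -fc %T` is GNU coreutils, and the tmpfs case assumes the usual v1/hybrid layout):

```shell
# Detect which cgroup hierarchy the host mounts at /sys/fs/cgroup.
# On cgroup v2 (unified) it is a cgroup2 filesystem; with
# systemd.unified_cgroup_hierarchy=0 it is typically a tmpfs holding
# the per-controller v1 directories instead.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
case "$fstype" in
    cgroup2fs) echo "cgroup v2 (unified)" ;;
    tmpfs)     echo "cgroup v1 (hybrid/legacy)" ;;
    *)         echo "undetermined: $fstype" ;;
esac
```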
I’ll try setting that parameter to “no” and see if that works (on another host where netdata doesn’t show metrics as the current host has debugging enabled).
That didn't work. I have set "use unified cgroups" from "auto" to "no" and restarted Netdata, and I still do not see container metrics. Does any other parameter have to be changed?
We have another host: the one I tested on by setting unified cgroups to "no". It has no "systemd.unified_cgroup_hierarchy=0" parameter set and is running Focal. That Netdata didn't show container metrics either.
Take the following two lines from netdata.conf and create uncommented versions of them right below.
Then start removing exclusions (!xyz) that might be relevant, to tell Netdata to process and try to display more folders. If it shows them (along with other garbage it will display), we'll know that it's the filters; if not, we'll think of something else. The ideal version of this test is to just put a * in each line, so that everything is processed, but that would create a lot of garbage and I'm not sure how it would affect your system, so I would not do it in production.
enable by default cgroups matching = ...
search for cgroups in subpaths matching = ...
So see if you can tweak these lines to make it match and we’ll see if we can dig any deeper.
search for cgroups in subpaths matching = !*/init.scope !*-qemu !*.libvirt-qemu !/init.scope !/system !/systemd !/user !/user.slice !/lxc/*/* !/lxc.monitor !/lxc.payload/*/* !/lxc.payload.* *
Do I remove the exclamation or the field entirely? For example, do I remove "!/lxc.payload/*/*" and "!/lxc.payload.*" entirely, or just the exclamation mark? I believe, as ilyam8 said, that the payload dirs have to be included in the search.
search for cgroups in subpaths matching = !*/init.scope !*-qemu !*.libvirt-qemu !/init.scope !/system !/systemd !/user !/user.slice !/lxc/*/* /lxc.monitor /lxc.payload/*/* /lxc.payload.* *
As you can see, I allowed the lxc payload and monitor patterns.
And under "/sys/fs/cgroup/devices" I see the "lxc.monitor.<container_name>" and "lxc.payload.*" directories, and they have readable permissions for other users.
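That check can be repeated from the command line like this (a sketch; the "devices" controller path assumes the v1 layout described above):

```shell
# Check that the per-container LXC cgroups exist under the v1
# "devices" controller and what their permission bits are --
# netdata's discovery needs to be able to read these directories.
ls -ld /sys/fs/cgroup/devices/lxc.monitor.* \
       /sys/fs/cgroup/devices/lxc.payload.* 2>/dev/null \
    || echo "no lxc.* cgroups found under /sys/fs/cgroup/devices"
```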
Also, after restarting, look at collector.log again and see if there's any difference from the previous runs. We really want it to start logging some new cgroups there.
Now we have a new problem. When we previously started having disk space issues after enabling debugging, we nuked the logs directory. So, since you asked me for the collector logs after setting the search pattern, I made the changes and restarted Netdata, but it won't start. Here are the journalctl logs:
Feb 21 08:23:50 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:23:50 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:23:50 worf netdata[3565232]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:23:50 worf netdata[3565232]: 2023-02-21 08:23:50: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:23:50 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:23:50 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:24:20 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 1.
Feb 21 08:24:20 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:24:20 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:24:20 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:24:20 worf netdata[3597576]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:20 worf netdata[3597576]: 2023-02-21 08:24:20: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:20 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:24:20 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:24:51 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 2.
Feb 21 08:24:51 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:24:51 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:24:51 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:24:51 worf netdata[3632365]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:51 worf netdata[3632365]: 2023-02-21 08:24:51: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:51 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:24:51 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:21 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 3.
Feb 21 08:25:21 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:25:21 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:25:21 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:25:21 worf netdata[3667241]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:21 worf netdata[3667241]: 2023-02-21 08:25:21: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:21 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:25:21 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:37 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:25:37 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:25:37 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:25:37 worf netdata[3684965]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:37 worf netdata[3684965]: 2023-02-21 08:25:37: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:37 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:25:37 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:50 worf systemd[1]: Stopped Real time performance monitoring.
I reinstalled Netdata (building from source), but it's still not starting. The logs, even after the reinstall (as seen in journalctl), say that it's being killed. The host we're running Netdata on is definitely not running out of resources.
I also thought there might be some mistake in the search pattern, so I commented those lines out and restarted. It's still not starting.
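One way to dig into a SEGV like this, assuming systemd-coredump is installed on the host (if it isn't, core dumps may not have been captured at all), is to ask coredumpctl whether it recorded the crashes and pull a backtrace:

```shell
# List any core dumps recorded for the netdata daemon; assumes
# systemd-coredump is installed and handling kernel core_pattern.
coredumpctl list netdata 2>/dev/null \
    || echo "no core dumps recorded (is systemd-coredump installed?)"

# Show metadata and a stack trace for the most recent crash.
coredumpctl info netdata 2>/dev/null || true
```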
As stated earlier, we can give access to our host. If you want to look, please send your public key to support@webdock.io and we’ll add it to our host.