LXC Containers Stats Are Not Shown

Thanks, I’d like to also see collector.log

Also get your netdata.conf (just wget/curl it, as described in the default file) and paste here the following two lines from it:

enable by default cgroups matching = ....
search for cgroups in subpaths matching = ...

These are the two lines that control which patterns netdata will match. I’d be extremely surprised to see them be different in the docker image, but I can’t think of any other explanation for why the containers wouldn’t appear. We’ll see if collector.log says anything interesting.

You can also play with these two settings, if you like, to see if you can get them to match the directory structure we see here (though again, why would it match when running inside docker is the key question).

The values of the search and enable params follow the netdata simple patterns syntax. If you uncomment the line, you can start making changes and try them out by restarting netdata. Commenting it again and restarting reverts to the default settings.
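To make the first-match-wins semantics concrete, here is a rough shell sketch of how these patterns behave (my approximation, not Netdata’s actual code: patterns are space-separated globs, a leading ! negates, and the first pattern that matches decides):

```shell
# Rough emulation of netdata simple patterns:
# space-separated glob patterns, leading '!' negates, FIRST match wins.
match() {
  name="$1"; shift
  for pat in "$@"; do
    neg=0
    case "$pat" in !*) neg=1; pat="${pat#!}";; esac
    case "$name" in
      $pat) if [ "$neg" -eq 1 ]; then echo excluded; else echo included; fi; return;;
    esac
  done
  echo excluded   # nothing matched at all
}

match /lxc.payload.web1 '!/lxc.payload.*' '*'   # excluded: the negation matches first
match /lxc.payload.web1 '/lxc.payload.*' '*'    # included
```

The important consequence: a trailing * only admits names that no earlier !pattern has already rejected, so the order of exclusions matters.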

We’re still scratching our head here, but haven’t given up :slight_smile:

Sure. I’m including both files.

Those lines are commented out (we didn’t touch them on the hosts that show container metrics without any issues). Previously, I played with those settings when ilyam8 said we should look for the cgroups with the names lxc.payload.*

That didn’t work.

Were those explicitly stated (or enabled) in the docker image you guys shared?

@philip ilyam8/netdata-test-for-github image is the latest master + additional logging (no other changes)

If you want to compile/install that branch:

git clone --recursive https://github.com/ilyam8/netdata nd_ilyam8
cd nd_ilyam8
git checkout cgroups_debug_forum_topic_3616
sudo ./netdata-installer.sh

Okay. Do you want me to use the netdata.conf that’s on the host with the docker container, or do you want me to deploy the image with other parameters? Or do I use the same docker command you posted earlier in this thread?

Assuming you’re asking me: I didn’t mean for you to use the docker image again. I just shared the branch I used to build the docker image, in case you want to install that version (with additional logging) on the host. From the logs you shared (docker container), I didn’t spot any problems.

@ilyam8

All right. I’ll install that branch directly on the host and get back to you. I’ll post the error and debug logs.

@ilyam8

Could you check these logs I sent (I also posted the collector logs)? I’ve compiled, enabled debugging, and installed Netdata directly on the host.

@ilyam / @ferroin I see the following:

2023-02-15 10:16:18: netdata INFO  : P[cgroups] : (0274@collectors/cgroups.p:cgroups_try_det): cgroups v2 (unified cgroups) is available but are disabled on this system.
2023-02-15 10:16:18: netdata INFO  : P[cgroups] : (0314@collectors/cgroups.p:read_cgroup_plu): use unified cgroups false
2023-02-15 10:19:19: netdata INFO  : P[cgroups] : (4750@collectors/cgroups.p:cgroup_main_cle): cleaning up...
2023-02-15 10:19:19: netdata INFO  : P[cgroups] : (4755@collectors/cgroups.p:cgroup_main_cle): stopping discovery thread worker

The message is coming from the code that was added in Improve cgroups collector to autodetect unified cgroups by underhood · Pull Request #9249 · netdata/netdata · GitHub

This is going over my head now, but I did find the following specifically for lxc containers and unified/hybrid modes:

https://wiki.debian.org/LXC/CGroupV2

Now, I think this could potentially explain why the behavior is different when running in a container: the code that autodetects the cgroup version seems to just execute some commands wherever netdata is running. Inside the container it’s quite possible to get one answer, and a different answer when running on the host.
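One quick way to test this theory (my suggestion, not something from the netdata docs) is to compare the filesystem type of the cgroup mount point on the host and inside the container; cgroup2fs indicates a unified v2 hierarchy, while tmpfs indicates a v1 or hybrid layout:

```shell
# Filesystem type of the cgroup mount point:
#   cgroup2fs -> unified cgroups v2
#   tmpfs     -> cgroups v1 (or hybrid) hierarchy
stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "(/sys/fs/cgroup not mounted)"

# Run the same command on the host and inside the docker container;
# if the answers differ, autodetection will differ too.
```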

If I’m right, I expect that all that is needed is the following:

[plugin:cgroups]
    use unified cgroups = yes
    path to unified cgroups = /sys/fs/cgroup

which is described here

Can you give this a shot @philip ?

@Christopher_Akritid1

We intentionally disabled cgroup v2 by specifying “systemd.unified_cgroup_hierarchy=0” in the grub file as LXD had problems with cgroup v2.

I’ll try setting that parameter to “no” and see if that works (on another host where netdata doesn’t show metrics, since the current host has debugging enabled).

@Christopher_Akritid1

That didn’t work. I set “use unified cgroups” from “auto” to “no” and restarted Netdata, but I still don’t see container metrics. Does any other parameter have to be changed?

@Christopher_Akritid1

We have another host: the one I tested on by setting “use unified cgroups” to “no”. That host has no “systemd.unified_cgroup_hierarchy=0” parameter set and is running Focal, yet its Netdata didn’t show container metrics either.

:man_shrugging: :smile:

We are pretty stumped here.

Let me suggest a trial-and-error approach:

1. Take the following two lines from netdata.conf and create uncommented versions of them right below.
2. Start removing exclusions (!xyz) that might be relevant, to tell Netdata to process and display more folders.

If it shows them (along with the other garbage it will display), we’ll know it’s the filters. If not, we’ll think of something else. The ideal version of this test is to just put a * in each line, so that everything is processed, but that would create a lot of garbage and I’m not sure how it would affect your system, so I would not do it in production.
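For reference, the extreme version of the test described above (everything processed; again, not for production) would be an uncommented override like this, assuming these settings live in the [plugin:cgroups] section as in the earlier snippet:

    [plugin:cgroups]
        search for cgroups in subpaths matching = *
        enable by default cgroups matching = *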

enable by default cgroups matching = ...
search for cgroups in subpaths matching = ...

So see if you can tweak these lines to make it match and we’ll see if we can dig any deeper.

Here is the line:

search for cgroups in subpaths matching =  !*/init.scope  !*-qemu  !*.libvirt-qemu  !/init.scope  !/system  !/systemd  !/user  !/user.slice  !/lxc/*/*  !/lxc.monitor  !/lxc.payload/*/*  !/lxc.payload.*  *

Do I remove the exclamation mark or the field entirely? For example, do I remove “!/lxc.payload/*/*” and “!/lxc.payload.*” entirely, or just the exclamation mark? I believe, as ilyam8 said, the payload dirs have to be included in the search.

Hmm… checking. I just removed the exclamation marks and restarted netdata. Giving it a few minutes.

@Christopher_Akritid1

Still no.

Here is the search pattern I’ve set:

search for cgroups in subpaths matching =  !*/init.scope  !*-qemu  !*.libvirt-qemu  !/init.scope  !/system  !/systemd  !/user  !/user.slice  !/lxc/*/*  /lxc.monitor  /lxc.payload/*/*  /lxc.payload.*  *

As you can see, I allowed the lxc payload and monitor patterns.

And under “/sys/fs/cgroup/devices” I see the “lxc.monitor.<container_name>” and “lxc.payload.<container_name>” directories, and they have read permissions for other users.
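For what it’s worth, here is the kind of spot-check I ran (a hypothetical helper of my own; the paths assume the v1 devices controller as above, and netdata is assumed to run as the netdata user):

```shell
# Report whether each LXC cgroup directory exists and is readable
# by the current user.
check_cgroup_dir() {
  [ -d "$1" ] || { echo "$1: missing"; return; }
  [ -r "$1" ] && echo "$1: readable" || echo "$1: NOT readable"
}

for d in /sys/fs/cgroup/devices/lxc.payload.* /sys/fs/cgroup/devices/lxc.monitor.*; do
  check_cgroup_dir "$d"
done
# Also worth running as the collector's user, e.g.:
#   sudo -u netdata sh check_cgroups.sh
```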

Netdata simple patterns explains the syntax.

In enable by default cgroups matching I’d remove all the !xyz entries that contain lxc, though I had checked the tree and supposedly they don’t match anyway.

The search pattern looks OK to me based on your directory tree; just in case, perhaps also remove !/lxc/*/*

Also, after restarting, look at collector.log again and see if there’s any difference from the previous runs. We really want it to start logging some new cgroups there.

@Christopher_Akritid1

Here are the patterns:

search for cgroups in subpaths matching =  !*/init.scope  !*-qemu  !*.libvirt-qemu  !/init.scope  !/system  !/systemd  !/user  !/user.slice  !/lxc/*/*  /lxc.monitor  /lxc.payload/*/*  /lxc.payload.*  *
enable by default cgroups matching =  !*/init.scope  !/system.slice/run-*.scope  *.scope  /machine.slice/*.service  */kubepods/pod*/*  */kubepods/*/pod*/*  */*-kubepods-pod*/*  */*-kubepods-*-pod*/*  !*kubepods* !*kubelet* !*/vcpu*  !*/emulator  !*.mount  !*.partition  !*.service  !*.socket  !*.slice  !*.swap  !*.user  !/  !/docker  !*/libvirt  /lxc  /lxc/*/*  /lxc.monitor*  !/lxc.pivot  /lxc.payload  !/machine  !/qemu  !/system  !/systemd  !/user  *

The result is the same.

I didn’t gather the collector logs, as I’m modifying these on a different host where debugging is not enabled.

I’ll soon do this on the debugging-enabled host and will post the logs to you.

Hey @philip. Can we have a call to debug the problem later this week? Do you have Discord? If so join Netdata and DM me (ilyam#0065).

@Christopher_Akritid1

Now we have a new problem. When we previously started having disk space issues after enabling debugging, we nuked the logs directory. So, since you asked for the collector logs after setting the search pattern, I made the changes and restarted Netdata, but it won’t start. Here are the journalctl logs:

Feb 21 08:23:50 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:23:50 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:23:50 worf netdata[3565232]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:23:50 worf netdata[3565232]: 2023-02-21 08:23:50: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:23:50 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:23:50 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:24:20 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 1.
Feb 21 08:24:20 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:24:20 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:24:20 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:24:20 worf netdata[3597576]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:20 worf netdata[3597576]: 2023-02-21 08:24:20: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:20 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:24:20 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:24:51 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 2.
Feb 21 08:24:51 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:24:51 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:24:51 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:24:51 worf netdata[3632365]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:51 worf netdata[3632365]: 2023-02-21 08:24:51: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:24:51 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:24:51 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:21 worf systemd[1]: netdata.service: Scheduled restart job, restart counter is at 3.
Feb 21 08:25:21 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:25:21 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:25:21 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:25:21 worf netdata[3667241]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:21 worf netdata[3667241]: 2023-02-21 08:25:21: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:21 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:25:21 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:37 worf systemd[1]: Stopped Real time performance monitoring.
Feb 21 08:25:37 worf systemd[1]: Starting Real time performance monitoring...
Feb 21 08:25:37 worf systemd[1]: Started Real time performance monitoring.
Feb 21 08:25:37 worf netdata[3684965]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:37 worf netdata[3684965]: 2023-02-21 08:25:37: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Feb 21 08:25:37 worf systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
Feb 21 08:25:37 worf systemd[1]: netdata.service: Failed with result 'signal'.
Feb 21 08:25:50 worf systemd[1]: Stopped Real time performance monitoring.

I reinstalled Netdata (building from source), but it’s still not starting. Even after the reinstall, the journalctl logs say it’s being killed (SIGSEGV). The host we’re running Netdata on is definitely not running out of resources.

I thought there might be some mistake in the search pattern, so I commented those lines out and restarted. It’s still not starting.

As stated earlier, we can give you access to our host. If you want to take a look, please send your public key to support@webdock.io and we’ll add it to our host.