Need some help with cgroups configuration for qemu VMs

Hi there,

I installed Netdata on a host where I run plenty of Docker containers, and out of the box I can see stats for each container:
[screenshot]

As a new user I can only add one image per post :frowning:, hence the reply to my own topic.

Next I installed Netdata on a host where I run KVMs (using Proxmox, which is based on QEMU), but here I don’t see stats per KVM, only all of them lumped together:
[screenshot]

I found the instructions (cgroups.plugin | Learn Netdata) a bit beyond my understanding. Maybe someone can help me configure Netdata so that resource usage is grouped by VM?

I don’t know if it will help, but try tweaking these settings in netdata.conf:

[plugin:cgroups]
	enable by default cgroups matching =  !*/init.scope  !/system.slice/run-*.scope  *.scope  /machine.slice/*.service  /kubepods/pod*/*  /kubepods/*/pod*/*  !/kubepods*  !*/vcpu*  !*/emulator  !*.mount  !*.partition  !*.service  !*.socket  !*.slice  !*.swap  !*.user  !/  !/docker  !/libvirt  !/lxc  !/lxc/*/*  !/lxc.monitor*  !/lxc.pivot  !/lxc.payload  !/machine  !/qemu  !/system  !/systemd  !/user  *
	search for cgroups in subpaths matching =  !*/init.scope  !*-qemu  !*.libvirt-qemu  !/init.scope  !/system  !/systemd  !/user  !/user.slice  !/lxc/*/*  !/lxc.monitor  !/lxc.payload/*/*  !/lxc.payload.*  *
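Both settings are Netdata simple patterns: whitespace-separated globs where * matches any characters, a leading ! makes a pattern negative, and the first pattern that matches decides. The bash sketch below mimics that rule so you can check a cgroup path by hand. The helper name sp_first_match is hypothetical (not part of Netdata), and it assumes the plugin compares paths relative to the cgroup mount point, such as /qemu.slice/100.scope.

```shell
#!/usr/bin/env bash
# Rough sketch of how a Netdata simple pattern list is evaluated, assuming:
#   - patterns are separated by whitespace
#   - '*' matches any run of characters, including '/'
#   - a leading '!' makes a pattern negative
#   - the first pattern that matches decides the result
# sp_first_match is a hypothetical helper, not part of Netdata.

sp_first_match() (             # subshell body, so 'set -f' does not leak out
  set -f                       # don't expand '*' against files when splitting
  path=$1
  for pat in $2; do            # unquoted on purpose: split on whitespace
    verdict=yes
    if [[ $pat == '!'* ]]; then
      verdict=no
      pat=${pat:1}             # drop the leading '!'
    fi
    if [[ $path == $pat ]]; then   # unquoted $pat is matched as a glob
      echo "$verdict ($pat)"
      return 0
    fi
  done
  echo "no (no pattern matched)"
)

# Values copied from the [plugin:cgroups] settings above
enable='!*/init.scope !/system.slice/run-*.scope *.scope /machine.slice/*.service
  /kubepods/pod*/* /kubepods/*/pod*/* !/kubepods* !*/vcpu* !*/emulator !*.mount
  !*.partition !*.service !*.socket !*.slice !*.swap !*.user !/ !/docker !/libvirt
  !/lxc !/lxc/*/* !/lxc.monitor* !/lxc.pivot !/lxc.payload !/machine !/qemu
  !/system !/systemd !/user *'
search='!*/init.scope !*-qemu !*.libvirt-qemu !/init.scope !/system !/systemd
  !/user !/user.slice !/lxc/*/* !/lxc.monitor !/lxc.payload/*/* !/lxc.payload.* *'

sp_first_match /qemu.slice/100.scope "$enable"   # yes (*.scope)
sp_first_match /qemu.slice "$enable"             # no (*.slice)
sp_first_match /qemu.slice "$search"             # yes (*)
```

Under these assumed rules, a VM scope like /qemu.slice/100.scope is already enabled by *.scope, and /qemu.slice is not charted itself (!*.slice) but is still searched for subdirectories by the final *. !*-qemu never matches these paths, so removing it would not be expected to change anything on a host with this layout.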


Thanks @vlvkobal, this is the answer. Try removing the !*-qemu entry from the configuration and then restart Netdata.

Quoting the original post: “but here I don’t see the stats per KVM but rather altogether”

@vlvkobal, your fix filters out all KVM/QEMU stats, doesn’t it?

I found this issue: QEMU metrics incorrect and crashing dashboard · Issue #9254 · netdata/netdata · GitHub

Thanks everyone. I removed the “!*-qemu” entry from both lines (enable by default cgroups matching = and search for cgroups in subpaths matching =) and restarted Netdata.

I’m not accessing Netdata locally, only through Netdata Cloud. After a few refreshes I still see:
[screenshot]

That isn’t a fix; it’s just our default config. I don’t know what the cgroup structure of QEMU VMs looks like, which is why I suggested tweaking the configuration parameters rather than giving a more specific suggestion.

Thanks for the pointer, I understand. I had hoped someone here uses QEMU with Netdata and could provide more help.

Let’s give it a few days, maybe someone will spot this thread :slight_smile:

This is the cgroup directory structure on my server:

[ilyam@pc ~]$ tree -d -L 5 /sys/fs/cgroup/cpu/
/sys/fs/cgroup/cpu/
├── dev-hugepages.mount
├── dev-mqueue.mount
├── docker
│   └── d9484ad246a254c96d52f9e3a0b414a80026bf5a17c560cc648d640cd9e98793
├── init.scope
├── -.mount
├── proc-sys-fs-binfmt_misc.mount
├── sys-fs-fuse-connections.mount
├── sys-kernel-config.mount
├── sys-kernel-debug.mount
├── sys-kernel-tracing.mount
├── system.slice
│   ├── apparmor.service
│   ├── avahi-daemon.service
│   ├── avahi-daemon.socket
│   ├── boot-efi.mount
│   ├── cronie.service
│   ├── dbus.service
│   ├── dbus.socket
│   ├── dm-event.socket
│   ├── docker.service
│   ├── docker.socket
│   ├── kmod-static-nodes.service
│   ├── lm_sensors.service
│   ├── lvm2-lvmpolld.socket
│   ├── lvm2-monitor.service
│   ├── ModemManager.service
│   ├── NetworkManager.service
│   ├── NetworkManager-wait-online.service
│   ├── ntpd.service
│   ├── polkit.service
│   ├── rtkit-daemon.service
│   ├── run-user-1000-gvfs.mount
│   ├── run-user-1000.mount
│   ├── sddm.service
│   ├── smartd.service
│   ├── snapd.apparmor.service
│   ├── snapd.socket
│   ├── sshd.service
│   ├── systemd-binfmt.service
│   ├── systemd-coredump.socket
│   ├── systemd-journald-audit.socket
│   ├── systemd-journald-dev-log.socket
│   ├── systemd-journald.service
│   ├── systemd-journald.socket
│   ├── systemd-journal-flush.service
│   ├── systemd-logind.service
│   ├── systemd-modules-load.service
│   ├── systemd-random-seed.service
│   ├── systemd-remount-fs.service
│   ├── systemd-rfkill.socket
│   ├── systemd-sysctl.service
│   ├── systemd-tmpfiles-setup-dev.service
│   ├── systemd-tmpfiles-setup.service
│   ├── systemd-udevd-control.socket
│   ├── systemd-udevd-kernel.socket
│   ├── systemd-udevd.service
│   ├── systemd-udev-trigger.service
│   ├── systemd-update-utmp.service
│   ├── systemd-user-sessions.service
│   ├── system-getty.slice
│   ├── system-modprobe.slice
│   ├── system-systemd\x2dcoredump.slice
│   ├── system-systemd\x2dfsck.slice
│   ├── tlp.service
│   ├── tmp.mount
│   ├── udisks2.service
│   ├── upower.service
│   └── wpa_supplicant.service
└── user.slice

70 directories

I have one Docker container running, and we can see it here:

├── docker
│   └── d9484ad246a254c96d52f9e3a0b414a80026bf5a17c560cc648d640cd9e98793

@ovi, let’s see what you have.

Thanks!
I cut the output of that command since the qemu part is at the top. That should give you enough info, or do you need the rest?

tree -d -L 5 /sys/fs/cgroup/cpu/

/sys/fs/cgroup/cpu/
├── init.scope
├── qemu.slice
│   ├── 100.scope
│   ├── 101.scope
│   └── 102.scope
├── system.slice
│   ├── apparmor.service
│   ├── blk-availability.service
│   ├── console-setup.service
│   ├── cron.service

Just to clarify: I expected to see stats for the 3 VMs grouped separately under the qemu menu in Netdata, just as they are shown for each Docker container.
(Just making sure I explained things properly in my first two posts.)

This is one sample container:
[screenshot]

The qemu stats are all lumped together, not grouped by “slice”:
[screenshot]

@ovi

Please show the output of ls -l /etc/pve/qemu-server/

We expect to see /etc/pve/qemu-server/100.conf (and 101.conf, 102.conf) in there. We also need the content of /etc/pve/qemu-server/100.conf to identify the problem, so please share it.

ls -l /etc/pve/qemu-server/

total 2
-rw-r----- 1 root www-data  396 Apr 12 05:51 100.conf
-rw-r----- 1 root www-data 1012 Apr 12 08:50 101.conf
-rw-r----- 1 root www-data  382 Apr 12 13:11 102.conf


 cat /etc/pve/qemu-server/100.conf

#Ubuntu 20.4
#mailcow
agent: 1
boot: order=ide2;scsi0;net0
cores: 2
ide2: none,media=cdrom
memory: 9216
name: mail
net0: virtio=CA:18:5B:74:78:78,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,discard=on,size=128G
scsihw: virtio-scsi-pci
smbios1: uuid=34fdaae9-f689-4903-8f17-f683578ee6bb
sockets: 1
startup: up=60,down=60
vmgenid: f8209b9e-fc9e-4a2f-9c51-bb5995a2c5fb

We suspect that the problem is the cgroups name resolution script (cgroup-name.sh).

A quick fix is to disable name resolution for qemu cgroups. Add this line to the [plugin:cgroups] section of netdata.conf:

run script to rename cgroups matching =  !/  !*.mount  !*.socket  !*.partition  /machine.slice/*.service  !*.service  !*.slice  !*.swap  !*.user  !init.scope  !*.scope/vcpu*  !*.scope/emulator  *.scope  *docker*  *lxc*  !*qemu*  *kubepods*  *.libvirt-qemu  *

Don’t forget to restart the Netdata service after the changes.
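For Proxmox VMs, a readable name has to come from the VM’s config file: the 100.scope cgroup under qemu.slice corresponds to VM ID 100, and /etc/pve/qemu-server/100.conf contains name: mail. The function below is a hypothetical sketch of that lookup (pve_vm_name is an illustrative name, and this is not the actual code of cgroup-name.sh). It returns status 1 when the config file is missing or unreadable for the user running it, which is the situation in which a renamer cannot produce a VM name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: resolve a Proxmox VM cgroup such as /qemu.slice/100.scope
# to the value of the "name:" line in its config file.
# Return codes: 0 = ok, 1 = config missing/unreadable, 2 = not a VM scope.

pve_vm_name() {
  local cgroup=$1 confdir=${2:-/etc/pve/qemu-server}
  local scope id conf
  scope=${cgroup##*/}                  # "/qemu.slice/100.scope" -> "100.scope"
  id=${scope%.scope}                   # "100.scope" -> "100"
  [[ $id =~ ^[0-9]+$ ]] || return 2    # anything else is not a Proxmox VM scope
  conf="$confdir/$id.conf"
  if [[ ! -r $conf ]]; then            # missing, or no read permission for us
    echo "cannot read $conf" >&2
    return 1
  fi
  # Print the value of the first "name:" line, e.g. "mail"
  sed -n 's/^name:[[:space:]]*//p' "$conf" | head -n 1
}
```

Run as root on this host, pve_vm_name /qemu.slice/100.scope would be expected to print mail. Run as a user outside the www-data group, the -r test fails, because the files listed above are -rw-r----- and owned by root:www-data.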

In my /etc/netdata/netdata.conf I already had that line, but commented out. I uncommented it and compared it to yours: they are identical, but there is still no change.
[screenshot]

Just to summarize:

In these two lines I have removed the “!*-qemu” entries:

enable by default cgroups matching
search for cgroups in subpaths matching

while this line:
run script to rename cgroups matching

still contains it.

The default run script to rename cgroups matching contains *qemu*; I changed it to !*qemu*.

Sorry, my mistake, and thanks for spotting it. But even with that change I see no result :frowning:

Just to be sure, you’ve restarted the Netdata service after the changes?

Anyway, let’s check the logs:

grep -i cgroup-name error.log

Yes, I did, after every change.

Thanks for the pointer to the Netdata logs. I guess now it’s all clear:

/var/log/netdata# grep -i cgroup-name error.log
2021-04-12 13:05:24: cgroup-name.sh: ERROR: proxmox config file missing /etc/pve/qemu-server/100.conf or netdata does not have read access.  Please ensure netdata is a member of www-data group.

Fixed by adding the netdata user to the www-data group:
adduser netdata www-data

Check:
groups netdata
netdata : netdata adm proxy www-data ceph

Restarted Netdata and it works:

[screenshot]
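To double-check a fix like this from the agent’s side, a small script can confirm both the group membership and that each VM config is actually readable as the netdata user. This is a sketch: in_group and check_pve_access are hypothetical helpers, and the read test assumes util-linux runuser is available (as on Debian-based Proxmox) and that the script runs as root. Note that id reads the group database, so it shows the new membership immediately, while an already running Netdata process only gets the new group after a restart.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a user is in a group and can read the Proxmox VM configs.
# in_group and check_pve_access are hypothetical helpers, not Netdata tools.

in_group() {
  # id -nG prints all group names of the user (primary included), space-separated
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

check_pve_access() {
  local user=${1:-netdata} dir=${2:-/etc/pve/qemu-server} f
  if in_group "$user" www-data; then
    echo "$user is a member of www-data"
  else
    echo "$user is NOT a member of www-data"
  fi
  for f in "$dir"/*.conf; do
    [[ -e $f ]] || continue            # the glob matched nothing
    # runuser (util-linux) runs the test as $user; this must be run as root
    if runuser -u "$user" -- test -r "$f"; then
      echo "readable:   $f"
    else
      echo "unreadable: $f"
    fi
  done
}
```

Running check_pve_access netdata as root should report the www-data membership and list 100.conf, 101.conf and 102.conf as readable once the group change is in place.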

Do you reckon I still need all the other edits we made during this thread, or is adding netdata to the www-data group enough?