Netdata Community

How to debug container name resolution for the Netdata Agent

A couple of days ago I started a new container-based project and, naturally, installed Netdata to get per-second metrics with minimal effort.

This was crucial, as the project concerned Urbit, a fascinating project that wants to reinvent the personal computer. We won’t go into detail, but it’s a sort of VM that I deployed on my Raspberry Pi using balena.

Balena is a container-based platform for managing IoT devices and the lifecycle of their applications, so I deployed two containers on my Raspberry Pi: one running Urbit and one running Netdata.

To deploy Netdata, I defined a docker-compose file based on the documentation and a Dockerfile that uses netdata/netdata as its base image.

Why Modify the Netdata Dockerfile?

Netdata’s Dockerfile is robust (it was created by our own @Austin_Hemmelgarn), but if you want to customize your Netdata installation, you will need to create your own Dockerfile that uses ours as a base.

Customizing the Dockerfile is a great way to add custom configuration for Netdata, collectors, and alarms. This configuration is copied in every time you build the container, making it a great choice for automation. SSHing into the container to use ./edit-config is far from ideal :sweat_smile:

Another great reason to do this is the ability to load custom software into the container. For example, you might want a proper SSH server inside the container, so that you can SSH into it remotely.

How to fix container name resolution

It’s an issue that we see from time to time: Netdata fails to translate the container_id into the human-readable container name, making the integration considerably harder to use.

This boils down to a particular script that Netdata uses, named cgroup-name.sh. The Netdata Agent runs this script, which talks to various container daemons over HTTP or a Unix socket in order to find the name.

In our case, it communicates with the Docker daemon over the Docker socket.
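To make that concrete, here is a minimal sketch of this kind of lookup. It is not the actual cgroup-name.sh, just the same idea in miniature: ask the daemon's container inspect endpoint for a given ID and read the Name field. The function names and the example container ID are made up for illustration, and the sketch assumes curl (with --unix-socket support) and jq are available.

```shell
#!/bin/sh
# Sketch only: NOT the real cgroup-name.sh, just the same idea in miniature.

# Docker reports container names with a leading slash ("/netdata"); strip it.
strip_leading_slash() {
    printf '%s\n' "${1#/}"
}

# Ask a Docker-compatible daemon for a container's name over its Unix socket.
# $1 = socket path, $2 = container ID.
# Assumes curl and jq are installed (hypothetical helper, not a Netdata API).
container_name() {
    raw=$(curl -sS --unix-socket "$1" "http://localhost/containers/$2/json" | jq -r '.Name')
    # An empty answer or "null" means the lookup did not return a name.
    if [ -z "$raw" ] || [ "$raw" = "null" ]; then
        return 1
    fi
    strip_leading_slash "$raw"
}

# Example (the ID is a placeholder):
#   container_name /var/run/docker.sock 3f4e8a1b2c9d
```

If the user running this cannot read and write the socket, curl fails with a permission error and no name comes back, which is exactly the symptom this post is about.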

As per the documentation, there are a bunch of different ways to tackle this. I chose an option that is both somewhat secure and somewhat easy: adding the netdata user to the docker group, thus allowing Netdata to use the socket and get the information it needs.

The problem is that, in order to do that, we need to define the PGID of the docker group in the docker-compose.yml file (check the image above).

And here lies the gotcha.

In the Netdata docs, we use /etc/group as the source of truth to find the PGID of the docker group. But if we look at the startup script of the Netdata container, it creates the docker group and then adds the netdata user to it.

Thus, when we read /etc/group, we read back the value that we ourselves populated by defining the PGID and then running the default ENTRYPOINT of the Netdata container.

It’s a self-fulfilling prophecy.
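To see why, here is roughly what an /etc/group lookup amounts to. This is a sketch, not the exact command from the docs; group_gid is a hypothetical helper that prints the third colon-separated field for a given group name.

```shell
#!/bin/sh
# Print the GID recorded for a group in an /etc/group-style file.
# $1 = group name, $2 = file to read (defaults to /etc/group).
group_gid() {
    awk -F: -v name="$1" '$1 == name { print $3; exit }' "${2:-/etc/group}"
}

# Run inside the Netdata container, this just echoes back whatever PGID
# the startup script wrote into /etc/group:
#   group_gid docker
```

The lookup itself is fine. The catch is which /etc/group it reads: inside the container, the file says nothing about who actually owns the socket.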

How does container name resolution happen in Netdata?

As we have said, Netdata needs access to the Docker socket that lives at /var/run/docker.sock. This is the endpoint that the cgroup-name.sh script uses to translate IDs to names.

Thus, what we really want is for the netdata user to belong to the same group that owns that file.
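One way to check this from inside the container is to compare the socket's owning GID with the groups of the netdata user. The sketch below assumes a stat that understands -c '%g' (GNU coreutils or BusyBox; BSD/macOS stat uses different flags), and both helper names are hypothetical.

```shell
#!/bin/sh
# Print the numeric GID that owns a file (GNU/BusyBox stat syntax).
file_gid() {
    stat -c '%g' "$1"
}

# Succeed if GID $1 appears in the space-separated list $2.
gid_in_list() {
    for g in $2; do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

# Inside the container:
#   sock_gid=$(file_gid /var/run/docker.sock)
#   if gid_in_list "$sock_gid" "$(id -G netdata)"; then
#       echo "netdata is in the socket's group"
#   else
#       echo "mismatch: the PGID should be $sock_gid"
#   fi
```

Group membership only helps if the socket's permissions grant the group read and write access, so it is worth glancing at the mode bits as well.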

So the best way to tackle container name resolution is:

  1. Run netdata container without defining a PGID
  2. Open a shell in the container (e.g. docker exec -it <container> /bin/bash)
  3. Run ls -l /var/run. In the docker.sock row, the group column shows a number instead of a name. That number is the PGID that we are searching for
  4. Go back to our docker run or docker-compose.yml and modify the PGID
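The steps above can be sketched as shell commands. The container name netdata and the trimmed-down docker run line are assumptions for illustration (the documented command has more flags), and ls_group_field is a hypothetical helper. Using ls -ln instead of ls -l forces numeric IDs, so the group column is always a number.

```shell
#!/bin/sh
# Pull the group column (4th field) out of one line of `ls -ln` output.
ls_group_field() {
    awk '{ print $4 }'
}

# The four steps, with "netdata" as an assumed container name:
#   1. docker run -d --name netdata \
#          -v /var/run/docker.sock:/var/run/docker.sock:ro netdata/netdata
#   2. gid=$(docker exec netdata ls -ln /var/run/docker.sock | ls_group_field)
#   3. echo "socket group: $gid"
#   4. docker rm -f netdata
#      then start it again with the extra flag: -e PGID="$gid"
```

With docker-compose, step 4 means putting the same number in the PGID entry of the environment section and recreating the container.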

In the image below, the PGID has been set correctly. This means that the PGID of the docker group and the PGID of the group that owns docker.sock are the same.

As a result, ls -l no longer prints a number but the name of the group that has that PGID, in our case docker.

In conclusion

  1. If name resolution doesn’t work, it’s probably because the netdata user can’t access the socket.
  2. To fix this, run ls -l /var/run/ and find docker.sock. If the group is not properly set, it will show a number instead of docker. That number is the PGID. Note that alternative container solutions may have docker-compatible sockets with different names. For example, balena has balena-engine.sock. You will need to define the socket in the docker-compose.yml file, the custom Dockerfile, or docker run.
  3. Go back to docker-compose.yml or docker run and replace the PGID.
  4. Build the container(s) again.
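Since the socket name can differ between engines, a small helper can look for the candidates mentioned in this post. This is only a sketch: find_engine_socket is a hypothetical function, the candidate list is limited to the two names above, and it uses -e (exists) rather than the stricter -S (is a socket) to keep it easy to try out.

```shell
#!/bin/sh
# Print the first engine socket found in a directory, or fail if none exists.
# $1 = directory to search (defaults to /var/run).
find_engine_socket() {
    dir="${1:-/var/run}"
    for name in docker.sock balena-engine.sock; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example:
#   sock=$(find_engine_socket) && ls -ln "$sock"
```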

What do you think of Netdata + Containers? What would you like us to improve?

Comment below :point_down::point_down::point_down::point_down: