How to debug container name resolution for the Netdata Agent

A couple of days ago I started a new container-based project and, naturally, I installed Netdata to get per-second metrics with minimal effort.

This was crucial, as the project concerned Urbit, a fascinating project that wants to reinvent the personal computer. We won’t go into detail here, but it’s a sort of VM, which I deployed on my Raspberry Pi using balena.

Balena is a container-based platform for managing IoT devices and the lifecycle of their applications, so I deployed two containers on my Raspberry Pi: one running Urbit and one running Netdata on Docker.

To deploy Netdata, I defined a docker-compose file based on the documentation and a Dockerfile that used netdata/netdata as its base.
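As a rough sketch, the compose file looked something like this (the service layout follows the Netdata docs; the exact values here are illustrative, not the project’s real file):

```yaml
# docker-compose.yml (sketch)
version: '3'
services:
  netdata:
    build: .                 # build from the custom Dockerfile in this directory
    container_name: netdata
    ports:
      - 19999:19999          # Netdata web portal
    cap_add:
      - SYS_PTRACE           # needed for full process monitoring
    security_opt:
      - apparmor:unconfined
    volumes:
      - /proc:/host/proc:ro  # host metrics, read-only
      - /sys:/host/sys:ro
```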

Why Modify the Netdata Dockerfile?

Netdata’s Dockerfile is robust, created by our own @Austin_Hemmelgarn, but if you want to customize your Netdata installation, you will need to use it as a base and create your own.

Customizing the Dockerfile is a great way to add custom configuration for Netdata, collectors, and alarms. That configuration is copied in every time you build the container, making it a great choice for automation. SSHing into the container to use ./edit-config is far from ideal :sweat_smile:

Another great reason to do this is the ability to load custom software into the container. For example, you might want to have a proper ssh server inside the container, so that you can ssh into it remotely.
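A minimal custom Dockerfile in that spirit might look like this (the config file names are placeholders for whatever you want to bake in):

```dockerfile
# Use the official image as the base
FROM netdata/netdata:latest

# Bake custom configuration into the image so it survives rebuilds
# (the right-hand paths are where Netdata reads config inside the container)
COPY netdata.conf /etc/netdata/netdata.conf
COPY health_alarm_notify.conf /etc/netdata/health_alarm_notify.conf

# Optionally install extra software, e.g. an SSH server
# (the package manager depends on the base image's distro):
# RUN apk add --no-cache openssh
```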

How to fix container name resolution

It’s an issue that we see from time to time: Netdata fails to translate the container_id into the human-readable container name, making the integration considerably harder to use.

This boils down to a particular script that Netdata uses, named cgroup-name.sh. This script is run by the Netdata Agent and communicates over HTTP or a Unix socket with the various container daemons, in order to find the name.

In our case, it communicates with the docker daemon, over the docker socket.
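To illustrate the kind of query involved (this is a sketch, not the literal contents of cgroup-name.sh, and the container ID is a made-up example):

```shell
# Hypothetical short container ID, as it might appear in a cgroup path
id="a1b2c3d4e5f6"

# The Docker API endpoint that maps an ID to full container details
url="http://localhost/containers/${id}/json"
echo "$url"

# With read access to the socket, the name could then be fetched like this
# (needs a running Docker daemon, so it is only shown as a comment):
#   curl --unix-socket /var/run/docker.sock "$url" | jq -r '.Name'
```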

As per the documentation, there are several ways to tackle this. I chose the option that is both somewhat secure and somewhat easy: add the netdata user to the docker group, thus allowing netdata to use the socket and get the information it needs.

The problem is that, in order to do that, we need to define the PGID of the docker group in the docker-compose.yml file (check the image above).
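In docker-compose terms, that option is just an environment variable plus the socket mount. A sketch (the GID 999 is an example value, not the one to copy):

```yaml
services:
  netdata:
    image: netdata/netdata
    environment:
      - PGID=999   # must match the group that owns docker.sock on the host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
```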

And here lies the gotcha.

In the Netdata docs, we use /etc/group as the source of truth to find the PGID of the docker group. But if we look at the startup script of the Netdata container, it creates the docker group itself and then adds the netdata user to it.

Thus, when we read /etc/group inside the container, we read back the very value that we ourselves populated, by defining the PGID and then running the default ENTRYPOINT of the Netdata container.

It’s a self-fulfilling prophecy.
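A quick way to see the circularity: inside the container, the docker entry in /etc/group simply echoes back whatever PGID you passed in. A sketch of reading such an entry (the 992 is an illustrative value):

```shell
# Illustrative /etc/group line, as the entrypoint would create it
# after the container was started with PGID=992
entry="docker:x:992:netdata"

# The third colon-separated field is the numeric group ID
pgid=$(echo "$entry" | cut -d: -f3)
echo "$pgid"   # prints 992 -- the value we put in ourselves
```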

How does container name resolution happen in Netdata?

As we have said, Netdata needs access to the docker socket that lives at /var/run/docker.sock. This is the endpoint that the cgroup-name.sh script uses to translate IDs to names.

Thus, what we really want is for the netdata user to belong to the same group that owns that file.

Thus, the best way to tackle container name resolution is:

  1. Run the Netdata container without defining a PGID
  2. Get a shell inside the container (e.g. docker exec -it <container_name> /bin/bash)
  3. Run ls -l /var/run. The group column of docker.sock (the second number on the line, since the group is unresolved) is the PGID we are searching for
  4. Go back to our docker run command or docker-compose.yml and set that PGID
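The steps above can be sketched as follows. The real commands need a live Docker host, so they are shown as comments; the ls -l line and GID 992 below are illustrative values used to demonstrate where the number sits:

```shell
# Inside the container you would run:
#   ls -l /var/run/docker.sock
# or, to print just the numeric group ID directly:
#   stat -c '%g' /var/run/docker.sock

# Illustrative ls -l line while the group is still unresolved
# (field 4 is the group column, here a bare GID):
line="srw-rw---- 1 root 992 0 Jan 1 00:00 /var/run/docker.sock"
pgid=$(echo "$line" | awk '{print $4}')
echo "$pgid"   # prints 992 -- the value to use as PGID
```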

In the image below, the PGID has been set correctly. This means that the PGID of the docker group and the PGID of the group that owns docker.sock are the same.

Thus, ls -l will not output a bare PGID, but instead the name of the group that has that PGID, in our case docker.
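For reference, the difference looks roughly like this (illustrative output, not captured from the actual machine):

```
# before: PGID not set correctly, group shown as a bare number
srw-rw----  1 root 992    0 Jan  1 00:00 docker.sock

# after: PGID matches the docker group, so the name resolves
srw-rw----  1 root docker 0 Jan  1 00:00 docker.sock
```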

In conclusion

  1. If name resolution doesn’t work, it’s probably because the netdata user can’t access the socket.
  2. To fix this, run ls -l /var/run/ and find docker.sock. If the group is not set properly, it will show a number instead of docker. That number is the PGID. Note that alternative container solutions may have docker-compatible sockets with different names; for example, balena has balena-engine.sock. You will need to define this in the docker-compose.yml file, the custom Dockerfile, or the docker run command.
  3. Go back to docker-compose/docker run and set that PGID.
  4. Build and run the container(s) again

What do you think of Netdata + containers? What would you like us to improve?

Comment below :point_down::point_down::point_down::point_down:

So I’ve been running Netdata on my Alpine Linux Docker host and it’s been excellent, but I had the “slightly annoying” container IDs displaying instead of the names.

So I decided to try to fix this. I tried all 3 ways of configuring this from the documentation, with the following results.

My setup:
  - Alpine Linux
  - Docker
  - no special configuration

  1. Docker socket proxy
    I installed “tecnativa/docker-socket-proxy” as suggested in the documentation and this “kinda” worked, but with a flaw.
    The Netdata documentation suggests installing it on the default docker bridge network and then adding “DOCKER_HOST=proxy:2375”. The problem I found is that “proxy” (the container name for “tecnativa/docker-socket-proxy”) isn’t resolvable on the default bridge network, so the Netdata container fails because it can’t resolve it. In fact, on the default/system bridge network you cannot resolve containers by container name at all, as per this tutorial. I find the Netdata documentation slightly misleading for this reason.

If I replaced “proxy” with the internal IP address of the container, it worked. Not a permanent fix, but at least it proved things were working.

The “tecnativa/docker-socket-proxy” documentation suggests placing its container on an isolated user-defined network and connecting any containers that need it to that network. On user-defined bridge networks, containers are also resolvable by container name (as per the tutorial above). I went down this track and got the proxy container running and resolvable by name, BUT when I start the Netdata container on a user-defined bridge network, the container starts but I can’t access the web portal on port 19999. This turned out to be a strange issue: it took me an evening to discover that I had to reboot my Docker host before even my original container would work again on the default system bridge network it had been running on for months.
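For reference, the setup I was aiming for looks roughly like this in compose form (service and network names are my own placeholders, and port 2375 is the proxy’s default):

```yaml
version: '3'
services:
  proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      - CONTAINERS=1          # allow read-only container info queries
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - monitoring
  netdata:
    image: netdata/netdata
    environment:
      - DOCKER_HOST=proxy:2375   # "proxy" resolves here because this is a user-defined network
    ports:
      - 19999:19999
    networks:
      - monitoring
networks:
  monitoring:
```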

I therefore parked this option.

  2. PGID environment variable (my preferred option)
    I just couldn’t get this to work, even after finding the group ID and adding it to the docker run command:
    -e PGID=101
    in my case.

  3. Root access (volume binding to /var/run/docker.sock)
    This worked, but I didn’t leave it enabled, for the obvious security reasons.

So… my questions:
Why can I not get PGID working?

Why does running the Netdata container on a user-defined bridge network cause its web portal to fail (if I try to curl it, I get “connection refused”)?