Netdata using lots of Memory

Problem/Question

I am running Netdata in a Docker Swarm. One node with 8 GB of RAM is using 1.6 GB just for Netdata. The other nodes are using between 150 and 400 MB as well. This is all disproportionately high for simple monitoring, and it was not always this way. What can be done to resolve it?

Environment/Browser/Agent’s version etc

Agent V1.42.1 in Docker
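
In case it helps, the per-node figures above can be cross-checked directly on each Docker host; a minimal sketch (the container names are whatever your stacks create):

# Per-container memory usage as reported by the Docker daemon
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"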

Hi @Martin_Neumann

In general, Netdata's memory usage is proportional to the number of metrics it collects, including those of its children if it is a parent.

There is a chart under the Netdata Monitoring section (called netdata.memory) that breaks down memory usage among Netdata components.

Could you take a screenshot of that and post it, please?
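
If grabbing a screenshot is awkward, the same chart can also be pulled from the agent's local API; a minimal sketch, assuming the agent listens on the default port 19999 on that node:

# Last few points of the netdata.memory chart, one value per Netdata component
curl -s 'http://localhost:19999/api/v1/data?chart=netdata.memory&after=-120&format=json&points=5'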


That Netdata container was already using 1.6 GB of RAM


And it is using more resources than Metabase and my Percona database. I want to know how to change that.


It seems that Netdata's memory usage (green) grows within a few hours of use


It seems that Netdata opens more processes than anything else

Hi Martin.

Do you have a setup where you have ephemeral containers being created and destroyed? Is it possible to test the nightly version of Netdata and check if it exhibits the same behaviour?
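
For the nightly test, the Docker images are published under the edge tag; a sketch of switching one Swarm service over (the service name <stack>_netdata-m1 is a placeholder for whatever Portainer created):

# Pull the nightly image and point the existing Swarm service at it
docker pull netdata/netdata:edge
docker service update --image netdata/netdata:edge <stack>_netdata-m1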

I have a Percona XtraBackup container that is created once a day and then destroyed. All other containers are defined in stacks in Portainer.

Hi Martin.

Is it possible to list those threads using ps? Get the PID of the main netdata binary (ps aux | grep netdata), then run ps -T -p <pid>.

Also, please post the output of ps faux. Thanks!

root@manager-de1:~# ps aux | grep netdata
201 1929123 3.8 10.8 1164784 866036 ? SNsl Oct05 412:32 /usr/sbin/netdata -u netdata -D -s /host -p 19999
201 1929178 0.0 0.0 26832 700 ? SNl Oct05 0:00 /usr/sbin/netdata --special-spawn-server
root 1929876 0.2 0.1 773880 15740 ? SNl Oct05 22:25 /usr/libexec/netdata/plugins.d/go.d.plugin 1
root 1929886 5.2 0.3 55328 29560 ? SNl Oct05 566:17 /usr/libexec/netdata/plugins.d/apps.plugin 1
root 2656968 0.0 0.0 1044 332 ? SN 10:15 0:08 /usr/libexec/netdata/plugins.d/debugfs.plugin 1
201 2947881 0.1 0.0 2492 1848 ? SN 19:56 0:03 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
root 2967819 0.0 0.0 6320 648 pts/0 R+ 20:34 0:00 grep netdata

root@manager-de1:~# ps -T -p 1929123
PID SPID TTY TIME CMD
1929123 1929123 ? 00:00:26 netdata
1929123 1929177 ? 00:00:00 DAEMON_SPAWN
1929123 1929370 ? 00:07:38 DBENGINE
1929123 1929371 ? 00:02:55 UV_WORKER[1]
1929123 1929372 ? 00:02:51 UV_WORKER[2]
1929123 1929373 ? 00:03:10 UV_WORKER[3]
1929123 1929374 ? 00:02:57 UV_WORKER[4]
1929123 1929375 ? 00:02:38 UV_WORKER[15]
1929123 1929376 ? 00:03:22 UV_WORKER[16]
1929123 1929377 ? 00:02:45 UV_WORKER[5]
1929123 1929378 ? 00:03:01 UV_WORKER[8]
1929123 1929379 ? 00:02:57 UV_WORKER[12]
1929123 1929380 ? 00:02:57 UV_WORKER[10]
1929123 1929381 ? 00:02:49 UV_WORKER[6]
1929123 1929382 ? 00:03:01 UV_WORKER[13]
1929123 1929383 ? 00:03:13 UV_WORKER[11]
1929123 1929384 ? 00:02:57 UV_WORKER[7]
1929123 1929385 ? 00:02:56 UV_WORKER[14]
1929123 1929386 ? 00:03:08 UV_WORKER[9]
1929123 1929422 ? 00:00:59 METASYNC
1929123 1929821 ? 00:04:31 ACLKSYNC
1929123 1929824 ? 00:00:08 P[timex]
1929123 1929825 ? 00:28:25 P[idlejitter]
1929123 1929826 ? 00:12:32 HEALTH
1929123 1929827 ? 00:11:33 STATS_GLOBAL
1929123 1929828 ? 00:35:07 STATS_WORKERS
1929123 1929829 ? 00:01:19 STATS_SQLITE3
1929123 1929830 ? 00:00:06 PLUGINSD
1929123 1929831 ? 00:00:03 SERVICE
1929123 1929832 ? 00:01:17 STATSD_FLUSH
1929123 1929834 ? 00:01:55 WEB[1]
1929123 1929835 ? 00:02:16 ACLK_MAIN
1929123 1929836 ? 00:07:36 RRDCONTEXT
1929123 1929837 ? 00:00:40 REPLAY[1]
1929123 1929838 ? 00:00:00 DYNCFG
1929123 1929839 ? 00:00:51 P[tc]
1929123 1929840 ? 00:03:25 P[diskspace]
1929123 1929841 ? 00:27:46 P[proc]
1929123 1929842 ? 02:06:07 P[cgroups]
1929123 1929843 ? 00:08:42 PREDICT
1929123 1929844 ? 00:05:58 TRAIN[0]
1929123 1929845 ? 00:00:00 TRAIN[1]
1929123 1929846 ? 00:00:00 TRAIN[2]
1929123 1929847 ? 00:00:00 TRAIN[3]
1929123 1929848 ? 00:00:00 DAEMON_COMMAND
1929123 1929850 ? 00:00:46 P[diskspace slo
1929123 1929851 ? 00:00:34 PD[go.d]
1929123 1929855 ? 00:30:59 PD[apps]
1929123 1929858 ? 00:02:37 PD[debugfs]
1929123 1929859 ? 00:29:30 P[proc netdev]
1929123 1929862 ? 00:04:12 ACLK_STATS
1929123 1929866 ? 00:04:46 PLUGIN[cgroups]
1929123 1929868 ? 00:00:49 STATSD_IN[1]
1929123 1930095 ? 00:00:16 ANALYTICS
1929123 1930103 ? 00:00:22 ACLK_QRY[0]
1929123 1930104 ? 00:00:22 ACLK_QRY[1]

Is that what you need?

Thanks! Yes, that’s what I needed, but those seem ok, i.e. not much more than what Netdata typically does.

Do you have a parent-child setup between the Netdata containers, or is the problematic one on its own?

I have a Docker Swarm, but every node has its own stack deploying a Netdata client, so as far as Netdata is concerned every node is on its own. Especially concerning is the one called manager-de1, which has a constant memory usage of slightly over 1 GB.

If you are interested, here is the stack:

version: "3.7"
services:
  netdata-m1:
    image: netdata/netdata:stable
    hostname: manager-de1
    #ports:
    #  - 19999:19999
    cap_add:
      - SYS_PTRACE
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    environment:
      - NETDATA_CLAIM_TOKEN=secret
      - NETDATA_CLAIM_URL=https://app.netdata.cloud
      - NETDATA_CLAIM_ROOMS=secret
    networks:
      - network_public
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == manager-de1
      labels:
        - traefik.enable=1
        - traefik.http.routers.netdata-m1.rule=Host(`my.url.net`)
        - traefik.http.routers.netdata-m1.entrypoints=websecure
        - traefik.http.routers.netdata-m1.priority=1
        - traefik.http.routers.netdata-m1.tls.certresolver=letsencryptresolver
        - traefik.http.routers.netdata-m1.service=netdata-m1
        - traefik.http.services.netdata-m1.loadbalancer.server.port=19999
        - traefik.http.services.netdata-m1.loadbalancer.passHostHeader=1
        - traefik.http.middlewares.netdata-m1-auth.basicauth.users=secret
        - traefik.http.routers.netdata-m1.middlewares=netdata-m1-auth

volumes:
  netdataconfig:
    external: true
    name: netdataconfig
  netdatalib:
    external: true
    name: netdatalib
  netdatacache:
    external: true
    name: netdatacache

networks:
  network_public:
    name: network_public
    external: true
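
As you can see, I have not set any resources limits on the service. If I wanted to cap it while we debug, I assume something like this would work on the running service (the stack prefix is a placeholder):

# Hard-cap the Netdata Swarm service at 512 MiB while investigating
docker service update --limit-memory 512M <stack>_netdata-m1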


Here is the memory usage of the Netdata processes in a 30-day view. It seems that a couple of weeks ago it even reached 4 GB.


The process count of this instance remained constant at 58 threads, though, unlike the other instance I posted above, which was going crazy opening processes.

Hi Martin.

Could you also please check the charts under Netdata Monitoring (especially netdata.memory), to see exactly which part of Netdata consumes the memory?

Also, if possible, could you check whether the newer 1.43.0 behaves the same?

Sorry, I was busy these past few days. I just updated to the latest stable, v1.43.2. It reduced Netdata's memory consumption from 820 MB to 350 MB. That is better, but still a bit high. The dbengine memory occupies about 76 MB in total. And here is the netdata.memory graph before and after the update:

Under Applications - Memory it still shows about 420 MB for Netdata. Interestingly, under netdata.memory it shows about 155 MB, and these values have been quite stable for the last 7 days. I am not sure why there is a discrepancy between the two, but that is what it shows.
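
Part of the gap is probably that the Applications chart counts the RSS of every Netdata process (apps.plugin, go.d.plugin, etc.), while netdata.memory only covers what the daemon accounts for internally. One way to compare the two at the same moment would be something like this (the PID lookup and port are assumptions for this node):

# OS view: resident set size of the main netdata daemon, in MB
ps -o rss= -p "$(pgrep -o -f '/usr/sbin/netdata')" | awk '{printf "%.0f MB RSS\n", $1/1024}'

# Agent view: the memory the daemon attributes to its own components
curl -s 'http://localhost:19999/api/v1/data?chart=netdata.memory&after=-10&format=json&points=1'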