We have a batch cluster running a large number of jobs in containers. Netdata has been filling up /var with metrics, and the config changes I've made to prevent that don't seem to be working. My netdata.conf reads:
[db]
mode = dbengine
update every = 2
storage tiers = 1
dbengine page cache size MB = 8
dbengine multihost disk space MB = 218
delete obsolete charts files = yes
However, all 500 systems are well over 218 MB of disk used. After running an rm -rf on /var/cache/netdata and waiting 24 hours, most systems are well over 500 MB again, and the worst example is at 5 GB. Am I using these configuration options correctly, or is something entered wrong?
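To rule out a typo in the file, I've also been dumping the effective configuration from the running agent, which it serves over its API. A quick sketch, assuming the default port 19999:

# the agent exposes its running configuration at /netdata.conf;
# grab the [db] section to confirm the limits actually took effect
curl -s http://localhost:19999/netdata.conf | grep -A 8 '^\[db\]'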
The logs indicate that housekeeping is happening:
[root@worker01 netdata]# more error.log
2022-11-11 04:52:31: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-11-11 04:52:31: netdata INFO : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000188.ndf".
2022-11-11 04:52:31: netdata INFO : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000188.njf".
2022-11-11 05:26:41: netdata INFO : MAIN : Deleting data file "/var/cache/netdata/dbengine/datafile-1-0000000179.ndf".
2022-11-11 05:26:41: netdata INFO : MAIN : Deleting data and journal file pair.
2022-11-11 05:26:41: netdata INFO : MAIN : Deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000000179.njf".
2022-11-11 05:26:41: netdata INFO : MAIN : Deleted data file "/var/cache/netdata/dbengine/datafile-1-0000000179.ndf".
2022-11-11 05:26:41: netdata INFO : MAIN : Reclaimed 32305152 bytes of disk space.
2022-11-11 09:59:44: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-11-11 09:59:44: netdata INFO : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000189.ndf".
2022-11-11 09:59:44: netdata INFO : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000189.njf".
However, the .db files are way over the limit: the dbengine directory itself stays near the configured cap, but netdata-meta.db does not. Do the disk limit config options not apply to the .db files?
[root@worker01 netdata]# du -sh *
48K anomaly-detection.db
4.0K context-meta.db
32K context-meta.db-shm
88K context-meta.db-wal
202M dbengine
4.7G netdata-meta.db
14M netdata-meta.db-shm
17M netdata-meta.db-wal
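To get a feel for where the metadata growth is coming from, I started poking at the sqlite file directly. This is only a sketch; the chart and dimension table names are my guess from listing the schema, so adjust if they differ:

# list the tables in the metadata db to confirm the schema
sqlite3 /var/cache/netdata/netdata-meta.db '.tables'
# count rows in the tables I assume track charts and dimensions
sqlite3 /var/cache/netdata/netdata-meta.db 'select count(*) from chart;'
sqlite3 /var/cache/netdata/netdata-meta.db 'select count(*) from dimension;'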
This is running netdata-1.36.1-1.el7.x86_64, which we recently updated to, but the same thing happened with the previous rpm, netdata-1.35.1-3.el7.x86_64.
I believe the cgroup metrics collection is the source of the growth. These workers run dozens of ephemeral containers per day, each with a unique name, so the metric cardinality grows continuously, and maybe the stale entries aren't being cleaned out very well? Just a guess…
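If that guess is right, would excluding those cgroups from collection be the correct fix? This is the kind of thing I had in mind for netdata.conf; the batchjob-* prefix is hypothetical, standing in for our real container naming scheme:

[plugin:cgroups]
enable by default cgroups matching = !*batchjob-* *

As I understand netdata's simple patterns, the leading ! excludes anything matching that pattern and the trailing * keeps everything else.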