dbengine not respecting configured disk usage limits

We have a batch cluster running a large number of jobs in containers. This has resulted in netdata filling up /var with metrics, and I’ve been trying to update the configs to prevent that, but it doesn’t seem to be working. My netdata.conf reads:

[db]
  mode = dbengine
  update every = 2
  storage tiers = 1
  dbengine page cache size MB = 8
  dbengine multihost disk space MB = 218
  delete obsolete charts files = yes

However, all 500 systems are well over 218MB of disk used. After running an rm -rf on /var/cache/netdata and waiting 24 hours, most systems are already well over 500MB again and the worst example is at 5GB. Am I using these configuration options properly? Or is something entered wrong?
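
For reference, the values the running agent actually loaded can be dumped from its local API (assuming the default port 19999 and that the built-in web server hasn’t been disabled):

[root@worker01 netdata]# curl -s http://localhost:19999/netdata.conf | grep -A 6 '^\[db\]'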

The logs indicate housecleaning is happening:

[root@worker01 netdata]# more error.log
2022-11-11 04:52:31: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-11-11 04:52:31: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000188.ndf".
2022-11-11 04:52:31: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000188.njf".
2022-11-11 05:26:41: netdata INFO  : MAIN : Deleting data file "/var/cache/netdata/dbengine/datafile-1-0000000179.ndf".
2022-11-11 05:26:41: netdata INFO  : MAIN : Deleting data and journal file pair.
2022-11-11 05:26:41: netdata INFO  : MAIN : Deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000000179.njf".
2022-11-11 05:26:41: netdata INFO  : MAIN : Deleted data file "/var/cache/netdata/dbengine/datafile-1-0000000179.ndf".
2022-11-11 05:26:41: netdata INFO  : MAIN : Reclaimed 32305152 bytes of disk space.
2022-11-11 09:59:44: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-11-11 09:59:44: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000189.ndf".
2022-11-11 09:59:44: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000189.njf".

However, the meta.db files are way over the limit. Do the disk limit config options not apply to the .db files?

[root@worker01 netdata]# du -sh *
48K     anomaly-detection.db
4.0K    context-meta.db
32K     context-meta.db-shm
88K     context-meta.db-wal
202M    dbengine
4.7G    netdata-meta.db
14M     netdata-meta.db-shm
17M     netdata-meta.db-wal
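
If it helps with diagnosis, the per-table breakdown inside netdata-meta.db can be pulled with the sqlite3 CLI, assuming it is installed and compiled with the dbstat virtual table (working on a copy so the live database isn’t touched):

[root@worker01 netdata]# cp netdata-meta.db /tmp/meta-copy.db
[root@worker01 netdata]# sqlite3 /tmp/meta-copy.db "SELECT name, SUM(pgsize)/1024/1024 AS mb FROM dbstat GROUP BY name ORDER BY mb DESC LIMIT 10;"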

This is running netdata-1.36.1-1.el7.x86_64, which we recently updated to, but the same thing happened with the previous rpm, netdata-1.35.1-3.el7.x86_64.

I believe the cgroup metrics collection is the source of the growth. These workers run dozens of ephemeral containers per day, each with a unique name, so the metric cardinality is always growing and maybe the stale entries aren’t being cleaned out very well? Just a guess…
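
If the cgroup cardinality really is the issue, one workaround I’m considering is excluding those ephemeral cgroups from collection altogether. A sketch of what I have in mind in netdata.conf, where "*batchjob-*" is only a placeholder for our container naming scheme:

[plugin:cgroups]
  # sketch only: skip our ephemeral batch-job cgroups by name, keep everything else
  enable by default cgroups matching = !*batchjob-* *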

Hi @sether

Thanks for the report.

Yes, it appears the cause of the excess size in /var/cache/netdata is netdata-meta.db, which is our SQLite metadata database. The dbengine multihost disk space MB option only limits the files under /var/cache/netdata/dbengine, and that part appears to be behaving correctly.
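
You can see the split with something like this; only the first path is governed by that option:

[root@worker01 netdata]# du -sh /var/cache/netdata/dbengine /var/cache/netdata/netdata-meta.db*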

The ephemeral containers you describe could indeed be the source of the problem. Is it possible to install and test a recent nightly version on at least one of the nodes that exhibit the problem? There have been some changes to the metadata database that could help with this.
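
For reference, one way to move a node to the nightly channel is the kickstart script; this is only a sketch, and the exact flags may differ depending on how you manage packages:

[root@worker01 ~]# wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh
[root@worker01 ~]# sh /tmp/netdata-kickstart.sh --nightly-channel --non-interactive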

Thanks!

Awesome, thanks for the suggestion. Installing the latest nightly (netdata-1.36.0.347.nightly-1) brought the size of netdata-meta.db down significantly:

[root@worker01 netdata]# du -sh *
3.7M    anomaly-detection.db
4.0K    context-meta.db
32K     context-meta.db-shm
164K    context-meta.db-wal
214M    dbengine
361M    netdata-meta.db
32K     netdata-meta.db-shm
17M     netdata-meta.db-wal

I’d be happier if there were a disk usage limit for the sqlite database to ensure it can never fill up the disk, but as long as this file doesn’t grow continually it’ll be okay. It has held steady at 361M for several hours now, so I’m fine with it like this.
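
In the meantime I might add a simple belt-and-braces check on our side; a hypothetical cron entry like the one below would just log a warning if the file grows past a threshold (the path assumes the default cache directory, and the 1000MB limit is an arbitrary placeholder):

# /etc/cron.d/netdata-meta-watch -- sketch only; threshold and tag are placeholders
0 * * * * root [ "$(du -m /var/cache/netdata/netdata-meta.db | cut -f1)" -gt 1000 ] && logger -t netdata-meta-watch "netdata-meta.db exceeds 1000MB"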

thanks again