Netdata does not respect disk space limits

I have a local Netdata 1.39.0 instance collecting data from several servers.
It is running on Arch Linux on ZFS.

Here is the running [db] configuration:

[db]
        mode = dbengine
        update every = 4
        storage tiers = 3
        dbengine multihost disk space MB = 100000
        dbengine tier 1 multihost disk space MB = 60000
        dbengine tier 1 update every iterations = 15
        dbengine tier 2 multihost disk space MB = 20000
        dbengine tier 2 update every iterations = 60
        dbengine use direct io = no
        dbengine parallel initialization = yes

And the actual disk usage is as follows:

$ du -sm /var/cache/netdata/*
1       /var/cache/netdata/context-meta.db
1       /var/cache/netdata/context-meta.db-shm
1       /var/cache/netdata/context-meta.db-wal
75013   /var/cache/netdata/dbengine
37268   /var/cache/netdata/dbengine-tier1
3810    /var/cache/netdata/dbengine-tier2
37      /var/cache/netdata/ml.db
94      /var/cache/netdata/netdata-meta.db
1       /var/cache/netdata/netdata-meta.db-shm
9       /var/cache/netdata/netdata-meta.db-wal

The disk space limits are far from being reached, but Netdata is already deleting older files:

2023-05-17 09:25:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleting data file '/var/cache/netdata/dbengine/datafile-1-0000001338.ndf'.
2023-05-17 09:25:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000001338.njf".
2023-05-17 09:25:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000001338.njfv2".
2023-05-17 09:25:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted data file "/var/cache/netdata/dbengine/datafile-1-0000001338.ndf".
2023-05-17 10:30:48: netdata INFO  : LIBUV_WORKER : DBENGINE: deleting data file '/var/cache/netdata/dbengine/datafile-1-0000001339.ndf'.
2023-05-17 10:30:48: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000001339.njf".
2023-05-17 10:30:48: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000001339.njfv2".
2023-05-17 10:30:48: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted data file "/var/cache/netdata/dbengine/datafile-1-0000001339.ndf".
2023-05-17 05:39:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleting data file '/var/cache/netdata/dbengine-tier1/datafile-1-0000000134.ndf'.
2023-05-17 05:39:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine-tier1/journalfile-1-0000000134.njf".
2023-05-17 05:39:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine-tier1/journalfile-1-0000000134.njfv2".
2023-05-17 05:39:06: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted data file "/var/cache/netdata/dbengine-tier1/datafile-1-0000000134.ndf".
2023-05-17 09:11:05: netdata INFO  : LIBUV_WORKER : DBENGINE: deleting data file '/var/cache/netdata/dbengine-tier1/datafile-1-0000000135.ndf'.
2023-05-17 09:11:05: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine-tier1/journalfile-1-0000000135.njf".
2023-05-17 09:11:05: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted journal file "/var/cache/netdata/dbengine-tier1/journalfile-1-0000000135.njfv2".
2023-05-17 09:11:05: netdata INFO  : LIBUV_WORKER : DBENGINE: deleted data file "/var/cache/netdata/dbengine-tier1/datafile-1-0000000135.ndf".

As you can see, Netdata is not honoring the configured disk space limits.

Is this some kind of bug, or am I doing something wrong?

===

netdata -W buildinfo
Version: netdata v1.39.0
Configure options:  '--prefix=/usr' '--sbindir=/usr/bin' '--sysconfdir=/etc' '--libexecdir=/usr/lib' '--localstatedir=/var' '--with-zlib' '--with-math' '--with-user=netdata' 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -g -ffile-prefix-map=/build/netdata/src=/usr/src/debug/netdata -flto=auto' 'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto' 'CXXFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/netdata/src=/usr/src/debug/netdata -flto=auto'
Install type: custom
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  YES
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    NO
    IPMI:                    NO
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 YES
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

While my account was on hold, I found the root cause.
The “issue” is how Netdata calculates used disk space.

It counts apparent file sizes, which does not work as expected on a filesystem with compression (I have ZFS with zstd enabled).
As an example, compare these two outputs for the same folders:

$ du -sh /var/cache/netdata/dbengine*
68G     /var/cache/netdata/dbengine
43G     /var/cache/netdata/dbengine-tier1
4.0G    /var/cache/netdata/dbengine-tier2
$ du -bh /var/cache/netdata/dbengine*
90G     /var/cache/netdata/dbengine
68G     /var/cache/netdata/dbengine-tier1
17G     /var/cache/netdata/dbengine-tier2
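
For a per-file view of the same difference, GNU stat can print both numbers at once. Something like the following works (the glob and the head are just there to sample a few datafiles):

$ stat -c '%n: apparent=%s B, allocated=%b blocks of %B B' /var/cache/netdata/dbengine/datafile-*.ndf | head -n 3

On this ZFS dataset with zstd, the allocated figure comes out well below the apparent one, consistent with the du comparison above; the limits Netdata enforces appear to track the apparent number.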

Which behavior is correct is debatable; I would prefer to limit the real space used on storage, as “df” reports it.

Based on internal discussion, we’ve determined that for the time being, this is considered to be working as intended because:

  • Because of how filesystems with transparent compression on UNIX-like systems (at minimum ZFS, BTRFS, and F2FS) handle it, this behavior still provides a relatively strict upper bound on actual disk usage. In other words, Netdata may use less physical space than the configured limit, but it should never use more, so the constraints on disk usage are still being met (see the sketch after this list).
  • Tracking physical space usage is more complicated than it sounds (it runs into some interesting interactions with preallocated extents on some Linux filesystems, for example), and it is also fairly resource-intensive compared to what we are doing currently.
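
A rough way to sanity-check that upper-bound argument with GNU du (a sketch only; the limit below is the tier 0 "dbengine multihost disk space MB" value from the configuration above, and du -m reports MiB while the setting says MB, which is close enough for this purpose):

$ limit_mb=100000                                                              # from netdata.conf [db]
$ physical_mb=$(du -sm /var/cache/netdata/dbengine | cut -f1)                  # space actually allocated on disk
$ apparent_mb=$(du -sm --apparent-size /var/cache/netdata/dbengine | cut -f1)  # what Netdata appears to count
$ echo "tier 0: physical=${physical_mb} <= apparent=${apparent_mb} <= limit=${limit_mb}"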

We do, however, recognize that we need to clearly document this behavior so that people know about it without having to dig into the source code or inspect on-disk data.