Agent monitoring, 10min_dbengine_global_fs_errors alert with low number of errors

Since we’ve updated to version 1.33.0 we see the following graph on one of our hosts and also receive alerts about this regularly:

[Screenshot: 2022-01-28_08-36]

All other hosts, configured similarly, do not show this error. Any hints what that means and how we can troubleshoot that?

Hello @jurgenhaas, this chart shows three metrics: io_errors, fs_errors, and pg_cache_over_half_dirty_events. My guess is that you received the 10min_dbengine_global_fs_errors alert and that the chart is displaying only the fs_errors metric. Could you confirm this? Which alerts did you receive?

Hi @Tasos_Katsoulas you’re right, this is the alert we received. I don’t see any issues on the host that would explain it, so I wonder where we should be looking. Or is it a false positive?

Could you please check/answer the following:

  • The node’s OS and kernel version
  • Your Netdata version and build (run netdata -W buildinfo)
  • Are you using the default netdata.conf, or have you made any dbengine-related changes for this node?
  • Check the netdata.dbengine_global_file_descriptors chart and see whether the current file descriptors exceed the max file descriptors when this alert is triggered. If so, follow the Database engine | Learn Netdata section in our docs
  • Check the Netdata error log for dbengine errors: cat /var/log/netdata/error.log | grep error
  • Does this node have child nodes streaming metrics to it? If so, please follow the Change how long Netdata stores metrics | Learn Netdata guide to verify that this node meets at least the minimum resource requirements

edited: Start with the above and we will investigate further if needed.

Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-166-generic x86_64)

Version: netdata v1.33.0
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: unknown
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

I’ve only changed the bind to setting in the web section. No changes to the dbengine settings.

The values there are constantly the same: 256 max and 50 current.

There I can find lots of these:

2022-01-31 08:34:32: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:32: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:32: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
2022-01-31 08:34:33: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:33: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:33: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
2022-01-31 08:34:34: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:34: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:34: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
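
Because the same two errors repeat every second, it can help to collapse the log into distinct messages with counts. The following is a minimal sketch, assuming the line layout seen in the excerpt above (timestamp, "netdata", level, thread, message); the function name summarize_errors and the regular expression are illustrative and not part of Netdata.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
#   <timestamp>: netdata <LEVEL> : <THREAD> : <message>
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata (?P<level>\w+)\s*: '
    r'(?P<thread>\S+) : (?P<msg>.*)$'
)

def summarize_errors(lines):
    """Count distinct ERROR messages, ignoring their timestamps."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("msg")] += 1
    return counts

# Six lines copied from the log excerpt above.
SAMPLE = '''
2022-01-31 08:34:32: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:32: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:32: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
2022-01-31 08:34:33: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:33: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:33: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
'''

summary = summarize_errors(SAMPLE.splitlines())
for message, count in summary.most_common():
    print(count, message)
```

To run it against a real log, the lines could be read from /var/log/netdata/error.log instead of SAMPLE.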

What’s interesting is that there is no directory /var/cache/netdata on this host.

No, it doesn’t.

OK, if /var/cache/netdata does not exist, your Netdata agent is running only with the metrics cached in RAM. Have you tried restarting the agent? If so, does that fix the problem? See the Start, stop, or restart the Netdata Agent | Learn Netdata section in our docs.

Yes, restarting the agent did indeed re-create the cache directory, and the errors seem to have disappeared.

Maybe the agent should also recreate the cache directory while running, e.g. if a cleanup process removes that directory for maintenance?

There is always room for improvement, thank you for your input. I can’t be sure why your directory was deleted, but I don’t think that this is very common because by deleting this directory you actually lose the Agent’s database. We will monitor it.

Please be so kind as to rename the topic to Agent monitoring, 10min_dbengine_global_fs_errors alert with low number of errors

In the future, if you receive an alert and don’t know why it was triggered or what it is about, we have created a discussion space (WIP) in our community forum for all the alerts. Ask us anything :)

Really? A database in a cache directory? That sounds unusual. Cache directories may get wiped for maintenance in some scenarios.

Done

This is nice, thanks for the link.

That sounds unusual. Cache directories may get wiped for maintenance in some scenarios.

I might agree with you but:

  1. These are temporary databases; they will be deleted (oldest first) when you reach the max limit of dbengine’s allocated space.
  2. Most of the apps that save data into /var/cache have their own procedures to clean up the cache.
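
The oldest-first cleanup in point 1 can be illustrated with a small sketch. This is not the dbengine’s actual code: the DataFile class, the reclaim_space function, the file names, and the sizes are all hypothetical. The guard that always keeps the newest file loosely mirrors the "there are no other file pairs left" message in the error log earlier in this thread.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    name: str     # hypothetical file name, for illustration only
    size_mb: int  # hypothetical size in MiB

def reclaim_space(files, max_mb):
    """Drop the oldest files until the total fits in max_mb.

    `files` is assumed to be ordered oldest to newest. The newest file
    is never removed, even if it alone exceeds the limit.
    """
    kept = list(files)
    removed = []
    while len(kept) > 1 and sum(f.size_mb for f in kept) > max_mb:
        removed.append(kept.pop(0))  # oldest first
    return kept, removed

files = [DataFile("old", 100), DataFile("mid", 100), DataFile("new", 100)]
kept, removed = reclaim_space(files, max_mb=150)
print("kept:", [f.name for f in kept])
print("removed:", [f.name for f in removed])

# With a single remaining file there is nothing left to delete.
only_kept, only_removed = reclaim_space([DataFile("new", 500)], max_mb=100)
```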

If you want to keep long-term metrics, you can always export them to a database (see Exporting reference | Learn Netdata) or periodically back up the directory.

We are starting to run in circles. I’m not interested in long-term metrics - and I’m not arguing about the location of your database.

What this was all about is that your agent fails if someone does maintenance work such as deleting cache directories. As a software developer, I would think one should make sure the cache directory exists before writing to it. The Netdata agent seems to check this during startup but not during operation. That’s all my suggestion is: either always check this, or produce an error message explaining that an agent restart is necessary.

I totally understand your points. We will discuss your feedback with the Agent development team. Thank you for your comments.
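
The suggestion above can be sketched as follows. This is an illustration of the general pattern only, written in Python, and makes no claim about how the Netdata agent is or should be implemented; the function open_for_write, the example.dat file name, and the error wording are hypothetical.

```python
import os
import shutil
import tempfile

def open_for_write(path):
    """Open `path` for writing, recreating its parent directory if needed.

    If the directory cannot be recreated, raise an error that tells the
    operator what to do next instead of a bare "No such file or directory".
    """
    parent = os.path.dirname(path)
    if parent:
        try:
            os.makedirs(parent, exist_ok=True)
        except OSError as exc:
            raise RuntimeError(
                f"cannot create {parent!r}: {exc}; a restart may be required"
            ) from exc
    return open(path, "wb")

# Demonstration in a throwaway temporary directory.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "cache", "dbengine", "example.dat")
    with open_for_write(target) as fh:
        fh.write(b"first")
    shutil.rmtree(os.path.join(root, "cache"))  # simulate a cleanup job
    with open_for_write(target) as fh:          # directory is recreated here
        fh.write(b"second")
    recreated = os.path.isfile(target)

print("directory recreated after cleanup:", recreated)
```

Checking on every write costs one extra system call per open; whether that trade-off is acceptable for a database engine is a design decision for the developers.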