Since we’ve updated to version 1.33.0 we see the following graph on one of our hosts and also receive alerts about this regularly:
All other hosts, configured similarly, do not show this error. Any hints what that means and how we can troubleshoot that?
Hello @jurgenhaas, this chart shows three metrics (io_errors, fs_errors, pg_cache_over_half_dirty_events). My guess is that you received the 10min_dbengine_global_fs_errors alert and that the chart is showing only the fs_errors metric. Could you confirm this? Which alerts did you receive?
Hi @Tasos_Katsoulas you’re right, this is the alert we received, and I don’t see any issues on the host to find out more about it, so I wonder where we should be looking. Or is it a false positive?
Could you please check/answer the following:
- Node’s OS, version of kernel
- Share your Netdata version and build (run netdata -W buildinfo)
- Are you using the default netdata.conf? Have you made any dbengine-related changes for this node?
- Check the netdata.dbengine_global_file_descriptors chart and see whether the current file descriptors surpass the max file descriptors when this alert is triggered. If so, follow the Database engine | Learn Netdata section in our docs
- Check the Netdata error logs (cat /var/log/netdata/error.log | grep error) for errors in the dbengine
- Does this node have child nodes streaming metrics to it? If so, please follow the Change how long Netdata stores metrics | Learn Netdata guide to verify that this node meets at least the minimum resource requirements
edited: Start with the above and we will investigate further if needed.
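The log check from the list above can be narrowed so that only dbengine-related errors are shown. This is a minimal sketch, assuming the log lives at /var/log/netdata/error.log as in the command above; dbengine_errors and count_dbengine_errors are hypothetical helper names, not part of Netdata:

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Netdata.

# Print only ERROR lines that mention the dbengine.
# The default path is the one quoted in this thread; adjust it for your install.
dbengine_errors() {
    log_file="${1:-/var/log/netdata/error.log}"
    grep 'ERROR' "$log_file" | grep -i 'dbengine'
}

# Count the matching lines (prints 0 when nothing matches).
count_dbengine_errors() {
    dbengine_errors "$1" | wc -l | tr -d ' '
}
```

Calling count_dbengine_errors without an argument on the affected node reports how many such lines the default log contains; pass a path as the first argument to inspect a different file.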
Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-166-generic x86_64)
Version: netdata v1.33.0
Configure options: '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: unknown
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: NO
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: NO
Libraries:
protobuf: YES (system)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: YES
EBPF: YES
IPMI: YES
NFACCT: YES
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
I’ve only changed the bind to setting in the web section. No changes to the dbengine settings.
The values there are constantly the same: 256 max and 50 current.
There I can find lots of these:
2022-01-31 08:34:32: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:32: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:32: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
2022-01-31 08:34:33: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:33: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:33: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
2022-01-31 08:34:34: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-01-31 08:34:34: netdata ERROR : MAIN : Failed to open file "/var/cache/netdata/dbengine/datafile-1-0000000016.ndf". (errno 2, No such file or directory)
2022-01-31 08:34:34: netdata ERROR : MAIN : Cannot delete data file "/var/cache/netdata/dbengine/datafile-1-0000000015.ndf" to reclaim space, there are no other file pairs left.
What’s interesting is that there is no /var/cache/netdata directory on this host.
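A quick way to confirm what the log lines above suggest is to test for the directory directly. This is a minimal sketch, assuming the dbengine path shown in the log; check_dbengine_dir is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Netdata.
# Reports whether the dbengine directory exists.
# The default path is taken from the log lines quoted in this thread.
check_dbengine_dir() {
    dir="${1:-/var/cache/netdata/dbengine}"
    if [ -d "$dir" ]; then
        echo "ok: $dir exists"
        return 0
    fi
    echo "missing: $dir does not exist"
    return 1
}
```

A non-zero exit status from check_dbengine_dir matches the situation described here: the agent keeps running but has nowhere to create new data and journal files.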
No, it doesn’t.
OK, if /var/cache/netdata does not exist, your Netdata agent is running only with the metrics cached in RAM. Have you tried restarting the agent? If so, does that fix the problem? See the Start, stop, or restart the Netdata Agent | Learn Netdata section in our docs.
Yes, restarting the agent did indeed re-create the cache directory, and the errors seem to have disappeared.
Maybe the agent should also re-create the cache directory while running, e.g. if a cleanup process removes that directory for maintenance?
There is always room for improvement, thank you for your input. I can’t be sure why your directory was deleted, but I don’t think that this is very common because by deleting this directory you actually lose the Agent’s database. We will monitor it.
Please be so kind as to rename the topic to Agent monitoring, 10min_dbengine_global_fs_errors alert with low number of errors
In the future, if you receive an alert and don’t know why it was triggered or what it is about, we have created a discussion space (WIP) in our community forum for all the alerts. Ask us anything
Really? A database in a cache directory? That sounds unusual. Cache directories may get wiped for maintenance in some scenarios.
Done
This is nice, thanks for the link.
That sounds unusual. Cache directories may get wiped for maintenance in some scenarios.
I might agree with you but:
- These are temporary databases, they will be deleted (oldest first) when you reach the max limit of dbengine’s allocated space.
- Most of the apps that save data into /var/cache have their own procedures to clean up the cache.
If you want to keep long-term metrics, you can always export the metrics to a database (see Exporting reference | Learn Netdata) or periodically back up the directory.
We’re starting to run in circles. I’m not interested in long-term metrics, and I’m not arguing about the location of your database.
What this is all about is that your agent fails if someone does maintenance work such as deleting cache directories. As a software developer, I would think one should make sure the cache directory exists before writing to it. The Netdata agent seems to check this during startup but not during operation. That’s all my suggestion is: either always check this, or produce an error message explaining that an agent restart is necessary.
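The suggestion above can be illustrated with a small shell sketch. It is not how the Netdata agent (written in C) works internally; it only shows the "make sure the directory exists before every write" pattern, with a fallback message when that is not possible. The function name ensure_dir_then_write and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical illustration of "check before every write"; not Netdata code.
# Usage: ensure_dir_then_write DIR FILE TEXT
ensure_dir_then_write() {
    target_dir="$1"
    file_name="$2"
    payload="$3"

    # Re-create the directory if a cleanup job removed it while we were running.
    if ! mkdir -p "$target_dir" 2>/dev/null; then
        echo "cannot create $target_dir; a restart or manual fix is needed" >&2
        return 1
    fi

    # Append the payload now that the directory is known to exist.
    printf '%s\n' "$payload" >> "$target_dir/$file_name"
}
```

The trade-off is one extra directory check per write in exchange for surviving an external cleanup; the alternative raised in the post, a clear error message asking for a restart, corresponds to the failure branch.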
I totally understand your points. We will discuss your feedback with the Agent development team. Thank you for your comments.