Suddenly receiving critical "10min_dbengine_global_fs_errors" alerts across multiple CentOS 8 Stream KVM nodes

Hi all,

I run several CentOS 8 Stream KVM servers with a variety of production VMs on them. I run netdata on the KVM nodes themselves. Earlier this afternoon, the netdata instances on all of my KVM nodes began to raise critical “10min_dbengine_global_fs_errors” alerts.

I checked the netdata error.log on each KVM node and found that each one is flooded with this error message:

2023-06-16 18:55:33: netdata ERROR : LIBUV_WORKER : DBENGINE: error while reading extent from datafile 4037 of tier 0, at offset 2809856 (43348 bytes) to extract page (PD) from 1686945705 (2023-06-16 13:01:45) to 1686946728 (2023-06-16 13:18:48) of metric 8dbcd724-ac64-48f5-a556-de6b5542388a: header is INVALID (errno 22, Invalid argument)
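To get a rough sense of whether the errors were concentrated in a few datafiles or spread across the whole dbengine, I also counted them per datafile with something like this (assuming the default /var/log/netdata log location):

grep 'error while reading extent' /var/log/netdata/error.log | grep -o 'datafile [0-9]*' | sort | uniq -c | sort -rn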

I tried restarting netdata, but it didn’t help. I then upgraded netdata to the latest version, but that didn’t help either.

I thought the errors might be related to hitting some kind of max open file limit, but that doesn’t seem to be the cause either:

[root@sea4 ~]# cat /proc/sys/fs/file-nr
2960 0 6514653
[root@sea4 ~]#
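I also compared the netdata daemon's own open-files limit against the number of descriptors it actually has open, roughly like this (I'm assuming pgrep -o -x netdata picks out the main daemon process):

grep 'Max open files' /proc/$(pgrep -o -x netdata)/limits
ls /proc/$(pgrep -o -x netdata)/fd | wc -l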

I’m not an expert on filesystem limits, though, so it’s possible I’m not using the right commands to check them…

Could anyone recommend some next steps to troubleshoot and identify the root cause of this issue? It’s a little concerning that this started happening to all of my production servers at once and in the absence of any upgrades, deployments, etc.

Thank you!

Hi @EB68

What version of netdata are you running? Is this a new install or an older one, and have there been any recent updates?
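If you’re not sure, something like this should print the exact version string and, for kickstart-based installs, the configured release channel (the .environment path is my assumption based on a default kickstart install):

netdata -v
grep RELEASE_CHANNEL /etc/netdata/.environment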

This ended up being a bug in the Netdata nightly/edge build released on June 15th or 16th.

The issue cleared itself up uniformly across all of my servers after Netdata processed an automatic upgrade.

Not sure what it was, but it definitely appears to have been something internal to Netdata and not a filesystem issue.
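For anyone else who hits this and wants to step off the nightly channel, I believe the kickstart script can reinstall onto the stable channel with something like the following; the flags and URL are from memory, so please check the current install docs before running it:

wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh
sh /tmp/netdata-kickstart.sh --stable-channel --reinstall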