Hi all,
I run several CentOS 8 Stream KVM servers hosting a variety of production VMs, with netdata running on the KVM nodes themselves. Earlier this afternoon, the netdata instances on all of my KVM nodes began raising critical “10min_dbengine_global_fs_errors” alerts.
I checked the netdata error.log on each KVM node and found that each one is flooded with this error message:
2023-06-16 18:55:33: netdata ERROR : LIBUV_WORKER : DBENGINE: error while reading extent from datafile 4037 of tier 0, at offset 2809856 (43348 bytes) to extract page (PD) from 1686945705 (2023-06-16 13:01:45) to 1686946728 (2023-06-16 13:18:48) of metric 8dbcd724-ac64-48f5-a556-de6b5542388a: header is INVALID (errno 22, Invalid argument)
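In case it helps anyone taking a look, this is how I was planning to inspect the datafile the error refers to. I’m assuming a stock install here, where the tier 0 dbengine files sit under /var/cache/netdata/dbengine:

# default dbengine cache dir on a stock install; look for the datafile/journalfile numbered 4037
ls -lh /var/cache/netdata/dbengine/ | grep 4037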
Restarting netdata didn’t help, and neither did upgrading it to the latest version.
I wondered whether the errors might be caused by hitting some kind of max-open-files limit, but that doesn’t seem to be the case either:
[root@sea4 ~]# cat /proc/sys/fs/file-nr
2960 0 6514653
[root@sea4 ~]#
I’m not an expert on these limits, though, so it’s possible I’m not using the right command to check them…
Could anyone recommend some next steps to troubleshoot and identify the root cause? It’s a little concerning that this started happening on all of my production servers at once, without any upgrades, deployments, or other changes on my end.
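In case it’s useful, this is the per-process check I was going to try next, since file-nr only reports the system-wide counters (allocated / unused / max). It assumes pidof -s netdata resolves to the main netdata daemon rather than one of its plugin children:

# per-process soft/hard limit on open files for the netdata daemon
grep "Max open files" /proc/$(pidof -s netdata)/limits
# number of file descriptors the daemon currently has open
ls /proc/$(pidof -s netdata)/fd | wc -l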
Thank you!