10min_dbengine_global_io_errors

Netdata | DB engine

The Database Engine works like a traditional database. It dedicates a certain amount of RAM to data caching and indexing, while the rest of the data resides compressed on disk. Unlike other memory modes, the amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression ratio, not a fixed number of metrics collected.

By using both RAM and disk space, the database engine allows for long-term storage of per-second metrics inside of the Agent itself.

The Netdata Agent monitors the number of dbengine I/O errors over the last 10 minutes. When this alert fires, the dbengine is experiencing I/O errors (CRC errors, out of space, bad disk, etc.).

This alert is raised to the critical state when the number of I/O errors is greater than 0.

See more about DB engine

Hey,

I keep getting this error (and the following one, 10min_dbengine_global_fs_errors, as well) on my Proxmox node. I have an LXC container on it where Netdata is running and feeding Netdata Cloud, and Netdata is also installed on the node itself, feeding Netdata Cloud as well.

Often, when this error occurs, the machine then crashes and becomes unreachable.

I initially thought it could be an issue with my NVMe SSD, but that doesn’t seem to be the case (it’s new, and smartctl shows no errors).

Would you be able to help me and point me in the right direction on how to fix that error?

hmm @Manolis_Vasilakis any ideas on this one?

Thanks for allowing my post.
It seems like it happened again today. I’m very confused as to what could be triggering this error (and the subsequent crash). I’ve now turned off pretty much everything on that server (I still have one LXC container and Netdata running), so it’s pretty lightly loaded now.

Hmm, not sure at first.

Are there any other similar errors, or just that one? It could be that dbengine is detecting some underlying issue with the filesystem (it could be just the filesystem and not the hardware itself). Do you know which filesystem it uses?

Is it possible to screenshot the netdata.dbengine_global_errors chart? For some time before this alert happens?

Yes, I mostly have these 2 types of errors. I do sometimes get 10min_disk_backlog warnings or alerts.

When you say the filesystem may have an issue, how do you think I could fix it? This system is my homelab, so I could reformat/reinstall it all if that would help with this issue.

Let me know if there’s anything else I can provide to help with the investigation. Sincerely appreciate your help.

Critical, Netdata DBengine IO errors = 7 errors
Alert: 10min_dbengine_global_io_errors
Chart: netdata.dbengine_global_errors
Raised to critical, for 0 second

On Mon Oct 9 13:14:36 CEST 2023
By: hostname
Global time: Mon Oct 9 11:14:36 UTC 2023

Classification: Errors
Role: sysadmin

Some things to check would likely be system logs (journalctl) or maybe smart utilities for the disks to see if something is reported as wrong in general.

I’m getting this notification pretty much daily - with between 1 and 3 errors. A few minutes later it recovers, so I’m not sure if this is an issue or not… Any guidance?

I removed every trace of Netdata and installed it again (not the stable release this time), and it works.

Also, @draki is it possible please to share your error.log with manolis@netdata.cloud ?

Could you check for the existence of, e.g., DBENGINE log lines with errors in them?
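A check like this can be scripted. The sketch below assumes the error.log line layout that appears in the log excerpt in this thread (level ERROR followed by a DBENGINE: message); the function names are made up for illustration. Note that a plain lowercase `grep dbengine` also matches INFO lines, because the file paths contain "dbengine".

```shell
#!/bin/sh
# Sketch: filter a Netdata error.log (read from stdin) for dbengine problems.
# The patterns assume the log line layout shown in this thread.

# Print only ERROR-level lines emitted by DBENGINE.
dbengine_errors() {
    grep -E 'ERROR +:.*DBENGINE:'
}

# Print the unique file paths that failed to unlink.
dbengine_unlink_failures() {
    sed -n 's/.*uv_fs_fsunlink(\([^)]*\)).*/\1/p' | sort -u
}

# Example usage (run in the directory that holds error.log):
#   dbengine_errors < error.log
#   dbengine_errors < error.log | dbengine_unlink_failures
```

The same filters can be pointed at rotated logs such as error.log.1 to see whether the errors recur on other days.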

@draki thanks for the report.

We’re still not clear on what could cause it. It could be some internal dbengine problem that is incorrectly being identified as a filesystem or I/O error.

In the meantime, you could create a custom alert from it and increase the threshold for raising the alert from 0 to e.g. 10.

Found the error.log

cat error.log | grep dbengine
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004211.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004211.njfv2' loaded, size: 0.59 MiB, metrics: 3.62 k, mmap: 0.04 ms, validate: 0.02 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004212.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004212.ndf" initialized (size:5378048).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004212.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004212.njfv2' loaded, size: 0.61 MiB, metrics: 3.71 k, mmap: 0.03 ms, validate: 0.03 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004213.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004213.ndf" initialized (size:5373952).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004213.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004213.njfv2' loaded, size: 0.61 MiB, metrics: 3.65 k, mmap: 0.03 ms, validate: 0.02 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004214.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004214.ndf" initialized (size:5378048).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004214.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004214.njfv2' loaded, size: 0.60 MiB, metrics: 3.64 k, mmap: 0.04 ms, validate: 0.02 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004215.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004215.ndf" initialized (size:5378048).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004215.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004215.njfv2' loaded, size: 0.59 MiB, metrics: 3.70 k, mmap: 0.04 ms, validate: 0.02 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004216.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004216.ndf" initialized (size:5394432).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004216.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004216.njfv2' loaded, size: 0.61 MiB, metrics: 3.63 k, mmap: 0.04 ms, validate: 0.03 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004217.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004217.ndf" initialized (size:5369856).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004217.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004217.njfv2' loaded, size: 0.60 MiB, metrics: 3.65 k, mmap: 0.04 ms, validate: 0.03 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004218.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004218.ndf" initialized (size:5394432).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: checking integrity of '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004218.njfv2'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal v2 '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004218.njfv2' loaded, size: 0.61 MiB, metrics: 3.69 k, mmap: 0.04 ms, validate: 0.02 ms
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: initializing data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004219.ndf".
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004219.ndf" initialized (size:3174400).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: loading journal file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004219.njf'
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[2] : DBENGINE: journal file '/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000066.njf' loaded (size:1024000).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[1] : DBENGINE: journal file '/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000683.njf' loaded (size:536576).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: journal file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004219.njf' loaded (size:335872).
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[2] : DBENGINE: indexing file '/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000066.njfv2': extents 241, metrics 3963, pages 15397
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: indexing file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004219.njfv2': extents 81, metrics 3542, pages 5184
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[1] : DBENGINE: indexing file '/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000683.njfv2': extents 130, metrics 3549, pages 8320
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[0] : DBENGINE: migrated journal file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004219.njfv2', file size 402484
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[2] : DBENGINE: migrated journal file '/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000066.njfv2', file size 951276
2023-10-20 07:15:56: netdata INFO  : DBENGINIT[1] : DBENGINE: migrated journal file '/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000683.njfv2', file size 566592
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[0] : DBENGINE: creating new data and journal files in path /opt/netdata/var/cache/netdata/dbengine
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[0] : DBENGINE: created data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004220.ndf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[0] : DBENGINE: created journal file "/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004220.njf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: creating new data and journal files in path /opt/netdata/var/cache/netdata/dbengine-tier1
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: created data file "/opt/netdata/var/cache/netdata/dbengine-tier1/datafile-1-0000000684.ndf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: created journal file "/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000684.njf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: deleting data file '/opt/netdata/var/cache/netdata/dbengine-tier1/datafile-1-0000000661.ndf'.
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000661.njf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine-tier1/journalfile-1-0000000661.njfv2".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[1] : DBENGINE: deleted data file "/opt/netdata/var/cache/netdata/dbengine-tier1/datafile-1-0000000661.ndf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: creating new data and journal files in path /opt/netdata/var/cache/netdata/dbengine-tier2
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: created data file "/opt/netdata/var/cache/netdata/dbengine-tier2/datafile-1-0000000067.ndf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: created journal file "/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000067.njf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: deleting data file '/opt/netdata/var/cache/netdata/dbengine-tier2/datafile-1-0000000054.ndf'.
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000054.njf".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine-tier2/journalfile-1-0000000054.njfv2".
2023-10-20 07:15:57: netdata INFO  : DBENGINIT[2] : DBENGINE: deleted data file "/opt/netdata/var/cache/netdata/dbengine-tier2/datafile-1-0000000054.ndf".
2023-10-20 07:15:57: netdata INFO  : MAIN : Host 'raspb-pi-grafana' (at registry as 'raspb-pi-grafana') with guid '74437858-fc59-11ed-8056-e45f01c2d7e2' initialized, os 'linux', timezone 'Europe/London', tags '', program_name 'netdata', program_version 'v1.43.0-56-gfc3251619', update every 1, memory mode dbengine, history entries 0, streaming disabled (to '' with api key ''), health enabled, cache_dir '/opt/netdata/var/cache/netdata', alarms default handler '', alarms default recipient ''
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[20] : DBENGINE: creating new data and journal files in path /opt/netdata/var/cache/netdata/dbengine
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[20] : DBENGINE: created data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004221.ndf".
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[20] : DBENGINE: created journal file "/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004221.njf".
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[9] : DBENGINE: indexing file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004220.njfv2': extents 154, metrics 3394, pages 9856
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[21] : DBENGINE: deleting data file '/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004180.ndf'.
2023-10-20 08:05:44: netdata ERROR : UV_WORKER[21] : DBENGINE: uv_fs_fsunlink(/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004180.njf): no such file or directory (errno 2, No such file or directory)
2023-10-20 08:05:44: netdata ERROR : UV_WORKER[21] : DBENGINE: uv_fs_fsunlink(/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004180.njf): no such file or directory (errno 2, No such file or directory)
2023-10-20 08:05:44: netdata ERROR : UV_WORKER[21] : DBENGINE: uv_fs_fsunlink(/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004180.ndf): no such file or directory (errno 2, No such file or directory)
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[9] : DBENGINE: migrated journal file '/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004220.njfv2', file size 641268
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[1] : DBENGINE: deleting data file '/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004181.ndf'.
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[1] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004181.njf".
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[1] : DBENGINE: deleted journal file "/opt/netdata/var/cache/netdata/dbengine/journalfile-1-0000004181.njfv2".
2023-10-20 08:05:44: netdata INFO  : UV_WORKER[1] : DBENGINE: deleted data file "/opt/netdata/var/cache/netdata/dbengine/datafile-1-0000004181.ndf".

Hi @Manolis_Vasilakis - thanks for the reply - I’ll create a custom alert…
re: the logs - oddly I can’t find a directory /netdata under /var/log/ - I installed using the install script - do I need to enable logging somewhere? (I haven’t dug into the detail of configuring netdata as it ‘just works’ :slight_smile: )

Hi @draki thanks for the logs.

So, it appears that it’s trying to delete journalfile-1-0000004180.njf and datafile-1-0000004180.ndf, but they are already deleted… Not sure why, but these failures would eventually raise those alerts (although these ones would be fs alerts, not io).

One way for this to occur is if another instance of the agent is running. Could you check that this is not the case (e.g. by checking with ps)?

Do you see similar errors on other days as well (e.g. in error.log.1, etc.)?
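The ps check could be scripted along these lines. This is only a sketch: it assumes `ps -f -C netdata` output in the shape posted in this thread, where each main daemon is accompanied by a helper started with --special-spawn-server, and the function name is hypothetical.

```shell
#!/bin/sh
# Count main netdata daemons in `ps -f -C netdata` output read from stdin.
# Skips the header line and any helper started with --special-spawn-server.
# Assumption: every other netdata process listed is a main daemon.
count_netdata_daemons() {
    awk 'NR > 1 && $0 !~ /--special-spawn-server/ { n++ } END { print n + 0 }'
}

# Example usage:
#   ps -f -C netdata | count_netdata_daemons
```

A result greater than 1 would suggest that more than one agent is writing to the same dbengine files.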

Hi @Manolis_Vasilakis
Ah - yes, it looks like I have 2 instances running somehow

ps -f -C netdata
UID        PID  PPID  C STIME TTY          TIME CMD
netdata  15898     1  4 17:59 ?        00:01:17 /opt/netdata/bin/srv/netdata -P /run/netdata/netdata.pid -D
netdata  15900 15898  0 17:59 ?        00:00:00 /opt/netdata/bin/srv/netdata --special-spawn-server
netdata  27713     1  3 Oct19 ?        04:58:14 /opt/netdata/bin/srv/netdata
netdata  27716 27713  0 Oct19 ?        00:00:00 /opt/netdata/bin/srv/netdata --special-spawn-server

how do you suggest I prevent this?
thanks for the help!

Hi @draki

Not much you can do from your side (we need to make sure this doesn’t happen), but try to kill both and then start normally.

I’m afraid this has likely caused some issues with your dbengine, since 2 instances are writing to the same data files. One suggestion would be to delete /var/cache/netdata/* and start fresh.

Sorry this affected you.
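As a rough sketch of the steps suggested above, the function below only prints the commands rather than running them, so they can be reviewed first. Everything here is an assumption to adapt: a systemd unit named netdata, the availability of pkill, and a default cache directory taken from the cache_dir value in the logs posted in this thread (the installation under /opt/netdata). The function name is hypothetical.

```shell
#!/bin/sh
# Dry run: print the recovery steps instead of executing them.
# Assumptions: systemd unit named "netdata", pkill available, and the
# cache directory passed as $1 (default taken from the logs in this thread).
# Note: the printed rm line is not quoted, so avoid paths with spaces.
netdata_reset_plan() {
    cache_dir=${1:-/opt/netdata/var/cache/netdata}
    printf '%s\n' \
        'systemctl stop netdata' \
        'pkill -x netdata' \
        'ps -f -C netdata' \
        "rm -rf -- ${cache_dir}/*" \
        'systemctl start netdata'
}

# Example usage:
#   netdata_reset_plan
#   netdata_reset_plan /var/cache/netdata
```

The ps step is there to confirm that no netdata process is left before the cache is removed; deleting the cache discards the stored metric history.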

No worries - thanks Manolis

Hello,

Same here: I have the DBENGINE errors global_io_errors and global_fs_errors. Looking at the logs, it tries to delete a file that is already deleted, and I also have 2 instances:

ps -f -C netdata
UID        PID  PPID  C STIME TTY          TIME CMD
netdata  17158     1  2 Nov07 ?        00:45:22 /opt/netdata/bin/srv/netdata
netdata  17162 17158  0 Nov07 ?        00:00:00 /opt/netdata/bin/srv/netdata --special-spawn-server
netdata  18817     1  2 06:53 ?        00:06:19 /opt/netdata/bin/srv/netdata -P /run/netdata/netdata.pid -D
netdata  18820 18817  0 06:53 ?        00:00:00 /opt/netdata/bin/srv/netdata --special-spawn-server

@Manolis_Vasilakis sorry for intervening, @draki what’s your platform?