Error netdata.dbengine_long_term_page_stats streaming / dbengine storage tiers

Hello,

I have a Netdata “master” which receives streaming data from around 20 other servers.
After a while (roughly every 2 weeks) I receive this type of error for every server:

SERVER is critical, `netdata.dbengine_long_term_page_stats` (*dbengine* ), **10min dbengine global flushing errors =**

I don’t really understand what this means …

My Netdata “master” configuration is the following:

[global]
  hostname = NETDATA-MASTER-PROD
  enabled = yes
[db]
  mode = dbengine
  update every = 1
  storage tiers = 3
  dbengine multihost disk space MB = 1100
  dbengine tier 1 multihost disk space MB = 330
  dbengine tier 2 multihost disk space MB = 67
  cache directory = /home/netdata
[web]
  bind to = *
  web files owner = root
  web files group = netdata
[directories]
  cache = /home/netdata
  home = /home/netdata
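
For reference, the three disk space settings in the `[db]` section above add up to 1100 + 330 + 67 = 1497 MB across all tiers. The short Python sketch below only does that arithmetic on a copy of the posted section. The helper name `tier_budgets_mb` is hypothetical, and treating the key without a tier number as tier 0 is an assumption made for illustration.

```python
import re

# Copy of the [db] section from the post above.
CONF = """\
[db]
  mode = dbengine
  update every = 1
  storage tiers = 3
  dbengine multihost disk space MB = 1100
  dbengine tier 1 multihost disk space MB = 330
  dbengine tier 2 multihost disk space MB = 67
"""

# Matches "dbengine multihost disk space MB" and its "tier N" variants.
KEY_RE = re.compile(r"^dbengine (?:tier (\d+) )?multihost disk space MB$")


def tier_budgets_mb(conf_text):
    """Return {tier_number: budget_in_MB} for every dbengine disk space key."""
    budgets = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, section headers and anything without "=".
        if not line or line.startswith(("#", "[")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        match = KEY_RE.match(key)
        if match:
            # Assumption: the key without "tier N" is counted as tier 0.
            tier = int(match.group(1) or 0)
            budgets[tier] = int(value)
    return budgets


budgets = tier_budgets_mb(CONF)
total_mb = sum(budgets.values())
print(budgets, total_mb)
```

This says nothing about whether the budgets are appropriate for 20 streamed children; it only makes the configured total explicit.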

I have disk space available, as well as RAM and CPU, on my master.

Version of the master is

netdata v1.38.1

All servers are running on Ubuntu 18.04

Is this error caused by a misconfiguration somewhere?

Does anyone have any idea what needs to change to avoid this issue?

Thanks for your help !

DeWaRs

Hi @DeWaRs1206, the configuration looks OK, but before we can dig into this further, would it be possible to upgrade to a more recent version of Netdata?

Some additional questions:

Do you receive this error on the parent node, or on all or some of the children?
What type of storage do those nodes use?

Hello @Stelios_Fragkakis

Thanks for your answer. I’m using the RPM package for the installation and 1.38 is the latest version available.
https://repo.netdata.cloud/repos/stable/ubuntu/bionic/

The error is on almost all children I would say, the alarm is sent by the parent.
The config on the children is the following:


[global]
  hostname = RABBIT-MEMCACHE-PROD
  enabled = yes
[web]
  bind to = *
  web files owner = root
  web files group = netdata

Version on the children is

root@test:~# netdata -v
netdata v1.37.0-17-nightly

Error log on the last impacted server:

2023-07-14 03:58:42: tc-qos-helper.sh: WARNING: FireQOS is not installed on this system. Use FireQOS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/tree/master/collectors/tc.plugin#tcplugin
2023-07-14 03:58:42: tc-qos-helper.sh: WARNING: Cannot find file '/usr/lib/netdata/conf.d/tc-qos-helper.conf'.
2023-07-14 03:58:42: tc-qos-helper.sh: WARNING: Cannot find file '/etc/netdata/tc-qos-helper.conf'.
2023-07-14 04:14:44: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2023-07-14 04:14:44: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000005933.ndf".
2023-07-14 04:14:44: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000005933.njf".
2023-07-14 04:31:49: netdata INFO  : MAIN : Deleting data file "/var/cache/netdata/dbengine/datafile-1-0000005915.ndf".
2023-07-14 04:31:49: netdata INFO  : MAIN : Deleting data and journal file pair.
2023-07-14 04:31:49: netdata INFO  : MAIN : Deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000005915.njf".
2023-07-14 04:31:49: netdata INFO  : MAIN : Deleted data file "/var/cache/netdata/dbengine/datafile-1-0000005915.ndf".
2023-07-14 04:31:49: netdata INFO  : MAIN : Reclaimed 14729216 bytes of disk space.
2023-07-14 04:35:30: netdata INFO  : MAIN : METADATA: Checking dimensions starting after row 0
2023-07-14 04:35:30: netdata INFO  : MAIN : METADATA: Checked 3464, deleted 0 -- will resume after row 0 in 3600 seconds
2023-07-14 04:58:44: tc-qos-helper.sh: WARNING: FireQOS is not installed on this system. Use FireQOS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/tree/master/collectors/tc.plugin#tcplugin
2023-07-14 04:58:44: tc-qos-helper.sh: WARNING: Cannot find file '/usr/lib/netdata/conf.d/tc-qos-helper.conf'.
2023-07-14 04:58:44: tc-qos-helper.sh: WARNING: Cannot find file '/etc/netdata/tc-qos-helper.conf'.
2023-07-14 05:11:42: netdata INFO  : ANALYTICS : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'META' '-' '-'
2023-07-14 05:35:31: netdata INFO  : MAIN : METADATA: Checking dimensions starting after row 0
2023-07-14 05:35:31: netdata INFO  : MAIN : METADATA: Checked 3464, deleted 0 -- will resume after row 0 in 3600 seconds
2023-07-14 05:57:14: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2023-07-14 05:57:14: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000005934.ndf".
2023-07-14 05:57:14: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000005934.njf".
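
As a rough way to read the log above, the Python sketch below pulls out the "Created data file" timestamps and the "Reclaimed ... bytes" figure from three of the posted lines. The function name `summarize` and the regular expressions are illustrative assumptions based only on the line format shown here, not on any documented Netdata log format.

```python
import re
from datetime import datetime

# Three lines copied from the log excerpt above.
LOG = """\
2023-07-14 04:14:44: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000005933.ndf".
2023-07-14 04:31:49: netdata INFO  : MAIN : Reclaimed 14729216 bytes of disk space.
2023-07-14 05:57:14: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000005934.ndf".
"""

CREATED_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): .*Created data file "[^"]+"'
)
RECLAIMED_RE = re.compile(r"Reclaimed (\d+) bytes of disk space")


def summarize(log_text):
    """Return (seconds between consecutive data file creations, total bytes reclaimed)."""
    created = []
    reclaimed = 0
    for line in log_text.splitlines():
        match = CREATED_RE.search(line)
        if match:
            created.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
        match = RECLAIMED_RE.search(line)
        if match:
            reclaimed += int(match.group(1))
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(created, created[1:])]
    return gaps, reclaimed


gaps, reclaimed = summarize(LOG)
print(gaps, reclaimed)
```

On this excerpt the two data files were created 6150 seconds apart (about 1 hour 42 minutes), and one rotation reclaimed 14729216 bytes. The excerpt contains only INFO and WARNING lines, so it does not by itself show the flushing errors the alert refers to.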

About the type of storage, I’m not sure what you mean, maybe

root@ns3131326:/var/log/netdata# fdisk -k
fdisk: invalid option -- 'k'
Try 'fdisk --help' for more information.
root@ns3131326:/var/log/netdata# fdisk -l
Disk /dev/nvme0n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 52D39457-95A3-49AF-B66F-EEE7E5413E1E

Device             Start       End   Sectors   Size Type
/dev/nvme0n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme0n1p2   1048576   2095103   1046528   511M Linux RAID
/dev/nvme0n1p3   2095104  43053055  40957952  19.5G Linux RAID
/dev/nvme0n1p4  43053056 878036991 834983936 398.2G Linux RAID
/dev/nvme0n1p5 878036992 879083519   1046528   511M Linux swap


Disk /dev/nvme1n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: E6F13B79-A2CC-42E2-A705-1F531DAB9A5B

Device             Start       End   Sectors   Size Type
/dev/nvme1n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme1n1p2   1048576   2095103   1046528   511M Linux RAID
/dev/nvme1n1p3   2095104  43053055  40957952  19.5G Linux RAID
/dev/nvme1n1p4  43053056 878036991 834983936 398.2G Linux RAID
/dev/nvme1n1p5 878036992 879083519   1046528   511M Linux swap


Disk /dev/md2: 511 MiB, 535756800 bytes, 1046400 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/md3: 19.5 GiB, 20970405888 bytes, 40957824 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/md4: 398.2 GiB, 427511709696 bytes, 834983808 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

One more piece of information: if I delete the folder /home/netdata (which I just did), the alarm stops, but it starts again after a couple of weeks.
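
Since clearing /home/netdata makes the alarm go away for a while, it may be worth recording how much space each subdirectory there uses over time. The Python sketch below is one way to do that, under the assumption that the script can read the directory. The helper name `usage_by_subdir_mb` is hypothetical, and it reports whatever subdirectories exist rather than assuming specific names.

```python
import os


def usage_by_subdir_mb(root):
    """Return {subdirectory_name: size_in_MiB} for each directory directly under root."""
    usage = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                try:
                    total += os.path.getsize(path)
                except OSError:
                    # The file may have been rotated away while we were walking.
                    pass
        usage[entry.name] = total / (1024 * 1024)
    return usage


if __name__ == "__main__":
    root = "/home/netdata"  # cache directory from the configuration in this thread
    if os.path.isdir(root):
        for name, size in sorted(usage_by_subdir_mb(root).items()):
            print(f"{name}: {size:.1f} MiB")
```

Running it periodically (for example from cron) and comparing the numbers around the time the alarm returns would show whether disk usage is near the configured limits when the errors start. It does not identify the cause on its own.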

Best regards

Emmanuel

Hi Emmanuel,

Just so we’re on the same page: you mention RPM but then refer to our Ubuntu Bionic repository. Ubuntu uses Debian packaging, not RPMs. If you are indeed using Ubuntu Bionic 18.04, I should point out that this version reached end-of-life on 2023-05-31. This is why there are no newer packages in the repository you referred to. There are ways to overcome that by using our static builds, but unless you are using Canonical’s Extended Security Maintenance channel as part of Ubuntu Pro, it might be prudent to upgrade your nodes to a newer version of Ubuntu.

The version on the child (1.37.0-17-nightly) is even older. What (version of the) OS is this running on? The file /etc/lsb-release should help answer this.

Regardless of the above, @Stelios_Fragkakis will try to help you further with the actual issue you mentioned.

All our servers are running Ubuntu 18.04.
I will upgrade the parent server in August, but for the rest it will take more time.
This morning, the errors started appearing again. I disabled the following config

  dbengine tier 1 multihost disk space MB = 330
  dbengine tier 2 multihost disk space MB = 67

and restarted Netdata for now.

If you have any other clues until I update the version, please let me know :wink: