history in dbengine.. lost?

Bernd · April 8, 2023, 6:44am

Once again with dbengine
netdata version: v1.38.0-354-nightly
Debian Bullseye
My netdata.conf printed by netdata itself:

[db]
	dbengine multihost disk space MB = 256
	dbengine page cache size MB = 32
	storage tiers = 3
	dbengine parallel initialization = yes
	dbengine tier 1 multihost disk space MB = 128
	dbengine tier 1 update every iterations = 60
	dbengine tier 1 backfill = new
	dbengine tier 2 multihost disk space MB = 64
	dbengine tier 2 update every iterations = 60
	dbengine tier 2 backfill = new
	# update every = 1
	# mode = dbengine
	# dbengine extent cache size MB = 0
	# dbengine enable journal integrity check = no
	# dbengine disk space MB = 256
	# memory deduplication (ksm) = yes
	# cleanup obsolete charts after secs = 3600
	# gap when lost iterations above = 1
	# enable replication = yes
	# seconds to replicate = 86400
	# seconds per replication step = 600
	# cleanup orphan hosts after secs = 3600
	# dbengine use direct io = yes
	# dbengine pages per extent = 64
	# delete obsolete charts files = yes
	# delete orphan hosts files = yes
	# enable zero metrics = no
	# replication threads = 1

The problem is that the history only goes back about 8 days.
What might be the reason?
I had many more days in the past.
I will leave netdata.conf untouched and watch the history if no one has an idea what's going on.

Christopher_Akritid1 · April 10, 2023, 7:07pm

Can you please add a screenshot of the chart netdata.dbengine_metrics? It’s the first one under “Netdata Monitoring” → “dbengine metrics”

Retention depends very much on the cardinality of the collected metrics. However, it would be extremely unusual for you to have such a high number of metrics that you can only store 8 days. So please also look in error.log for any hints of something that might have gone wrong at the exact time of the first data you see: corruption, missing files, anything like that.

Bernd · April 10, 2023, 7:37pm

Thank you for your care

In the log file I can see no special message about dbengine, but I could be missing something.

Bernd · April 11, 2023, 7:48am

just in the Morning:

2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: creating new data and journal files in path /var/cache/netdata/dbengine
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: created data file "/var/cache/netdata/dbengine/datafile-1-0000003812.ndf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: created journal file "/var/cache/netdata/dbengine/journalfile-1-0000003812.njf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: journal file 3811 is ready to be indexed
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: recalculating tier 0 retention for 12907 metrics starting with datafile 3778
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: indexing file '/var/cache/netdata/dbengine/journalfile-1-0000003811.njfv2': extents 234, metrics 12822, pages 14976
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: migrated journal file '/var/cache/netdata/dbengine/journalfile-1-0000003811.njfv2', file size 1248196
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: updating tier 0 metrics registry retention for 12907 metrics
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleting data file '/var/cache/netdata/dbengine/datafile-1-0000003777.ndf'.
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleting data and journal files to maintain disk quota
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000003777.njf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000003777.njfv2".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted data file "/var/cache/netdata/dbengine/datafile-1-0000003777.ndf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: reclaimed 7546472 bytes of disk space.

Christopher_Akritid1 · April 11, 2023, 1:40pm

You have a low number of metrics overall (40k/3 ~= 13k metrics, validated by the log message to be exactly 12907).

Your tiering configuration is sufficient for long term storage of many years for that data.

The only way you can possibly not have data beyond 8 days ago is that the database was completely wiped 8 days ago. I suggest you make available the entire error.log for us to look at. I see that whatever happened, it was on April 1st, I hope you didn’t have a colleague playing a joke on you!

Bernd · April 11, 2023, 3:22pm

OK, I will watch how the history develops. This is a bare metal server and I am solely responsible for it, so I rule out an April joke.
Somehow there must have been an accidental deletion, I agree.
But was there no *.nightly that could have triggered something like that?
Nothing that would be vital now, it’s just weird.
But look at this:
3rd April. Something is strange.

Bernd · April 11, 2023, 3:29pm

Can I send the whole error log, gzipped, by email without publishing it in the forum?

Bernd · April 13, 2023, 7:54am

The history is still shrinking; only 7 days are left.
I would like to upload a 36K compressed log file. How can I do this?

Look at this:
x.log.gz

ilyam8 · April 13, 2023, 11:52am

Related feature request.

ilyam8 · April 13, 2023, 12:16pm

@Bernd can you share http://127.0.0.1:19999/api/v1/dbengine_stats response?

Bernd · April 13, 2023, 12:40pm

sure:

{
	"tier0": {
		"default_granularity_secs":1,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":35,
		"extents":7667,
		"extents_pages":490687,
		"points":488409334,
		"metrics":440786,
		"metrics_pages":490687,
		"extents_compressed_bytes":167780659,
		"pages_uncompressed_bytes":1953637336,
		"pages_duration_secs":491254420,
		"single_point_pages":9,
		"first_t":1681274911,
		"last_t":1681389562,
		"database_retention_secs":114651,
		"average_compression_savings":91.41,
		"average_point_duration_secs":1.01,
		"average_metric_retention_secs":1114.50,
		"ephemeral_metrics_per_day_percent":7678.43,
		"average_page_size_bytes":3981.43,
		"estimated_concurrently_collected_metrics":4284,
		"currently_collected_metrics":13383,
		"disk_space":257861916,
		"max_disk_space":268435456
	},
	"tier1": {
		"default_granularity_secs":60,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":16,
		"sizeof_page_data":2048,
		"pages_per_extent":64,
		"datafiles":18,
		"extents":5880,
		"extents_pages":376256,
		"points":24234795,
		"metrics":211844,
		"metrics_pages":376256,
		"extents_compressed_bytes":56823441,
		"pages_uncompressed_bytes":387756720,
		"pages_duration_secs":1462351740,
		"single_point_pages":231,
		"first_t":1681101720,
		"last_t":1681379640,
		"database_retention_secs":277920,
		"average_compression_savings":85.35,
		"average_point_duration_secs":60.34,
		"average_metric_retention_secs":6902.97,
		"ephemeral_metrics_per_day_percent":1220.73,
		"average_page_size_bytes":1030.57,
		"estimated_concurrently_collected_metrics":5261,
		"currently_collected_metrics":13383,
		"disk_space":128530896,
		"max_disk_space":134217728
	},
	"tier2": {
		"default_granularity_secs":3600,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":16,
		"sizeof_page_data":384,
		"pages_per_extent":64,
		"datafiles":8,
		"extents":4095,
		"extents_pages":261886,
		"points":2492465,
		"metrics":92096,
		"metrics_pages":261886,
		"extents_compressed_bytes":17603489,
		"pages_uncompressed_bytes":39879440,
		"pages_duration_secs":9007466400,
		"single_point_pages":19797,
		"first_t":1680660000,
		"last_t":1681362000,
		"database_retention_secs":702000,
		"average_compression_savings":55.86,
		"average_point_duration_secs":3613.88,
		"average_metric_retention_secs":97805.19,
		"ephemeral_metrics_per_day_percent":76.03,
		"average_page_size_bytes":152.28,
		"estimated_concurrently_collected_metrics":12831,
		"currently_collected_metrics":13383,
		"disk_space":59826908,
		"max_disk_space":67108864
	}
}

ilyam8 · April 13, 2023, 1:58pm

With the fields not related to retention time removed:

{
	"tier0": {
		"database_retention_secs":114651
	},
	"tier1": {
		"database_retention_secs":277920
	},
	"tier2": {
		"database_retention_secs":702000
	}
}

tier0: 114651 secs => 1.3 days
tier1: 277920 secs => 3.2 days
tier2: 702000 secs => 8 days
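
The conversion above is just seconds divided by 86400. A minimal sketch using the three values from the response (the function name `retention_days` is made up for illustration):

```python
# Retention values copied from the dbengine_stats response posted in this thread.
retention_secs = {"tier0": 114651, "tier1": 277920, "tier2": 702000}

SECONDS_PER_DAY = 86400


def retention_days(secs: int) -> float:
    """Convert a retention span in seconds to days."""
    return secs / SECONDS_PER_DAY


# Round to one decimal, matching the precision used in the post.
days = {tier: round(retention_days(s), 1) for tier, s in retention_secs.items()}

for tier, d in days.items():
    print(f"{tier}: {d} days")
```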

For longer retention time increase the following values:

dbengine multihost disk space MB (tier0)
dbengine tier 1 multihost disk space MB (tier1)
dbengine tier 2 multihost disk space MB (tier2)

Christopher_Akritid1 · April 13, 2023, 3:46pm

I was wrong to say it should store years, but the number of metrics shown on the chart does not match what this endpoint returns.

Based on the chart we saw, the calculation would be
64MB on disk / 4 bytes per point / 13000 metrics => 1.2k points per metric / 24 hr per day ~= 50 days

In reality, the number of metrics we should use is 92096, as the endpoint shows
64MB on disk / 4 bytes per point / 92096 metrics => ~174 points per metric / 24 hr per day ~= 7.2 days

I’m updating the documentation, to instruct people to use that endpoint in the future.
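
The two estimates above can be reproduced with a few lines. This is only a sketch of the arithmetic in this post: it assumes "64MB" means 64,000,000 bytes, takes 4 bytes per point and 24 points per day (one per hour in tier 2) as given above, and the function name is hypothetical.

```python
def estimated_retention_days(disk_bytes, bytes_per_point, metrics, points_per_day):
    """Rough retention estimate: points that fit per metric, divided by points per day."""
    points_per_metric = disk_bytes / bytes_per_point / metrics
    return points_per_metric / points_per_day


DISK_BYTES = 64 * 1000 * 1000  # assumption: "64MB" read as decimal megabytes
BYTES_PER_POINT = 4            # figure used in the post
POINTS_PER_DAY = 24            # tier 2 stores one point per hour

# Metric count read off the chart vs. the count reported by the endpoint.
chart_estimate = estimated_retention_days(DISK_BYTES, BYTES_PER_POINT, 13000, POINTS_PER_DAY)
endpoint_estimate = estimated_retention_days(DISK_BYTES, BYTES_PER_POINT, 92096, POINTS_PER_DAY)

print(f"chart metrics:    {chart_estimate:.1f} days")
print(f"endpoint metrics: {endpoint_estimate:.1f} days")
```

The only input that differs between the two calls is the metric count, which is why the estimate drops from roughly 50 days to roughly 7.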

Bernd · April 13, 2023, 5:39pm

Thank you both. Would you mind telling me exactly what I have to enter, and where?
What do I have to enter if I want a considerably longer history?
I cannot map your advice to my netdata.conf.
The current config is

[db]
	dbengine multihost disk space MB = 256
	dbengine page cache size MB = 32
	storage tiers = 3
	dbengine parallel initialization = yes
	dbengine tier 1 multihost disk space MB = 128
	dbengine tier 1 update every iterations = 60
	dbengine tier 1 backfill = new
	dbengine tier 2 multihost disk space MB = 64
	dbengine tier 2 update every iterations = 60
	dbengine tier 2 backfill = new

As I stated earlier, space is no concern…

ilyam8 · April 13, 2023, 6:09pm

Change

dbengine multihost disk space MB from 256 to 4096
dbengine tier 1 multihost disk space MB from 128 to 2048
dbengine tier 2 multihost disk space MB from 64 to 1024

Then restart the netdata service.
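
As a sketch, the [db] section posted above would then look like this. Only the three disk space values change; the remaining lines are copied unchanged from the posted config.

```ini
[db]
	dbengine multihost disk space MB = 4096
	dbengine page cache size MB = 32
	storage tiers = 3
	dbengine parallel initialization = yes
	dbengine tier 1 multihost disk space MB = 2048
	dbengine tier 1 update every iterations = 60
	dbengine tier 1 backfill = new
	dbengine tier 2 multihost disk space MB = 1024
	dbengine tier 2 update every iterations = 60
	dbengine tier 2 backfill = new
```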

corymosiman12 · May 23, 2024, 10:48pm

Hi there - I'm having a similar issue. I followed the instructions in here.

I have a parent running in a cloud docker container:

6 child nodes on edge devices
Parent is running on 4 core, 8GB RAM, 100GB disk
Parent version: 1.45.3

Before update:

I then updated my netdata.conf

But looking at the dbengine stats again, I'm getting no meaningful increases.

The service definition of my parent in compose looks like:

  netdata:
    image: netdata/netdata:stable
    container_name: netdata
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
      - /var/log:/host/var/log:ro
      - ./stream.conf:/etc/netdata/stream.conf

Any thoughts / help appreciated

[ EDIT ]
This post may have been a bit premature. Hitting the endpoint now, the tier0 values are increasing, so maybe database_retention_secs is the number of seconds currently recorded? Is there a reference guide for the api/v1/dbengine_stats endpoint? I can’t seem to find the returned fields documented anywhere, so maybe I’m just not sure what this is giving me.
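
One way to check that reading against numbers already in this thread: in the response Bernd posted earlier, database_retention_secs equals last_t minus first_t for every tier. The sketch below only verifies that arithmetic on the posted values. It is an observation about that one response, not a documented definition of the field.

```python
# first_t, last_t and database_retention_secs copied from the response posted earlier in this thread.
tiers = {
    "tier0": {"first_t": 1681274911, "last_t": 1681389562, "database_retention_secs": 114651},
    "tier1": {"first_t": 1681101720, "last_t": 1681379640, "database_retention_secs": 277920},
    "tier2": {"first_t": 1680660000, "last_t": 1681362000, "database_retention_secs": 702000},
}

# Span between the oldest and newest stored timestamps, per tier.
spans = {name: t["last_t"] - t["first_t"] for name, t in tiers.items()}

# True for a tier when the reported retention matches the stored span exactly.
matches = {name: spans[name] == t["database_retention_secs"] for name, t in tiers.items()}

for name in tiers:
    print(f"{name}: last_t - first_t = {spans[name]} s, matches reported value: {matches[name]}")
```

If that reading holds, the value would keep growing after the disk space limits are raised until the new quota fills up, which would fit the increase described above.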
