history in dbengine.. lost?

Once again with dbengine
netdata version: v1.38.0-354-nightly
Debian Bullseye
My netdata.conf printed by netdata itself:

[db]
	dbengine multihost disk space MB = 256
	dbengine page cache size MB = 32
	storage tiers = 3
	dbengine parallel initialization = yes
	dbengine tier 1 multihost disk space MB = 128
	dbengine tier 1 update every iterations = 60
	dbengine tier 1 backfill = new
	dbengine tier 2 multihost disk space MB = 64
	dbengine tier 2 update every iterations = 60
	dbengine tier 2 backfill = new
	# update every = 1
	# mode = dbengine
	# dbengine extent cache size MB = 0
	# dbengine enable journal integrity check = no
	# dbengine disk space MB = 256
	# memory deduplication (ksm) = yes
	# cleanup obsolete charts after secs = 3600
	# gap when lost iterations above = 1
	# enable replication = yes
	# seconds to replicate = 86400
	# seconds per replication step = 600
	# cleanup orphan hosts after secs = 3600
	# dbengine use direct io = yes
	# dbengine pages per extent = 64
	# delete obsolete charts files = yes
	# delete orphan hosts files = yes
	# enable zero metrics = no
	# replication threads = 1

The problem is that the history only goes back about 8 days :frowning:
So what might be the reason?
I had many more days in the past.
I will leave netdata.conf untouched and watch the history if no one has an idea what's going on.

Can you please add a screenshot of the chart netdata.dbengine_metrics? It’s the first one under “Netdata Monitoring” → “dbengine metrics”

Retention depends very much on the cardinality of the collected metrics. However, it would be extremely unusual for you to have such a high number of metrics that you can only store 8 days. So please also look for any hints in error.log of anything that might have gone wrong at the exact time of the first data you see. Something like a corruption, missing files, anything like that.

Thank you for your care


In the log file there is no special dbengine message visible to me, but I could have missed something.

just in the Morning:

2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: creating new data and journal files in path /var/cache/netdata/dbengine
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: created data file "/var/cache/netdata/dbengine/datafile-1-0000003812.ndf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: created journal file "/var/cache/netdata/dbengine/journalfile-1-0000003812.njf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: journal file 3811 is ready to be indexed
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: recalculating tier 0 retention for 12907 metrics starting with datafile 3778
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: indexing file '/var/cache/netdata/dbengine/journalfile-1-0000003811.njfv2': extents 234, metrics 12822, pages 14976
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: migrated journal file '/var/cache/netdata/dbengine/journalfile-1-0000003811.njfv2', file size 1248196
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: updating tier 0 metrics registry retention for 12907 metrics
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleting data file '/var/cache/netdata/dbengine/datafile-1-0000003777.ndf'.
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleting data and journal files to maintain disk quota
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000003777.njf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted journal file "/var/cache/netdata/dbengine/journalfile-1-0000003777.njfv2".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: deleted data file "/var/cache/netdata/dbengine/datafile-1-0000003777.ndf".
2023-04-11 09:46:08: netdata INFO  : MAIN : DBENGINE: reclaimed 7546472 bytes of disk space.

You have a low number of metrics overall (40k/3 ~= 13k metrics, validated by the log message to be exactly 12907).

Your tiering configuration is sufficient for long term storage of many years for that data.

The only way you can possibly not have data beyond 8 days ago is that the database was completely wiped 8 days ago. I suggest you make available the entire error.log for us to look at. I see that whatever happened, it was on April 1st, I hope you didn’t have a colleague playing a joke on you! :slight_smile:

OK, I will watch how the history develops. This is a bare metal server and I am solely responsible for it, so I can rule out an April Fools' joke.
Somehow there must have been an accidental deletion, I agree.
But there was no *.nightly that could have triggered something like that?
Nothing that would be vital now, it’s just weird.
But look at this:
3rd April. something is strange

Can I provide the whole error log, gzipped, by email without publishing it in the forum?

the history is still moving, only 7 days are left.
I would like to upload a 36K compressed log file, how can I do this?

Look at this:
x.log.gz

Related feature request.

@Bernd can you share http://127.0.0.1:19999/api/v1/dbengine_stats response?

sure:

{
	"tier0": {
		"default_granularity_secs":1,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":4,
		"sizeof_page_data":4096,
		"pages_per_extent":64,
		"datafiles":35,
		"extents":7667,
		"extents_pages":490687,
		"points":488409334,
		"metrics":440786,
		"metrics_pages":490687,
		"extents_compressed_bytes":167780659,
		"pages_uncompressed_bytes":1953637336,
		"pages_duration_secs":491254420,
		"single_point_pages":9,
		"first_t":1681274911,
		"last_t":1681389562,
		"database_retention_secs":114651,
		"average_compression_savings":91.41,
		"average_point_duration_secs":1.01,
		"average_metric_retention_secs":1114.50,
		"ephemeral_metrics_per_day_percent":7678.43,
		"average_page_size_bytes":3981.43,
		"estimated_concurrently_collected_metrics":4284,
		"currently_collected_metrics":13383,
		"disk_space":257861916,
		"max_disk_space":268435456
	},
	"tier1": {
		"default_granularity_secs":60,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":16,
		"sizeof_page_data":2048,
		"pages_per_extent":64,
		"datafiles":18,
		"extents":5880,
		"extents_pages":376256,
		"points":24234795,
		"metrics":211844,
		"metrics_pages":376256,
		"extents_compressed_bytes":56823441,
		"pages_uncompressed_bytes":387756720,
		"pages_duration_secs":1462351740,
		"single_point_pages":231,
		"first_t":1681101720,
		"last_t":1681379640,
		"database_retention_secs":277920,
		"average_compression_savings":85.35,
		"average_point_duration_secs":60.34,
		"average_metric_retention_secs":6902.97,
		"ephemeral_metrics_per_day_percent":1220.73,
		"average_page_size_bytes":1030.57,
		"estimated_concurrently_collected_metrics":5261,
		"currently_collected_metrics":13383,
		"disk_space":128530896,
		"max_disk_space":134217728
	},
	"tier2": {
		"default_granularity_secs":3600,
		"sizeof_datafile":288,
		"sizeof_page_in_cache":0,
		"sizeof_point_data":16,
		"sizeof_page_data":384,
		"pages_per_extent":64,
		"datafiles":8,
		"extents":4095,
		"extents_pages":261886,
		"points":2492465,
		"metrics":92096,
		"metrics_pages":261886,
		"extents_compressed_bytes":17603489,
		"pages_uncompressed_bytes":39879440,
		"pages_duration_secs":9007466400,
		"single_point_pages":19797,
		"first_t":1680660000,
		"last_t":1681362000,
		"database_retention_secs":702000,
		"average_compression_savings":55.86,
		"average_point_duration_secs":3613.88,
		"average_metric_retention_secs":97805.19,
		"ephemeral_metrics_per_day_percent":76.03,
		"average_page_size_bytes":152.28,
		"estimated_concurrently_collected_metrics":12831,
		"currently_collected_metrics":13383,
		"disk_space":59826908,
		"max_disk_space":67108864
	}
}

With the fields not related to retention time removed:

{
	"tier0": {
		"database_retention_secs":114651
	},
	"tier1": {
		"database_retention_secs":277920
	},
	"tier2": {
		"database_retention_secs":702000
	}
}
  • tier0: 114651 secs => 1.3 days
  • tier1: 277920 secs => 3.2 days
  • tier2: 702000 secs => 8 days
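
The conversion above is just a division by 86400 seconds per day. A minimal Python sketch, using the three values copied from the response posted in this thread (the variable and function names are only for illustration, they are not part of Netdata):

```python
# Seconds in one day, used to convert a retention span to days.
SECONDS_PER_DAY = 86_400

# database_retention_secs per tier, copied from the response above.
retention_secs = {"tier0": 114651, "tier1": 277920, "tier2": 702000}


def retention_days(secs):
    """Convert a retention span in seconds to days."""
    return secs / SECONDS_PER_DAY


for tier, secs in retention_secs.items():
    print(f"{tier}: {secs} secs => {retention_days(secs):.1f} days")
```

The same function can be applied to the `database_retention_secs` field of any tier returned by the endpoint.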

For longer retention time increase the following values:

  • dbengine multihost disk space MB (tier0)
  • dbengine tier 1 multihost disk space MB (tier1)
  • dbengine tier 2 multihost disk space MB (tier2)

I was wrong to say it should store years, but the number of metrics shown on the chart does not match what this endpoint returns.

Based on the chart we saw, the calculation would be
64MB on disk / 4 bytes per point / 13000 metrics => 1.2k points per metric / 24 hr per day ~= 50 days

In reality, the metrics we should use are 92096, as the endpoint shows
64MB on disk / 4 bytes per point / 92096 metrics => ~174 points per metric / 24 hr per day ~= 7.2 days
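
The back-of-the-envelope estimate above can be written as a small Python sketch. It follows the same assumptions as the calculation in this post: 64 MB taken as 64,000,000 bytes, about 4 bytes per stored point, and one point per hour in tier 2 (24 points per day). The function name and parameters are illustrative only.

```python
def estimate_retention_days(disk_bytes, bytes_per_point, metrics, points_per_day):
    """Rough estimate: how many days of points fit on disk for each metric."""
    points_per_metric = disk_bytes / bytes_per_point / metrics
    return points_per_metric / points_per_day


TIER2_DISK_BYTES = 64_000_000  # 64 MB, taken as 10^6 bytes per MB (assumption)
BYTES_PER_POINT = 4            # per-point size used in the estimate above
POINTS_PER_DAY = 24            # tier 2 granularity is 3600 s, one point per hour

for metrics in (13_000, 92_096):
    days = estimate_retention_days(
        TIER2_DISK_BYTES, BYTES_PER_POINT, metrics, POINTS_PER_DAY
    )
    print(f"{metrics} metrics: about {days:.1f} days")
```

This is only a rough estimate: it ignores per-page and per-file overhead, so the real retention reported by the endpoint can differ.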

I’m updating the documentation, to instruct people to use that endpoint in the future.

Thank you both. Would you mind telling me exactly what I have to enter, and where?
What do I have to set if I want a considerably longer history available?
I cannot map your advice to my netdata.conf.
The current config is

[db]
	dbengine multihost disk space MB = 256
	dbengine page cache size MB = 32
	storage tiers = 3
	dbengine parallel initialization = yes
	dbengine tier 1 multihost disk space MB = 128
	dbengine tier 1 update every iterations = 60
	dbengine tier 1 backfill = new
	dbengine tier 2 multihost disk space MB = 64
	dbengine tier 2 update every iterations = 60
	dbengine tier 2 backfill = new

As I stated earlier, space is no concern…

Change

  • dbengine multihost disk space MB from 256 to 4096
  • dbengine tier 1 multihost disk space MB from 128 to 2048
  • dbengine tier 2 multihost disk space MB from 64 to 1024

Then restart the netdata service.
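
For a rough idea of what these values could give, here is a Python sketch that scales the retention currently reported per tier by the increase in disk space. It assumes retention grows linearly with the disk space limit and that the number of collected metrics stays about the same; that is an assumption for illustration, not something Netdata guarantees. All names in the snippet are made up for this example.

```python
SECONDS_PER_DAY = 86_400

# (current database_retention_secs, old limit in MB, new limit in MB),
# all values taken from this thread.
tiers = {
    "tier0": (114651, 256, 4096),
    "tier1": (277920, 128, 2048),
    "tier2": (702000, 64, 1024),
}


def projected_days(current_secs, old_mb, new_mb):
    """Scale current retention by the disk space ratio (linear-scaling assumption)."""
    return current_secs / SECONDS_PER_DAY * (new_mb / old_mb)


for tier, (secs, old_mb, new_mb) in tiers.items():
    print(f"{tier}: roughly {projected_days(secs, old_mb, new_mb):.0f} days")
```

The longer retention only shows up over time, as new data fills the larger quota.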


Hi there - I'm having a similar issue. I followed the instructions here.

I have a parent running in a cloud docker container:

  • 6 child nodes on edge devices
  • Parent is running on 4 core, 8GB RAM, 100GB disk
  • Parent version: 1.45.3

Before update:

I then updated my netdata.conf

But looking at the dbengine stats again, I'm getting no meaningful increases.

The service definition of my parent in compose looks like:

  netdata:
    image: netdata/netdata:stable
    container_name: netdata
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
      - /var/log:/host/var/log:ro
      - ./stream.conf:/etc/netdata/stream.conf

Any thoughts / help appreciated

[ EDIT ]
This post may have been a bit premature. Hitting the endpoint now, the tier0 values are increasing, so maybe database_retention_secs is the number of seconds currently recorded? Is there a reference guide for the api/v1/dbengine_stats endpoint? I can’t seem to find the returned fields documented anywhere, so maybe I’m just not exactly sure what this is giving me.
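
One observation that may help with this question: in the full response posted earlier in this thread, database_retention_secs equals last_t minus first_t for every tier. That suggests the field is the time span currently stored rather than a configured limit. The Python sketch below only checks this against that one response; it is not taken from Netdata documentation, and the helper name is made up for this example.

```python
# first_t, last_t and database_retention_secs per tier, copied from the
# response posted earlier in this thread.
stats = {
    "tier0": {"first_t": 1681274911, "last_t": 1681389562,
              "database_retention_secs": 114651},
    "tier1": {"first_t": 1681101720, "last_t": 1681379640,
              "database_retention_secs": 277920},
    "tier2": {"first_t": 1680660000, "last_t": 1681362000,
              "database_retention_secs": 702000},
}


def stored_span_secs(tier_stats):
    """Time between the oldest and newest stored point, in seconds."""
    return tier_stats["last_t"] - tier_stats["first_t"]


for tier, tier_stats in stats.items():
    span = stored_span_secs(tier_stats)
    matches = span == tier_stats["database_retention_secs"]
    print(f"{tier}: last_t - first_t = {span}, matches reported value: {matches}")
```

If that reading is right, the value would keep growing after a quota increase until the new limit is filled.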