Streaming and retaining long term metrics

Hi team,

I have one server acting as the master and four other servers as child nodes.
On the master node, I have configured the dbengine disk space (as per the calculation) to retain metrics for 90 days.

[global]
run as user = netdata
dbengine multihost disk space = 284766
# the default database size - 1 hour
history = 3600
update every = 5
apps = no
cgroups = no

However, I can see that metrics are not retained for more than 4 days on average.
Please let me know which logs are required to troubleshoot this issue further.
Below is the result of filtering error.log with grep -i stream | grep -v -i info:

2022-06-22 09:55:03: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:56572] : STREAM london [receive from [192.168.1.18]:56572]: disconnected (completed 31384 updates).
2022-06-22 09:55:04: netdata ERROR : STREAM_RECEIVER[miami,[192.168.1.90]:40140] : STREAM miami [receive from [192.168.1.90]:40140]: disconnected (completed 27563 updates).
2022-06-22 09:55:04: netdata ERROR : STREAM_RECEIVER[hollywood,[192.168.1.30]:59062] : STREAM hollywood [receive from [192.168.1.30]:59062]: disconnected (completed 1906 updates).
2022-06-22 09:55:05: netdata ERROR : STREAM_RECEIVER[vegas,[192.168.1.27]:45496] : STREAM vegas [receive from [192.168.1.27]:45496]: disconnected (completed 37149 updates).
2022-06-22 09:55:07: go.d ERROR: prometheus[stream_exporter_local] Get "http://127.0.0.1:9178/metrics": dial tcp 127.0.0.1:9178: connect: connection refused
2022-06-22 09:55:07: go.d ERROR: prometheus[stream_exporter_local] check failed
2022-06-22 09:55:07: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] Get "http://127.0.0.1:9513/metrics": dial tcp 127.0.0.1:9513: connect: connection refused
2022-06-22 09:55:07: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] check failed
2022-06-22 10:49:59: netdata ERROR : STREAM_RECEIVER[miami,[192.168.1.90]:40150] : STREAM miami [receive from [192.168.1.90]:40150]: disconnected (completed 214152 updates).
2022-06-22 10:49:59: netdata ERROR : STREAM_RECEIVER[hollywood,[192.168.1.30]:59068] : STREAM hollywood [receive from [192.168.1.30]:59068]: disconnected (completed 222798 updates).
2022-06-22 10:50:00: netdata ERROR : STREAM_RECEIVER[vegas,[192.168.1.27]:45696] : STREAM vegas [receive from [192.168.1.27]:45696]: disconnected (completed 281981 updates).
2022-06-22 10:50:00: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:56588] : STREAM london [receive from [192.168.1.18]:56588]: disconnected (completed 249167 updates).
2022-06-22 10:50:01: go.d ERROR: prometheus[stream_exporter_local] Get "http://127.0.0.1:9178/metrics": dial tcp 127.0.0.1:9178: connect: connection refused
2022-06-22 10:50:01: go.d ERROR: prometheus[stream_exporter_local] check failed
2022-06-22 10:50:01: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] Get "http://127.0.0.1:9513/metrics": dial tcp 127.0.0.1:9513: connect: connection refused
2022-06-22 10:50:01: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] check failed
2022-06-22 10:59:14: netdata ERROR : STREAM_RECEIVER[hollywood,[192.168.1.30]:59072] : STREAM hollywood [receive from [192.168.1.30]:59072]: disconnected (completed 37234 updates).
2022-06-22 10:59:15: netdata ERROR : STREAM_RECEIVER[miami,[192.168.1.90]:41866] : STREAM miami [receive from [192.168.1.90]:41866]: disconnected (completed 40442 updates).
2022-06-22 10:59:15: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:58296] : STREAM london [receive from [192.168.1.18]:58296]: disconnected (completed 45954 updates).
2022-06-22 10:59:15: netdata ERROR : STREAM_RECEIVER[vegas,[192.168.1.27]:47212] : STREAM vegas [receive from [192.168.1.27]:47212]: disconnected (completed 54532 updates).
2022-06-22 10:59:37: go.d ERROR: prometheus[stream_exporter_local] Get "http://127.0.0.1:9178/metrics": dial tcp 127.0.0.1:9178: connect: connection refused
2022-06-22 10:59:37: go.d ERROR: prometheus[stream_exporter_local] check failed
2022-06-22 10:59:37: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] Get "http://127.0.0.1:9513/metrics": dial tcp 127.0.0.1:9513: connect: connection refused
2022-06-22 10:59:37: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] check failed
2022-06-22 11:03:18: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:58302] : STREAM london [receive from [192.168.1.18]:58302]: disconnected (completed 19981 updates).
2022-06-22 11:03:19: netdata ERROR : STREAM_RECEIVER[miami,[192.168.1.90]:41870] : STREAM miami [receive from [192.168.1.90]:41870]: disconnected (completed 17537 updates).
2022-06-22 11:03:19: netdata ERROR : STREAM_RECEIVER[hollywood,[192.168.1.30]:59076] : STREAM hollywood [receive from [192.168.1.30]:59076]: disconnected (completed 16240 updates).
2022-06-22 11:03:20: netdata ERROR : STREAM_RECEIVER[vegas,[192.168.1.27]:47216] : STREAM vegas [receive from [192.168.1.27]:47216]: disconnected (completed 23742 updates).
2022-06-22 11:03:21: go.d ERROR: prometheus[stream_exporter_local] Get "http://127.0.0.1:9178/metrics": dial tcp 127.0.0.1:9178: connect: connection refused
2022-06-22 11:03:21: go.d ERROR: prometheus[stream_exporter_local] check failed
2022-06-22 11:03:21: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] Get "http://127.0.0.1:9513/metrics": dial tcp 127.0.0.1:9513: connect: connection refused
2022-06-22 11:03:21: go.d ERROR: prometheus[openconfig_streaming_telemetry_exporter_local] check failed
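
To see how often each child disconnects, the STREAM_RECEIVER lines in the excerpt above can be tallied per host. The following is a minimal sketch: the regular expression is written against the line format shown in this log only, and the names SAMPLE_LOG and tally_disconnects are hypothetical helpers, not part of Netdata.

```python
import re
from collections import defaultdict

# A few lines copied from the log excerpt above, used as sample input.
SAMPLE_LOG = """\
2022-06-22 09:55:03: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:56572] : STREAM london [receive from [192.168.1.18]:56572]: disconnected (completed 31384 updates).
2022-06-22 10:49:59: netdata ERROR : STREAM_RECEIVER[miami,[192.168.1.90]:40150] : STREAM miami [receive from [192.168.1.90]:40150]: disconnected (completed 214152 updates).
2022-06-22 10:50:00: netdata ERROR : STREAM_RECEIVER[london,[192.168.1.18]:56588] : STREAM london [receive from [192.168.1.18]:56588]: disconnected (completed 249167 updates).
2022-06-22 10:50:01: go.d ERROR: prometheus[stream_exporter_local] check failed
"""

# Assumption: the pattern mirrors the line format seen in this thread's log.
DISCONNECT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata ERROR : "
    r"STREAM_RECEIVER\[(?P<host>[^,\]]+),.*"
    r"disconnected \(completed (?P<updates>\d+) updates\)"
)


def tally_disconnects(log_text):
    """Return {host: [(timestamp, completed_updates), ...]} for disconnect lines."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = DISCONNECT_RE.match(line)
        if match:
            events[match.group("host")].append(
                (match.group("ts"), int(match.group("updates")))
            )
    return dict(events)


if __name__ == "__main__":
    for host, items in sorted(tally_disconnects(SAMPLE_LOG).items()):
        print(host, len(items), items)
```

Running the same function over the full error.log would show whether all children disconnect at the same moments or whether only some of them do.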

root@mediapc:/etc/netdata# cat /etc/netdata/stream.conf | grep -v "    #"  | grep -v ^#

[stream]
    enabled = no
    destination =
    api key =
    timeout seconds = 60
    default port = 19999
    send charts matching = *
    buffer size bytes = 10485760
    reconnect delay seconds = 5
    initial clock resync iterations = 60

[xxxxx]
    enabled = yes
    allow from = *
    default history = 3600
    default memory mode = dbengine
    health enabled by default = auto
    default postpone alarms on connect seconds = 60

[MACHINE_GUID]
    enabled = no
    allow from = *
    history = 3600
    memory mode = save
    health enabled = yes
    postpone alarms on connect seconds = 60


root@mediapc:/etc/netdata# netdata -W buildinfo
Version: netdata v1.35.0-54-nightly
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES



root@mediapc:/etc/netdata# netdatacli aclk-state
ACLK Available: Yes
ACLK Version: 2
Protocols Supported: Legacy, Protobuf
Protocol Used: Protobuf
MQTT Version: 3
Claimed: Yes
Claimed Id: xxxx
Cloud URL: https://app.netdata.cloud
Online: Yes
Reconnect count: 1
Banned By Cloud: No
Last Connection Time: 2022-06-22 10:38:22
Last Connection Time + 3 PUBACKs received: 2022-06-22 10:38:23
Last Disconnect Time: 2022-06-22 10:34:28
Received Cloud MQTT Messages: 16
MQTT Messages Confirmed by Remote Broker (PUBACKs): 11

> Node Instance for mGUID: "xxx hostname "mediapc"
	Claimed ID: xxxx
	Node ID: xxxxxxx
	Streaming Hops: 0
	Relationship: self
	Alert Streaming Status:
		Updates: 1
		Batch ID: 46
		Last Acked Seq ID: 839
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 839
	Chart Streaming Status:
		Updates: 1
		Batch ID: 79
		Min Seq ID: 1
		Max Seq ID: 3766
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 1
		Sent Max Seq ID: 3766
		Acked Min Seq ID: 1
		Acked Max Seq ID: 3766

> Node Instance for mGUID: "xxx hostname "vegas"
	Claimed ID: xxxx
	Node ID: xxxxxxx
	Streaming Hops: 1
	Relationship: child
	Streaming Connection Live: true
	Alert Streaming Status:
		Updates: 1
		Batch ID: 119
		Last Acked Seq ID: 4786
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 4786
	Chart Streaming Status:
		Updates: 1
		Batch ID: 23
		Min Seq ID: 3470
		Max Seq ID: 3753
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 3470
		Sent Max Seq ID: 3753
		Acked Min Seq ID: 3470
		Acked Max Seq ID: 3753

> Node Instance for mGUID: "xxx hostname "hollywood"
	Claimed ID: xxxx
	Node ID: xxxxxxx
	Streaming Hops: 1
	Relationship: child
	Streaming Connection Live: true
	Alert Streaming Status:
		Updates: 1
		Batch ID: 20
		Last Acked Seq ID: 919
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 919
	Chart Streaming Status:
		Updates: 1
		Batch ID: 42
		Min Seq ID: 9923
		Max Seq ID: 10841
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 9923
		Sent Max Seq ID: 10841
		Acked Min Seq ID: 9923
		Acked Max Seq ID: 10841

> Node Instance for mGUID: "xxx hostname "miami"
	Claimed ID: xxxx
	Node ID: xxxxxxx
	Streaming Hops: 1
	Relationship: child
	Streaming Connection Live: true
	Alert Streaming Status:
		Updates: 1
		Batch ID: 28
		Last Acked Seq ID: 1076
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 1076
	Chart Streaming Status:
		Updates: 1
		Batch ID: 27
		Min Seq ID: 4537
		Max Seq ID: 7462
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 4537
		Sent Max Seq ID: 7462
		Acked Min Seq ID: 4537
		Acked Max Seq ID: 7462

> Node Instance for mGUID: "xxx hostname "london"
	Claimed ID: xxxx
	Node ID: xxxxxxx
	Streaming Hops: 1
	Relationship: child
	Streaming Connection Live: true
	Alert Streaming Status:
		Updates: 1
		Batch ID: 28
		Last Acked Seq ID: 1180
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 1180
	Chart Streaming Status:
		Updates: 1
		Batch ID: 49
		Min Seq ID: 3425
		Max Seq ID: 3955
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 3425
		Sent Max Seq ID: 3955
		Acked Min Seq ID: 3425
		Acked Max Seq ID: 3955

root@mediapc:/etc/netdata# cat netdata.conf
# netdata configuration
#
# You can download the latest version of this file, using:
#
#  wget -O /etc/netdata/netdata.conf http://localhost:19999/netdata.conf
# or
#  curl -o /etc/netdata/netdata.conf http://localhost:19999/netdata.conf
#
# You can uncomment and change any of the options below.
# The value shown in the commented settings, is the default value.
#

[global]
    run as user = netdata
    dbengine multihost disk space = 284766
    # the default database size - 1 hour
    history = 3600
    update every = 5
    apps = no
    cgroups = no

[directories]
    cache = /netdata/cache
    log = /netdata/logs

[plugins]
    proc = yes
    python.d = yes
    charts.d = yes
    go.d = yes
    idlejitter = no
    ebpf = no
    timex = no

[logs]
    debug log = none
    error log = none
    access log = none

[plugin:proc]
   /proc/softirqs = no
   /proc/interrupts = no
   /proc/pressure = no
   /proc/sys/kernel/random/entropy_avail = no
   /proc/net/snmp6 = no
   /proc/net/softnet_sta = no
   /proc/spl/kstat/zfs/arcstats = no
   /proc/spl/kstat/zfs/pool/state = no
   /proc/net/rpc/nfsd = no
   /proc/net/softnet_stat = no
   /proc/net/sockstat6 = no
   /proc/net/sockstat = no
   /proc/net/ip_vs/stats = no
   /proc/net/stat/conntrack = no
   /proc/net/stat/synproxy = no
   ipc = no
   /sys/class/infiniband = no

[plugin:proc:/proc/meminfo]
   slab memory = no

[plugin:proc:diskspace]
     check for new mount points every = 15
     exclude space metrics on paths = /boot/* /run/* /dev/* /proc/* /sys/* /var/run/user/* /run/user/* /snap/* /var/lib/docker/*

[plugin:timex]
    clock synchronization state = no
    time offset = no
    
[plugin:proc:/proc/net/sockstat6]
	ipv6 TCP sockets = no
	ipv6 UDP sockets = no
	ipv6 UDPLITE sockets = no
	ipv6 RAW sockets = no
	ipv6 FRAG sockets = no

[plugin:proc:/proc/net/snmp6]
	ipv6 packets = no
	ipv6 fragments sent = no
	ipv6 fragments assembly = no
	ipv6 errors = no
	ipv6 UDP packets = no
	ipv6 UDP errors = no
	ipv6 UDPlite packets = no
	ipv6 UDPlite errors = no
	bandwidth = no
	multicast bandwidth = no
	broadcast bandwidth = no
	multicast packets = no
	icmp = no
	icmp redirects = no
	icmp errors = no
	icmp echos = no
	icmp group membership = no
	icmp router = no
	icmp neighbor = no
	icmp mldv2 = no
	icmp types = no
	ect = no

How many metrics does each child collect? And how many does the parent collect?

Is ‘update every’ set to 5 on each child as well?

The calculator is on the "Change how long Netdata stores metrics" page on Learn Netdata, and it is fairly accurate. That page also explains where you can see the number of metrics collected on each child.
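
For a rough sense of how the inputs to such a calculation interact, here is a minimal sketch. It is a simple approximation and not necessarily the formula the calculator uses. The function name, the metric count, and the bytes_per_sample value are all assumptions chosen for illustration; the real average size of a stored sample depends on the Netdata version and on how well the data compresses.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_space_mib(total_metrics, update_every_s, retention_days,
                            bytes_per_sample):
    """Approximate disk space (MiB) needed to keep every sample.

    total_metrics    -- metrics across the parent and all children (assumed)
    update_every_s   -- collection interval in seconds
    retention_days   -- how long samples should be kept
    bytes_per_sample -- assumed average on-disk size of one sample
    """
    samples_per_metric = retention_days * SECONDS_PER_DAY / update_every_s
    total_bytes = total_metrics * samples_per_metric * bytes_per_sample
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 metrics in total and 1 byte per stored sample.
    mib = estimate_disk_space_mib(
        total_metrics=10_000,
        update_every_s=5,
        retention_days=90,
        bytes_per_sample=1.0,
    )
    print(f"approximately {mib:.0f} MiB")
```

The estimate scales linearly with the number of metrics, which is why the per-child metric counts asked about above matter.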

Hi Christopher,

Firstly, thank you for your prompt response; I highly appreciate it.
Yes, ‘update every’ is set to 5 on all children as well.
Please find attached a screenshot of the dashboard.

This looks like a bug, unless I am missing something. To get 90 days of retention, you should only need the following on the parent:

[global]
    dbengine multihost disk space = 44495

You have configured more than that. I will try to get more help to debug this.
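
The size of the gap can be put in numbers using only figures from this thread. The sketch below assumes that retention scales linearly with the configured disk space, which is a simplification; the function and constant names are hypothetical.

```python
CONFIGURED_MB = 284766   # "dbengine multihost disk space" from netdata.conf
REFERENCE_MB = 44495     # size suggested in this thread for 90 days
REFERENCE_DAYS = 90
OBSERVED_DAYS = 4        # retention reported in the original post


def expected_retention_days(configured_mb, reference_mb, reference_days):
    """Scale the reference retention by the ratio of disk space (assumes linearity)."""
    return configured_mb / reference_mb * reference_days


expected = expected_retention_days(CONFIGURED_MB, REFERENCE_MB, REFERENCE_DAYS)
shortfall = expected / OBSERVED_DAYS

if __name__ == "__main__":
    print(f"expected retention: about {expected:.0f} days")
    print(f"observed retention is about {shortfall:.0f}x shorter")
```

With these numbers the configured space would be expected to hold roughly 576 days of data, about 144 times the 4 days observed, so the configured size alone does not explain the short retention.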

Thank you @Christopher_Akritid1 for looking into this.
Kindly keep me posted on this. Let me know if you need any logs for troubleshooting this issue further.

Hi @Vicky_Ingle! Can you please send a full, recent error.log to manolis at netdata dot cloud? Thanks!

Kindly let me know how I should share the logs with you. The upload option in the forum only allows me to upload pictures.

Please email the logs to the address Manolis mentions above. Thanks.

I have emailed him the relevant logs and information.