Suggested template:
Problem/Question
I’m trying to setup a local-only monitoring stack with a central netdata collector, providing a web interface for itself and its children that stream metrics to it.
My children are frequently (and seemingly randomly) disappearing from the parent node. When they reappear, if they reappear, it’s often with only a few metrics.
Relevant docs you followed/actions you took to solve the issue
-
Installed netdata with the kickstart script
-
Configured TLS with a self-signed cert via documentation on the parent node
-
Configured streaming on parent node and children via documentation
-
After noticing issues, I began tweaking dbengine settings involving backfill and size to try and eliminate the problem.
Parent netdata.conf
[global]
run as user = netdata
# default storage size - increase for longer data retention
page cache size = 32
dbengine multihost disk space = 256
[web]
ssl key = /etc/netdata/ssl/key.pem
ssl certificate = /etc/netdata/ssl/cert.pem
[db]
replication threads = 3
mode = dbengine
# per second data collection
update every = 1
# number of tiers used (1 to 5, 3 being default)
storage tiers = 3
# Tier 0, per second data
dbengine multihost disk space MB = 4096
# Tier 1, per minute data
dbengine tier 1 multihost disk space MB = 8192
dbengine tier 1 update every iterations = 60
# Tier 2, per hour data
dbengine tier 2 multihost disk space MB = 8192
dbengine tier 2 update every iterations = 60
dbengine page cache size MB = 4096
dbengine extent cache size MB = 64
dbengine enable journal integrity check = yes
dbengine disk space MB = 8192
memory deduplication (ksm) = yes
# cleanup obsolete charts after secs = 3600
# gap when lost iterations above = 1
# enable replication = yes
# seconds to replicate = 86400
# seconds per replication step = 600
cleanup orphan hosts after secs = 36000000
# dbengine use direct io = yes
# dbengine pages per extent = 64
dbengine parallel initialization = yes
# dbengine tier 1 backfill = new
# dbengine tier 2 backfill = new
# delete obsolete charts files = yes
delete orphan hosts files = no
enable zero metrics = yes
Parent stream.conf
# -----------------------------------------------------------------------------
# 1. ON CHILD NETDATA - THE ONE THAT WILL BE SENDING METRICS
[stream]
# Enable this on child nodes, to have them send metrics.
enabled = no
# Where is the receiving netdata?
# A space separated list of:
#
# [PROTOCOL:]HOST[%INTERFACE][:PORT][:SSL]
#
# If many are given, the first available will get the metrics.
#
# PROTOCOL = tcp, udp, or unix (only tcp and unix are supported by parent nodes)
# HOST = an IPv4, IPv6 IP, or a hostname, or a unix domain socket path.
# IPv6 IPs should be given with brackets [ip:address]
# INTERFACE = the network interface to use (only for IPv6)
# PORT = the port number or service name (/etc/services)
# SSL = when this word appear at the end of the destination string
# the Netdata will encrypt the connection with the parent.
#
# This communication is not HTTP (it cannot be proxied by web proxies).
destination =
# Skip Certificate verification?
# The netdata child is configurated to avoid invalid SSL/TLS certificate,
# so certificates that are self-signed or expired will stop the streaming.
# Case the server certificate is not valid, you can enable the use of
# 'bad' certificates setting the next option as 'yes'.
#ssl skip certificate verification = yes
# Certificate Authority Path
# OpenSSL has a default directory where the known certificates are stored.
# In case it is necessary, it is possible to change this rule using the variable
# "CApath", e.g. CApath = /etc/ssl/certs/
#
#CApath =
# Certificate Authority file
# When the Netdata parent has a certificate that is not recognized as valid,
# we can add it to the list of known certificates in "CApath" and give it to
# Netdata as an argument, e.g. CAfile = /etc/ssl/certs/cert.pem
#
#CAfile =
# The API_KEY to use (as the sender)
api key =
# Stream Compression
# The default is enabled
# You can control stream compression in this agent with options: yes | no
#enable compression = yes
# The timeout to connect and send metrics
timeout seconds = 60
# If the destination line above does not specify a port, use this
default port = 19999
# filter the charts to be streamed
# netdata SIMPLE PATTERN:
# - space separated list of patterns (use \ to include spaces in patterns)
# - use * as wildcard, any number of times within each pattern
# - prefix a pattern with ! for a negative match (ie not stream the charts it matches)
# - the order of patterns is important (left to right)
# To send all except a few, use: !this !that * (ie append a wildcard pattern)
send charts matching = *
# The buffer to use for sending metrics.
# 10MB is good for 60 seconds of data, so increase this if you expect latencies.
# The buffer is flushed on reconnects (this will not prevent gaps at the charts).
buffer size bytes = 10485760
# If the connection fails, or it disconnects,
# retry after that many seconds.
reconnect delay seconds = 5
# Sync the clock of the charts for that many iterations, when starting.
# It is ignored when replication is enabled
initial clock resync iterations = 60
# -----------------------------------------------------------------------------
# 2. ON PARENT NETDATA - THE ONE THAT WILL BE RECEIVING METRICS
# You can have one API key per child,
# or the same API key for all child nodes.
#
# netdata searches for options in this order:
#
# a) parent netdata settings (netdata.conf)
# b) [stream] section (above)
# c) [API_KEY] section (below, settings for the API key)
# d) [MACHINE_GUID] section (below, settings for each machine)
#
# You can combine the above (the more specific setting will be used).
# API key authentication
# If the key is not listed here, it will not be able to push metrics.
# [API_KEY] is [YOUR-API-KEY], i.e [11111111-2222-3333-4444-555555555555]
[REDACTED]
# Default settings for this API key
# This GUID is to be used as an API key from remote agents connecting
# to this machine. Failure to match such a key, denies access.
# YOU MUST SET THIS FIELD ON ALL API KEYS.
type = api
# You can disable the API key, by setting this to: no
# The default (for unknown API keys) is: no
enabled = yes
# A list of simple patterns matching the IPs of the servers that
# will be pushing metrics using this API key.
# The metrics are received via the API port, so the same IPs
# should also be matched at netdata.conf [web].allow connections from
allow from = *
# The default history in entries, for all hosts using this API key.
# You can also set it per host below.
# For the default db mode (dbengine), this is ignored.
#default history = 3600
# The default memory mode to be used for all hosts using this API key.
# You can also set it per host below.
# If you don't set it here, the memory mode of netdata.conf will be used.
# Valid modes:
# save save on exit, load on start
# map like swap (continuously syncing to disks - you need SSD)
# ram keep it in RAM, don't touch the disk
# none no database at all (use this on headless proxies)
# dbengine like a traditional database
default memory mode = dbengine
# Shall we enable health monitoring for the hosts using this API key?
# 3 possible values:
# yes enable alarms
# no do not enable alarms
# auto enable alarms, only when the sending netdata is connected.
# Health monitoring will be disabled as soon as the connection is closed.
# You can also set it per host, below.
# The default is taken from [health].enabled of netdata.conf
health enabled by default = yes
# postpone alarms for a short period after the sender is connected
default postpone alarms on connect seconds = 60
# seconds of health log events to keep
default health log history = 604800
# need to route metrics differently? set these.
# the defaults are the ones at the [stream] section (above)
#default proxy enabled = yes | no
#default proxy destination = IP:PORT IP:PORT ...
#default proxy api key = API_KEY
#default proxy send charts matching = *
# Stream Compression
# By default it is enabled.
# You can control stream compression in this parent agent stream with options: yes | no
enable compression = yes
# Replication
# Enable replication for all hosts using this api key. Default: enabled
enable replication = yes
# How many seconds to replicate from each child. Default: a day
#seconds to replicate = 86400
# The duration we want to replicate per each step.
#replication_step = 600
# -----------------------------------------------------------------------------
# 3. PER SENDING HOST SETTINGS, ON PARENT NETDATA
# THIS IS OPTIONAL - YOU DON'T HAVE TO CONFIGURE IT
# This section exists to give you finer control of the parent settings for each
# child host, when the same API key is used by many netdata child nodes / proxies.
#
# Each netdata has a unique GUID - generated the first time netdata starts.
# You can find it at /var/lib/netdata/registry/netdata.public.unique.id
# (at the child).
#
# The host sending data will have one. If the host is not ephemeral,
# you can give settings for each sending host here.
[MACHINE_GUID]
# This GUID is to be used as a MACHINE GUID from remote agents connecting
# to this machine, not an API key.
# YOU MUST SET THIS FIELD ON ALL MACHINE GUIDs.
type = machine
# enable this host: yes | no
# When disabled, the parent will not receive metrics for this host.
# THIS IS NOT A SECURITY MECHANISM - AN ATTACKER CAN SET ANY OTHER GUID.
# Use only the API key for security.
enabled = no
# A list of simple patterns matching the IPs of the servers that
# will be pushing metrics using this MACHINE GUID.
# The metrics are received via the API port, so the same IPs
# should also be matched at netdata.conf [web].allow connections from
# and at stream.conf [API_KEY].allow from
allow from = *
# The number of entries in the database.
# This is ignored for db mode dbengine.
#history = 3600
# The memory mode of the database: save | map | ram | none | dbengine
#memory mode = dbengine
# Health / alarms control: yes | no | auto
#health enabled = auto
# postpone alarms when the sender connects
postpone alarms on connect seconds = 60
# seconds of health log events to keep
#health log history = 432000
# need to route metrics differently?
# the defaults are the ones at the [API KEY] section
#proxy enabled = yes | no
#proxy destination = IP:PORT IP:PORT ...
#proxy api key = API_KEY
#proxy send charts matching = *
# Stream Compression
# By default, enabled.
# You can control stream compression in this parent agent stream with options: yes | no
#enable compression = yes
# Replication
# Enable replication for all hosts using this api key.
#enable replication = yes
# How many seconds to replicate from each child.
#seconds to replicate = 86400
# The duration we want to replicate per each step.
#replication_step = 600
Child stream.conf
# -----------------------------------------------------------------------------
# 1. ON CHILD NETDATA - THE ONE THAT WILL BE SENDING METRICS
[stream]
# Enable this on child nodes, to have them send metrics.
enabled = yes
# Where is the receiving netdata?
# A space separated list of:
#
# [PROTOCOL:]HOST[%INTERFACE][:PORT][:SSL]
#
# If many are given, the first available will get the metrics.
#
# PROTOCOL = tcp, udp, or unix (only tcp and unix are supported by parent nodes)
# HOST = an IPv4, IPv6 IP, or a hostname, or a unix domain socket path.
# IPv6 IPs should be given with brackets [ip:address]
# INTERFACE = the network interface to use (only for IPv6)
# PORT = the port number or service name (/etc/services)
# SSL = when this word appear at the end of the destination string
# the Netdata will encrypt the connection with the parent.
#
# This communication is not HTTP (it cannot be proxied by web proxies).
destination = tcp:REDACTED:19999:SSL
# Skip Certificate verification?
# The netdata child is configurated to avoid invalid SSL/TLS certificate,
# so certificates that are self-signed or expired will stop the streaming.
# Case the server certificate is not valid, you can enable the use of
# 'bad' certificates setting the next option as 'yes'.
ssl skip certificate verification = yes
# Certificate Authority Path
# OpenSSL has a default directory where the known certificates are stored.
# In case it is necessary, it is possible to change this rule using the variable
# "CApath", e.g. CApath = /etc/ssl/certs/
#
#CApath =
# Certificate Authority file
# When the Netdata parent has a certificate that is not recognized as valid,
# we can add it to the list of known certificates in "CApath" and give it to
# Netdata as an argument, e.g. CAfile = /etc/ssl/certs/cert.pem
#
#CAfile =
# The API_KEY to use (as the sender)
api key = REDACTED
# Stream Compression
# The default is enabled
# You can control stream compression in this agent with options: yes | no
enable compression = yes
# The timeout to connect and send metrics
timeout seconds = 60
# If the destination line above does not specify a port, use this
default port = 19999
# filter the charts to be streamed
# netdata SIMPLE PATTERN:
# - space separated list of patterns (use \ to include spaces in patterns)
# - use * as wildcard, any number of times within each pattern
# - prefix a pattern with ! for a negative match (ie not stream the charts it matches)
# - the order of patterns is important (left to right)
# To send all except a few, use: !this !that * (ie append a wildcard pattern)
send charts matching = *
# The buffer to use for sending metrics.
# 10MB is good for 60 seconds of data, so increase this if you expect latencies.
# The buffer is flushed on reconnects (this will not prevent gaps at the charts).
buffer size bytes = 52428800
# If the connection fails, or it disconnects,
# retry after that many seconds.
reconnect delay seconds = 3
# Sync the clock of the charts for that many iterations, when starting.
# It is ignored when replication is enabled
initial clock resync iterations = 60
# -----------------------------------------------------------------------------
# 2. ON PARENT NETDATA - THE ONE THAT WILL BE RECEIVING METRICS
# You can have one API key per child,
# or the same API key for all child nodes.
#
# netdata searches for options in this order:
#
# a) parent netdata settings (netdata.conf)
# b) [stream] section (above)
# c) [API_KEY] section (below, settings for the API key)
# d) [MACHINE_GUID] section (below, settings for each machine)
#
# You can combine the above (the more specific setting will be used).
# API key authentication
# If the key is not listed here, it will not be able to push metrics.
# [API_KEY] is [YOUR-API-KEY], i.e [11111111-2222-3333-4444-555555555555]
[API_KEY]
# Default settings for this API key
# This GUID is to be used as an API key from remote agents connecting
# to this machine. Failure to match such a key, denies access.
# YOU MUST SET THIS FIELD ON ALL API KEYS.
type = api
# You can disable the API key, by setting this to: no
# The default (for unknown API keys) is: no
enabled = no
# A list of simple patterns matching the IPs of the servers that
# will be pushing metrics using this API key.
# The metrics are received via the API port, so the same IPs
# should also be matched at netdata.conf [web].allow connections from
allow from = *
# The default history in entries, for all hosts using this API key.
# You can also set it per host below.
# For the default db mode (dbengine), this is ignored.
#default history = 3600
# The default memory mode to be used for all hosts using this API key.
# You can also set it per host below.
# If you don't set it here, the memory mode of netdata.conf will be used.
# Valid modes:
# save save on exit, load on start
# map like swap (continuously syncing to disks - you need SSD)
# ram keep it in RAM, don't touch the disk
# none no database at all (use this on headless proxies)
# dbengine like a traditional database
#default memory mode = dbengine
# Shall we enable health monitoring for the hosts using this API key?
# 3 possible values:
# yes enable alarms
# no do not enable alarms
# auto enable alarms, only when the sending netdata is connected.
# Health monitoring will be disabled as soon as the connection is closed.
# You can also set it per host, below.
# The default is taken from [health].enabled of netdata.conf
#health enabled by default = auto
# postpone alarms for a short period after the sender is connected
default postpone alarms on connect seconds = 60
# seconds of health log events to keep
#default health log history = 432000
# need to route metrics differently? set these.
# the defaults are the ones at the [stream] section (above)
#default proxy enabled = yes | no
#default proxy destination = IP:PORT IP:PORT ...
#default proxy api key = API_KEY
#default proxy send charts matching = *
# Stream Compression
# By default it is enabled.
# You can control stream compression in this parent agent stream with options: yes | no
#enable compression = yes
# Replication
# Enable replication for all hosts using this api key. Default: enabled
enable replication = yes
# How many seconds to replicate from each child. Default: a day
seconds to replicate = 86400
# The duration we want to replicate per each step.
replication_step = 600
# -----------------------------------------------------------------------------
# 3. PER SENDING HOST SETTINGS, ON PARENT NETDATA
# THIS IS OPTIONAL - YOU DON'T HAVE TO CONFIGURE IT
# This section exists to give you finer control of the parent settings for each
# child host, when the same API key is used by many netdata child nodes / proxies.
#
# Each netdata has a unique GUID - generated the first time netdata starts.
# You can find it at /var/lib/netdata/registry/netdata.public.unique.id
# (at the child).
#
# The host sending data will have one. If the host is not ephemeral,
# you can give settings for each sending host here.
[MACHINE_GUID]
# This GUID is to be used as a MACHINE GUID from remote agents connecting
# to this machine, not an API key.
# YOU MUST SET THIS FIELD ON ALL MACHINE GUIDs.
type = machine
# enable this host: yes | no
# When disabled, the parent will not receive metrics for this host.
# THIS IS NOT A SECURITY MECHANISM - AN ATTACKER CAN SET ANY OTHER GUID.
# Use only the API key for security.
enabled = no
# A list of simple patterns matching the IPs of the servers that
# will be pushing metrics using this MACHINE GUID.
# The metrics are received via the API port, so the same IPs
# should also be matched at netdata.conf [web].allow connections from
# and at stream.conf [API_KEY].allow from
allow from = *
# The number of entries in the database.
# This is ignored for db mode dbengine.
#history = 3600
# The memory mode of the database: save | map | ram | none | dbengine
#memory mode = dbengine
# Health / alarms control: yes | no | auto
#health enabled = auto
# postpone alarms when the sender connects
postpone alarms on connect seconds = 60
# seconds of health log events to keep
#health log history = 432000
# need to route metrics differently?
# the defaults are the ones at the [API KEY] section
#proxy enabled = yes | no
#proxy destination = IP:PORT IP:PORT ...
#proxy api key = API_KEY
#proxy send charts matching = *
# Stream Compression
# By default, enabled.
# You can control stream compression in this parent agent stream with options: yes | no
#enable compression = yes
# Replication
# Enable replication for all hosts using this api key.
#enable replication = yes
# How many seconds to replicate from each child.
#seconds to replicate = 86400
# The duration we want to replicate per each step.
#replication_step = 600
Environment/Browser/Agent’s version etc
- x86_64 Linux – primarily Alma 9, Debian, and Proxmox – all fully updated
- netdata v1.42.0-217-nightly on all nodes, parent and children
What I expected to happen
All metrics are streamed consistently to the parent