Children disappearing and reappearing in parent node

Problem/Question

I’m trying to set up a local-only monitoring stack with a central Netdata parent that receives metrics streamed from its children and provides a web interface for itself and for them.

My children are frequently (and seemingly randomly) disappearing from the parent node. When they reappear, if they reappear, it’s often with only a few metrics.

Relevant docs you followed/actions you took to solve the issue

  • Installed netdata with the kickstart script

  • Configured TLS with a self-signed cert on the parent node, following the documentation

  • Configured streaming on the parent node and the children, following the documentation

  • After noticing issues, I began tweaking the dbengine backfill and size settings to try to eliminate the problem.

Parent netdata.conf

[global]
    run as user = netdata

    # default storage size - increase for longer data retention
    page cache size = 32
    dbengine multihost disk space = 256

[web]
    ssl key = /etc/netdata/ssl/key.pem
    ssl certificate = /etc/netdata/ssl/cert.pem

[db]
    replication threads = 3

    mode = dbengine
    
    # per second data collection
    update every = 1
    
    # number of tiers used (1 to 5, 3 being default)
    storage tiers = 3
    
    # Tier 0, per second data
    dbengine multihost disk space MB = 4096
    
    # Tier 1, per minute data
    dbengine tier 1 multihost disk space MB = 8192
    dbengine tier 1 update every iterations = 60
    
    # Tier 2, per hour data
    dbengine tier 2 multihost disk space MB = 8192
    dbengine tier 2 update every iterations = 60

    dbengine page cache size MB = 4096

    dbengine extent cache size MB = 64
    dbengine enable journal integrity check = yes
    dbengine disk space MB = 8192
    memory deduplication (ksm) = yes
    # cleanup obsolete charts after secs = 3600
    # gap when lost iterations above = 1
    # enable replication = yes
    # seconds to replicate = 86400
    # seconds per replication step = 600
    cleanup orphan hosts after secs = 36000000
    # dbengine use direct io = yes
    # dbengine pages per extent = 64
    dbengine parallel initialization = yes
    # dbengine tier 1 backfill = new
    # dbengine tier 2 backfill = new
    # delete obsolete charts files = yes
    delete orphan hosts files = no
    enable zero metrics = yes

Parent stream.conf

# -----------------------------------------------------------------------------
# 1. ON CHILD NETDATA - THE ONE THAT WILL BE SENDING METRICS

[stream]
    # Enable this on child nodes, to have them send metrics.
    enabled = no

    # Where is the receiving netdata?
    # A space separated list of:
    #
    #      [PROTOCOL:]HOST[%INTERFACE][:PORT][:SSL]
    #
    # If many are given, the first available will get the metrics.
    #
    # PROTOCOL  = tcp, udp, or unix (only tcp and unix are supported by parent nodes)
    # HOST      = an IPv4, IPv6 IP, or a hostname, or a unix domain socket path.
    #             IPv6 IPs should be given with brackets [ip:address]
    # INTERFACE = the network interface to use (only for IPv6)
    # PORT      = the port number or service name (/etc/services)
    # SSL       = when this word appear at the end of the destination string
    #             the Netdata will encrypt the connection with the parent.
    #
    # This communication is not HTTP (it cannot be proxied by web proxies).
    destination =

    # Skip Certificate verification?
    # The netdata child is configurated to avoid invalid SSL/TLS certificate,
    # so certificates that are self-signed or expired will stop the streaming.
    # Case the server certificate is not valid, you can enable the use of
    # 'bad' certificates setting the next option as 'yes'.
    #ssl skip certificate verification = yes

    # Certificate Authority Path
    # OpenSSL has a default directory where the known certificates are stored.
    # In case it is necessary, it is possible to change this rule using the variable
    # "CApath", e.g. CApath = /etc/ssl/certs/
    #
    #CApath =

    # Certificate Authority file
    # When the Netdata parent has a certificate that is not recognized as valid,
    # we can add it to the list of known certificates in "CApath" and give it to
    # Netdata as an argument, e.g. CAfile = /etc/ssl/certs/cert.pem
    #
    #CAfile =

    # The API_KEY to use (as the sender)
    api key =

    # Stream Compression
    # The default is enabled
    # You can control stream compression in this agent with options: yes | no
    #enable compression = yes

    # The timeout to connect and send metrics
    timeout seconds = 60

    # If the destination line above does not specify a port, use this
    default port = 19999

    # filter the charts to be streamed
    # netdata SIMPLE PATTERN:
    # - space separated list of patterns (use \ to include spaces in patterns)
    # - use * as wildcard, any number of times within each pattern
    # - prefix a pattern with ! for a negative match (ie not stream the charts it matches)
    # - the order of patterns is important (left to right)
    # To send all except a few, use: !this !that *   (ie append a wildcard pattern)
    send charts matching = *

    # The buffer to use for sending metrics.
    # 10MB is good for 60 seconds of data, so increase this if you expect latencies.
    # The buffer is flushed on reconnects (this will not prevent gaps at the charts).
    buffer size bytes = 10485760

    # If the connection fails, or it disconnects,
    # retry after that many seconds.
    reconnect delay seconds = 5

    # Sync the clock of the charts for that many iterations, when starting.
    # It is ignored when replication is enabled
    initial clock resync iterations = 60

# -----------------------------------------------------------------------------
# 2. ON PARENT NETDATA - THE ONE THAT WILL BE RECEIVING METRICS

#    You can have one API key per child,
#         or the same API key for all child nodes.
#
#    netdata searches for options in this order:
#
#    a) parent netdata settings (netdata.conf)
#    b) [stream] section        (above)
#    c) [API_KEY] section       (below, settings for the API key)
#    d) [MACHINE_GUID] section  (below, settings for each machine)
#
#    You can combine the above (the more specific setting will be used).

# API key authentication
# If the key is not listed here, it will not be able to push metrics.

# [API_KEY] is [YOUR-API-KEY], i.e [11111111-2222-3333-4444-555555555555]
[REDACTED]
    # Default settings for this API key

    # This GUID is to be used as an API key from remote agents connecting
    # to this machine. Failure to match such a key, denies access.
    # YOU MUST SET THIS FIELD ON ALL API KEYS.
    type = api

    # You can disable the API key, by setting this to: no
    # The default (for unknown API keys) is: no
    enabled = yes

    # A list of simple patterns matching the IPs of the servers that
    # will be pushing metrics using this API key.
    # The metrics are received via the API port, so the same IPs
    # should also be matched at netdata.conf [web].allow connections from
    allow from = *

    # The default history in entries, for all hosts using this API key.
    # You can also set it per host below.
    # For the default db mode (dbengine), this is ignored.
    #default history = 3600

    # The default memory mode to be used for all hosts using this API key.
    # You can also set it per host below.
    # If you don't set it here, the memory mode of netdata.conf will be used.
    # Valid modes:
    #    save     save on exit, load on start
    #    map      like swap (continuously syncing to disks - you need SSD)
    #    ram      keep it in RAM, don't touch the disk
    #    none     no database at all (use this on headless proxies)
    #    dbengine like a traditional database
    default memory mode = dbengine

    # Shall we enable health monitoring for the hosts using this API key?
    # 3 possible values:
    #    yes     enable alarms
    #    no      do not enable alarms
    #    auto    enable alarms, only when the sending netdata is connected.
    #            Health monitoring will be disabled as soon as the connection is closed.
    # You can also set it per host, below.
    # The default is taken from [health].enabled of netdata.conf
    health enabled by default = yes

    # postpone alarms for a short period after the sender is connected
    default postpone alarms on connect seconds = 60

    # seconds of health log events to keep
    default health log history = 604800

    # need to route metrics differently? set these.
    # the defaults are the ones at the [stream] section (above)
    #default proxy enabled = yes | no
    #default proxy destination = IP:PORT IP:PORT ...
    #default proxy api key = API_KEY
    #default proxy send charts matching = *

    # Stream Compression
    # By default it is enabled.
    # You can control stream compression in this parent agent stream with options: yes | no
    enable compression = yes

    # Replication
    # Enable replication for all hosts using this api key. Default: enabled
    enable replication = yes

    # How many seconds to replicate from each child. Default: a day
    #seconds to replicate = 86400

    # The duration we want to replicate per each step.
    #replication_step = 600

# -----------------------------------------------------------------------------
# 3. PER SENDING HOST SETTINGS, ON PARENT NETDATA
#    THIS IS OPTIONAL - YOU DON'T HAVE TO CONFIGURE IT

# This section exists to give you finer control of the parent settings for each
# child host, when the same API key is used by many netdata child nodes / proxies.
#
# Each netdata has a unique GUID - generated the first time netdata starts.
# You can find it at /var/lib/netdata/registry/netdata.public.unique.id
# (at the child).
#
# The host sending data will have one. If the host is not ephemeral,
# you can give settings for each sending host here.

[MACHINE_GUID]
    # This GUID is to be used as a MACHINE GUID from remote agents connecting
    # to this machine, not an API key.
    # YOU MUST SET THIS FIELD ON ALL MACHINE GUIDs.
    type = machine

    # enable this host: yes | no
    # When disabled, the parent will not receive metrics for this host.
    # THIS IS NOT A SECURITY MECHANISM - AN ATTACKER CAN SET ANY OTHER GUID.
    # Use only the API key for security.
    enabled = no

    # A list of simple patterns matching the IPs of the servers that
    # will be pushing metrics using this MACHINE GUID.
    # The metrics are received via the API port, so the same IPs
    # should also be matched at netdata.conf [web].allow connections from
    # and at stream.conf [API_KEY].allow from
    allow from = *

    # The number of entries in the database.
    # This is ignored for db mode dbengine.
    #history = 3600

    # The memory mode of the database: save | map | ram | none | dbengine
    #memory mode = dbengine

    # Health / alarms control: yes | no | auto
    #health enabled = auto

    # postpone alarms when the sender connects
    postpone alarms on connect seconds = 60

    # seconds of health log events to keep
    #health log history = 432000

    # need to route metrics differently?
    # the defaults are the ones at the [API KEY] section
    #proxy enabled = yes | no
    #proxy destination = IP:PORT IP:PORT ...
    #proxy api key = API_KEY
    #proxy send charts matching = *

    # Stream Compression
    # By default, enabled.
    # You can control stream compression in this parent agent stream with options: yes | no
    #enable compression = yes

    # Replication
    # Enable replication for all hosts using this api key.
    #enable replication = yes

    # How many seconds to replicate from each child.
    #seconds to replicate = 86400

    # The duration we want to replicate per each step.
    #replication_step = 600

Child stream.conf

# -----------------------------------------------------------------------------
# 1. ON CHILD NETDATA - THE ONE THAT WILL BE SENDING METRICS

[stream]
    # Enable this on child nodes, to have them send metrics.
    enabled = yes

    # Where is the receiving netdata?
    # A space separated list of:
    #
    #      [PROTOCOL:]HOST[%INTERFACE][:PORT][:SSL]
    #
    # If many are given, the first available will get the metrics.
    #
    # PROTOCOL  = tcp, udp, or unix (only tcp and unix are supported by parent nodes)
    # HOST      = an IPv4, IPv6 IP, or a hostname, or a unix domain socket path.
    #             IPv6 IPs should be given with brackets [ip:address]
    # INTERFACE = the network interface to use (only for IPv6)
    # PORT      = the port number or service name (/etc/services)
    # SSL       = when this word appear at the end of the destination string
    #             the Netdata will encrypt the connection with the parent.
    #
    # This communication is not HTTP (it cannot be proxied by web proxies).
    destination = tcp:REDACTED:19999:SSL

    # Skip Certificate verification?
    # The netdata child is configurated to avoid invalid SSL/TLS certificate,
    # so certificates that are self-signed or expired will stop the streaming.
    # Case the server certificate is not valid, you can enable the use of
    # 'bad' certificates setting the next option as 'yes'.
    ssl skip certificate verification = yes

    # Certificate Authority Path
    # OpenSSL has a default directory where the known certificates are stored.
    # In case it is necessary, it is possible to change this rule using the variable
    # "CApath", e.g. CApath = /etc/ssl/certs/
    #
    #CApath =

    # Certificate Authority file
    # When the Netdata parent has a certificate that is not recognized as valid,
    # we can add it to the list of known certificates in "CApath" and give it to
    # Netdata as an argument, e.g. CAfile = /etc/ssl/certs/cert.pem
    #
    #CAfile =

    # The API_KEY to use (as the sender)
    api key = REDACTED

    # Stream Compression
    # The default is enabled
    # You can control stream compression in this agent with options: yes | no
    enable compression = yes

    # The timeout to connect and send metrics
    timeout seconds = 60

    # If the destination line above does not specify a port, use this
    default port = 19999

    # filter the charts to be streamed
    # netdata SIMPLE PATTERN:
    # - space separated list of patterns (use \ to include spaces in patterns)
    # - use * as wildcard, any number of times within each pattern
    # - prefix a pattern with ! for a negative match (ie not stream the charts it matches)
    # - the order of patterns is important (left to right)
    # To send all except a few, use: !this !that *   (ie append a wildcard pattern)
    send charts matching = *

    # The buffer to use for sending metrics.
    # 10MB is good for 60 seconds of data, so increase this if you expect latencies.
    # The buffer is flushed on reconnects (this will not prevent gaps at the charts).
    buffer size bytes = 52428800

    # If the connection fails, or it disconnects,
    # retry after that many seconds.
    reconnect delay seconds = 3

    # Sync the clock of the charts for that many iterations, when starting.
    # It is ignored when replication is enabled
    initial clock resync iterations = 60

# -----------------------------------------------------------------------------
# 2. ON PARENT NETDATA - THE ONE THAT WILL BE RECEIVING METRICS

#    You can have one API key per child,
#         or the same API key for all child nodes.
#
#    netdata searches for options in this order:
#
#    a) parent netdata settings (netdata.conf)
#    b) [stream] section        (above)
#    c) [API_KEY] section       (below, settings for the API key)
#    d) [MACHINE_GUID] section  (below, settings for each machine)
#
#    You can combine the above (the more specific setting will be used).

# API key authentication
# If the key is not listed here, it will not be able to push metrics.

# [API_KEY] is [YOUR-API-KEY], i.e [11111111-2222-3333-4444-555555555555]
[API_KEY]
    # Default settings for this API key

    # This GUID is to be used as an API key from remote agents connecting
    # to this machine. Failure to match such a key, denies access.
    # YOU MUST SET THIS FIELD ON ALL API KEYS.
    type = api

    # You can disable the API key, by setting this to: no
    # The default (for unknown API keys) is: no
    enabled = no

    # A list of simple patterns matching the IPs of the servers that
    # will be pushing metrics using this API key.
    # The metrics are received via the API port, so the same IPs
    # should also be matched at netdata.conf [web].allow connections from
    allow from = *

    # The default history in entries, for all hosts using this API key.
    # You can also set it per host below.
    # For the default db mode (dbengine), this is ignored.
    #default history = 3600

    # The default memory mode to be used for all hosts using this API key.
    # You can also set it per host below.
    # If you don't set it here, the memory mode of netdata.conf will be used.
    # Valid modes:
    #    save     save on exit, load on start
    #    map      like swap (continuously syncing to disks - you need SSD)
    #    ram      keep it in RAM, don't touch the disk
    #    none     no database at all (use this on headless proxies)
    #    dbengine like a traditional database
    #default memory mode = dbengine

    # Shall we enable health monitoring for the hosts using this API key?
    # 3 possible values:
    #    yes     enable alarms
    #    no      do not enable alarms
    #    auto    enable alarms, only when the sending netdata is connected.
    #            Health monitoring will be disabled as soon as the connection is closed.
    # You can also set it per host, below.
    # The default is taken from [health].enabled of netdata.conf
    #health enabled by default = auto

    # postpone alarms for a short period after the sender is connected
    default postpone alarms on connect seconds = 60

    # seconds of health log events to keep
    #default health log history = 432000

    # need to route metrics differently? set these.
    # the defaults are the ones at the [stream] section (above)
    #default proxy enabled = yes | no
    #default proxy destination = IP:PORT IP:PORT ...
    #default proxy api key = API_KEY
    #default proxy send charts matching = *

    # Stream Compression
    # By default it is enabled.
    # You can control stream compression in this parent agent stream with options: yes | no
    #enable compression = yes

    # Replication
    # Enable replication for all hosts using this api key. Default: enabled
    enable replication = yes

    # How many seconds to replicate from each child. Default: a day
    seconds to replicate = 86400

    # The duration we want to replicate per each step.
    replication_step = 600

# -----------------------------------------------------------------------------
# 3. PER SENDING HOST SETTINGS, ON PARENT NETDATA
#    THIS IS OPTIONAL - YOU DON'T HAVE TO CONFIGURE IT

# This section exists to give you finer control of the parent settings for each
# child host, when the same API key is used by many netdata child nodes / proxies.
#
# Each netdata has a unique GUID - generated the first time netdata starts.
# You can find it at /var/lib/netdata/registry/netdata.public.unique.id
# (at the child).
#
# The host sending data will have one. If the host is not ephemeral,
# you can give settings for each sending host here.

[MACHINE_GUID]
    # This GUID is to be used as a MACHINE GUID from remote agents connecting
    # to this machine, not an API key.
    # YOU MUST SET THIS FIELD ON ALL MACHINE GUIDs.
    type = machine

    # enable this host: yes | no
    # When disabled, the parent will not receive metrics for this host.
    # THIS IS NOT A SECURITY MECHANISM - AN ATTACKER CAN SET ANY OTHER GUID.
    # Use only the API key for security.
    enabled = no

    # A list of simple patterns matching the IPs of the servers that
    # will be pushing metrics using this MACHINE GUID.
    # The metrics are received via the API port, so the same IPs
    # should also be matched at netdata.conf [web].allow connections from
    # and at stream.conf [API_KEY].allow from
    allow from = *

    # The number of entries in the database.
    # This is ignored for db mode dbengine.
    #history = 3600

    # The memory mode of the database: save | map | ram | none | dbengine
    #memory mode = dbengine

    # Health / alarms control: yes | no | auto
    #health enabled = auto

    # postpone alarms when the sender connects
    postpone alarms on connect seconds = 60

    # seconds of health log events to keep
    #health log history = 432000

    # need to route metrics differently?
    # the defaults are the ones at the [API KEY] section
    #proxy enabled = yes | no
    #proxy destination = IP:PORT IP:PORT ...
    #proxy api key = API_KEY
    #proxy send charts matching = *

    # Stream Compression
    # By default, enabled.
    # You can control stream compression in this parent agent stream with options: yes | no
    #enable compression = yes

    # Replication
    # Enable replication for all hosts using this api key.
    #enable replication = yes

    # How many seconds to replicate from each child.
    #seconds to replicate = 86400

    # The duration we want to replicate per each step.
    #replication_step = 600

Environment/Browser/Agent’s version etc

  • x86_64 Linux – primarily Alma 9, Debian, and Proxmox – all fully updated
  • netdata v1.42.0-217-nightly on all nodes, parent and children

What I expected to happen

All metrics are streamed consistently to the parent.

An additional note that I failed to mention: on the children’s local dashboards, all the metrics appear as they should, so they’re being collected and recorded correctly by each local node.

Hi @StarkZarn

A few first questions: How is the link between them? Are you on a local network? Is it possible to try without SSL?

The error.log on the children and on the parent should provide some clues about what’s going on. Typically, small gaps should be refilled when the connection is re-established.

If you wish, please forward the error.logs to manolis@netdata.cloud

Thanks!

error.log unfortunately didn’t mention any issues with streaming.

I did try without SSL for streaming and ultimately ran into the same thing.
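For anyone repeating that test: going by the destination syntax documented in the stream.conf comments above, streaming without SSL should only require dropping the trailing :SSL from the child’s destination. A minimal sketch, assuming the rest of the child’s [stream] section stays as posted (host and API key remain redacted):

```conf
# child stream.conf (sketch; host and key redacted as in the original post)
[stream]
    enabled = yes
    # same destination as before, minus the trailing :SSL
    destination = tcp:REDACTED:19999
    api key = REDACTED
```

With no :SSL suffix, the "ssl skip certificate verification" line should no longer matter for this connection.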

However, I’m happy to report that now everything appears to be working correctly. Unfortunately I have no idea what changed. The configuration is as I posted it before for the parent. The children have since been changed to have health and ml disabled, but the dbengine and stream config are the same.
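The exact lines changed on the children were not posted. As a rough sketch, disabling health and ML on a child would usually look something like the following in the child’s netdata.conf, assuming the standard [health] and [ml] sections:

```conf
# child netdata.conf (sketch; not the actual file from this thread)
[health]
    # turn off alarm evaluation on the child
    enabled = no

[ml]
    # turn off machine-learning anomaly detection on the child
    enabled = no
```

Since the parent’s stream.conf above sets "health enabled by default = yes" for the API key, the parent would presumably still evaluate alarms for the streamed hosts.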

I don’t like problems that solve themselves, but I’ll happily take the working product. Thanks!

Same here 🙂

Do let us know if the issue re-appears at any time.