Children unable to connect to parent node

We’re running a parent-child setup: roughly 20 children stream metrics to a parent node, with the collection interval changed from 1 second to 15 seconds. The connection goes over the public internet, and traffic is encrypted using a self-signed certificate. There is a mix of IPv4 and IPv6 connections: the parent node accepts both, some older nodes still use only IPv4, and all newer ones have both IPv4 and IPv6 enabled, in case this is relevant.

I added three new servers in early March. Two of them are unable to connect to the parent node, both with the same message:

2023-04-05 05:46:45: netdata INFO  : SNDR[new-child-02] : STREAM new-child-02: connecting to 'parent:19999' (default port: 19999)...
2023-04-05 05:46:45: netdata ERROR : SNDR[new-child-02] : STREAM new-child-02 [send to parent:19999]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-04-05 05:47:45

On the parent node, I don’t see any indication from the logs that those nodes even tried to connect.
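
One way to narrow this down from a child is to check what actually answers on the parent's streaming port. The following is a minimal sketch, not a Netdata tool: `probe_tls` is a hypothetical helper, and certificate verification is disabled on purpose because the setup uses a self-signed certificate with `ssl skip certificate verification = yes`.

```python
import socket
import ssl


def probe_tls(host: str, port: int, timeout: float = 5.0) -> str:
    """Return 'tls', 'not-tls', or 'unreachable' for host:port.

    'unreachable' covers refused, timed-out, and reset connections.
    """
    # Assumption: verification is skipped, as in the child configuration.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            try:
                # The TLS handshake happens inside wrap_socket().
                with context.wrap_socket(raw, server_hostname=host):
                    return "tls"
            except ssl.SSLError:
                # Something answered, but not with a TLS handshake.
                return "not-tls"
    except OSError:
        return "unreachable"
```

Calling `probe_tls("parent", 19999)` from a failing child and from the working one would show whether both reach a TLS endpoint at all. The host name here is the placeholder used in the logs.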

The third one is able to send metrics, but notes this in the logs:

2023-04-05 05:25:42: netdata ERROR : SNDR[new-child-03] : SSL_write() returned 0 bytes, SSL error 5

The message repeats a few times before the log protection kicks in.

I did not notice the issue earlier. Once I did, the first thing I tried was updating our installation (1.35.1 → 1.38.1, installed via apt).

Setup-wise, the two child nodes that are unable to connect should be identical to the one node that can connect, at least as far as I’m aware.

What I tried so far:

  • Disable SSL and remove the certificate entirely from the parent. The nodes are then able to connect, but this is not a good solution in our case, as the metrics are submitted over a public connection.
  • Rotate the certificate. The nodes are still unable to connect.
  • Apply what has been suggested here. The nodes are still unable to connect.

Child streaming configuration:

    enabled = yes
    destination = parent-node:19999:SSL
    api key = 8D6C6CD7-0B8E-4860-B6E6-78CB27D7A344
    CAfile = /etc/netdata/ssl/cert.pem
    timeout seconds = 60
    ssl skip certificate verification = yes
    reconnect delay seconds = 60
    initial clock resync iterations = 60
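
For reference, the `destination` value above packs a host, a port, and an `SSL` flag into one colon-separated string. The sketch below shows one way such a value can be split. It is an illustration only and not Netdata's parser: `parse_destination` and `Destination` are hypothetical names, it handles only the `host[:port][:SSL]` form, and it does not cover bare IPv6 addresses. The default port 19999 is taken from the log line above.

```python
from typing import NamedTuple


class Destination(NamedTuple):
    host: str
    port: int
    use_ssl: bool


def parse_destination(value: str, default_port: int = 19999) -> Destination:
    """Split 'host[:port][:SSL]' into its parts (hypothetical helper)."""
    parts = value.strip().split(":")
    use_ssl = False
    # A trailing 'SSL' token is a flag, not part of the port.
    if len(parts) > 1 and parts[-1].upper() == "SSL":
        use_ssl = True
        parts.pop()
    port = default_port
    # A trailing all-digit token is the port.
    if len(parts) > 1 and parts[-1].isdigit():
        port = int(parts.pop())
    host = ":".join(parts)
    return Destination(host, port, use_ssl)


print(parse_destination("parent-node:19999:SSL"))
```

The point of the sketch is that the `SSL` suffix has to be removed before the port is handed to the resolver.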

Parent stream configuration:

    enabled = yes
    default memory mode = dbengine
    health enabled by default = auto

Child systems:

Linux new-child-02 5.10.0-21-cloud-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION="11 (bullseye)"

Parent system:

Linux netdata-production 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 10 (buster)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION="10 (buster)"

Help is much appreciated! 🙂

Hi, @simpliandy. Try the nightly version; I believe there were some fixes regarding streaming with SSL and high RTT.

If it doesn’t help, I suggest:

  • creating an issue.
  • as a temporary workaround, creating a mesh VPN between your nodes and streaming over it without SSL.

Hi @ilyam8,

thanks for the response. I updated both the parent and child nodes to the latest nightly available on PackageCloud. This left me with a new error message on the child nodes, even on existing nodes which worked fine previously:

2023-04-11 09:30:14: netdata INFO : STREAM_SENDER[existing-child] : STREAM existing-child [send to parent:19999:SSL]: connecting...

2023-04-11 09:30:14: netdata ERROR : STREAM_SENDER[existing-child] : Cannot resolve host 'parent', port '19999:SSL': Servname not supported for ai_socktype

2023-04-11 09:30:14: netdata ERROR : STREAM_SENDER[existing-child] : STREAM existing-child [send to parent:19999:SSL]: failed to connect

However, what works now is keeping all children on Netdata v1.38.1 while the parent runs the latest nightly. This also allows the two nodes I mentioned in my initial post to connect.

What do you suggest as next steps? Should I open a ticket for the new error message?
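
A note on the error text: "Servname not supported for ai_socktype" is the message glibc's `getaddrinfo()` gives when the service (port) argument is neither a number nor a known service name. The log shows the port as `'19999:SSL'`, which suggests the `:SSL` suffix reached the resolver as part of the port. The sketch below reproduces that class of failure with Python's `socket.getaddrinfo`. `service_resolves` is a hypothetical helper, and a numeric host is used so that no DNS lookup is involved.

```python
import socket


def service_resolves(host: str, service: str) -> bool:
    """Return True if getaddrinfo() accepts the host/service pair."""
    try:
        socket.getaddrinfo(
            host,
            service,
            type=socket.SOCK_STREAM,
            flags=socket.AI_NUMERICHOST,  # numeric host only, no DNS query
        )
    except socket.gaierror:
        # The exact error code and message depend on the platform's libc.
        return False
    return True


print(service_resolves("127.0.0.1", "19999"))
print(service_resolves("127.0.0.1", "19999:SSL"))
```

A plain numeric port is accepted, while the same port with the suffix still attached is rejected before any connection is attempted.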

This is a bug. Please create a bug report by following the link @ilyam8 posted above.

Hi @simpliandy! Did you resolve the issue? If you opened a bug report, can you link it, please?

Hi guys, sorry for the lack of feedback. Couple of things on my desk right now.

So, as described, I’m still running the parent node on a nightly version while leaving the children on the latest stable release. I will double-check this week whether the issue still occurs with the latest nightly and will open a bug report if so.