We’re running a parent-child setup in which roughly 20 children stream metrics to a parent node, with the collection interval raised from 1 to 15 seconds. The connection goes over the public internet, and traffic is encrypted using a self-signed certificate. There is a mix of IPv4 and IPv6 connections: the parent node accepts both, but some older nodes still use only IPv4, while all newer ones have both IPv4 and IPv6 enabled, in case this is relevant.
I added three new servers in early March. Two of them are unable to connect to the parent node, both with the same message:
2023-04-05 05:46:45: netdata INFO : SNDR[new-child-02] : STREAM new-child-02: connecting to 'parent:19999' (default port: 19999)...
2023-04-05 05:46:45: netdata ERROR : SNDR[new-child-02] : STREAM new-child-02 [send to parent:19999]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-04-05 05:47:45
On the parent node, I don’t see any indication from the logs that those nodes even tried to connect.
The third one is able to send metrics, but notes this in the logs:
2023-04-05 05:25:42: netdata ERROR : SNDR[new-child-03] : SSL_write() returned 0 bytes, SSL error 5
The message repeats a few times before the log protection kicks in.
I did not notice the issue until recently, and the first thing I did was update our installation (1.35.1 → 1.38.1, installed via apt).
Setup-wise, the two child nodes that are unable to connect should be identical to the one node that can connect, at least as far as I’m aware.
What I tried so far:
- Disable SSL and remove the certificate entirely from the parent. Then the nodes are able to connect, but this is not a good solution in our case, as the metrics are submitted over a public connection.
- Rotate the certificate. The nodes are unable to connect.
- Apply what has been suggested here, nodes are still unable to connect.
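In case it helps narrow this down, a check along the following lines could show whether the parent completes a TLS handshake on each address family the child resolves for it, which seems worth ruling out given the IPv4/IPv6 mix. This is only a minimal sketch and not part of Netdata: the names parse_destination and probe are made up here, it assumes the simple host:port:SSL destination form from the child configuration, and it disables certificate verification to mirror the "ssl skip certificate verification = yes" setting.

```python
import socket
import ssl


def parse_destination(dest, default_port=19999):
    """Split a 'host[:port][:SSL]' destination string (hypothetical helper).

    Only the simple form used in this post is handled; bracketed IPv6
    literals and protocol prefixes are not.
    """
    parts = dest.split(":")
    use_tls = parts[-1].upper() == "SSL"
    if use_tls:
        parts = parts[:-1]
    host = parts[0]
    port = int(parts[1]) if len(parts) > 1 else default_port
    return host, port, use_tls


def probe(dest, timeout=5.0):
    """Attempt a connection (and TLS handshake) to every resolved address."""
    host, port, use_tls = parse_destination(dest)
    # Verification is turned off on purpose, matching the child's
    # 'ssl skip certificate verification = yes' setting.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _, addr in infos:
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as raw:
                raw.settimeout(timeout)
                raw.connect(addr)
                if use_tls:
                    with ctx.wrap_socket(raw, server_hostname=host) as tls:
                        results.append((label, addr[0], tls.version()))
                else:
                    results.append((label, addr[0], "plain TCP ok"))
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            results.append((label, addr[0], f"failed: {exc}"))
    return results
```

Running probe("parent-node:19999:SSL") on one of the failing children and on new-child-03 would give one result row per resolved address, so a difference between the IPv4 and IPv6 rows, or between the two hosts, would stand out.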
Child streaming configuration:
[stream]
enabled = yes
destination = parent-node:19999:SSL
api key = 8D6C6CD7-0B8E-4860-B6E6-78CB27D7A344
CAfile = /etc/netdata/ssl/cert.pem
timeout seconds = 60
ssl skip certificate verification = yes
reconnect delay seconds = 60
initial clock resync iterations = 60
Parent stream configuration:
[8D6C6CD7-0B8E-4860-B6E6-78CB27D7A344]
enabled = yes
default memory mode = dbengine
health enabled by default = auto
Child systems:
Linux new-child-02 5.10.0-21-cloud-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
/etc/cloud-release:ID="genericcloud"
/etc/cloud-release:VERSION="20221219-1234"
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="11"
/etc/os-release:VERSION="11 (bullseye)"
/etc/os-release:VERSION_CODENAME=bullseye
/etc/os-release:ID=debian
Parent system:
Linux netdata-production 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 10 (buster)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="10"
/etc/os-release:VERSION="10 (buster)"
/etc/os-release:VERSION_CODENAME=buster
/etc/os-release:ID=debian
Help is much appreciated!