Netdata crashing when streaming is enabled because response is not understood

Hi everyone

Problem/Question

I installed netdata on two servers. One for saving the data and one collecting the data. Starting last Thursday the collector stopped working. Restarting the service or the server doesn’t work.
While debugging I saw that this only happens when streaming is enabled (stream.conf [stream] enabled = yes). Because of this is searched for connections specific errors. I filtered the log files for the section I think should contain the information to fix the issue.

Relevant docs you followed/actions you took to solve the issue

I did a clean reinstall with the kickstart script. (Install Netdata with kickstart.sh | Learn Netdata)
I did a purge of netdata (following this Uninstalling Netdata and all traces - Ubuntu Server )and a clean reinstall.
I tried to find something in the error log to debug and googled a lot but I could not find what exactly is leading to the agent crashing.

Environment/Browser/Agent’s version etc

Version:

$ netdata -v
netdata v1.37.1

netdata.conf

[global]
    run as user = netdata
    update every = 5
    hostname = Ganymede

[db]
    mode = none

[web]
    mode = none

stream.conf (removed commented out stuff)

[stream]
    enabled = yes
    destination = 192.168.9.1:19999
    api key = "*****"
    timeout seconds = 60
    default port = 19999
    send charts matching = *
    buffer size bytes = 10485760
    reconnect delay seconds = 5
    initial clock resync iterations = 60

Error.log

type or paste code here

Access log (Data Host)

2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'CONNECTED'
2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'DISCONNECTED'
2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'STREAM' (sent/all = 0/0 bytes -0%, prep/sent/total = 1673875045005.65/1673875045005.84/0.20 ms) 200 'key=****&hostname=Ganymede&registry_hostname=Ganymede&machine_guid=1cb11548-95a0-11ed-936c-0242e928a3df&update_every=5&os=linux&timezone=Europe/Berlin&abbrev_timezone=CET&utc_offset=3600&hops=1&ml_capable=1&ml_enabled=1&mc_version=1&tags=&ver=16376&NETDATA_INSTANCE_CLOUD_TYPE=unknown&NETDATA_INSTANCE_CLOUD_INSTANCE_TYPE=unknown&NETDATA_INSTANCE_CLOUD_INSTANCE_REGION=unknown&NETDATA_SYSTEM_OS_NAME=unknown&NETDATA_SYSTEM_OS_ID=unknown&NETDATA_SYSTEM_OS_ID_LIKE=unknown&NETDATA_SYSTEM_OS_VERSION=unknown&NETDATA_SYSTEM_OS_VERSION_ID=unknown&NETDATA_SYSTEM_OS_DETECTION=unknown&NETDATA_HOST_IS_K8S_NODE=false&NETDATA_SYSTEM_KERNEL_NAME=Linux&NETDATA_SYSTEM_KERNEL_VERSION=5.4.0&NETDATA_SYSTEM_ARCHITECTURE=x86_64&NETDATA_SYSTEM_VIRTUALIZATION=none&NETDATA_SYSTEM_VIRT_DETECTION=systemd-detect-virt&NETDATA_SYSTEM_CONTAINER=openvz&NETDATA_SYSTEM_CONTAINER_DETECTION=systemd-detect-virt&NETDATA_CONTAINER_OS_NAME=Ubuntu&NETDATA_CONTAINER_OS_ID=ubuntu&NETDATA_CONTAINER_OS_ID_LIKE=debian&NETDATA_CONTAINER_OS_VERSION=20.04.5 LTS (Focal Fossa)&NETDATA_CONTAINER_OS_VERSION_ID=20.04&NETDATA_CONTAINER_OS_DETECTION=/etc/os-release&NETDATA_SYSTEM_CPU_LOGICAL_CPU_COUNT=4&NETDATA_SYSTEM_CPU_FREQ=1999000000&NETDATA_SYSTEM_TOTAL_RAM=4294967296&NETDATA_SYSTEM_TOTAL_DISK_SIZE=0&NETDATA_PROTOCOL_VERSION=1.1'
2023-01-16 14:17:25: STREAM: 16731 '[192.168.9.13]:39748' 'CONNECTED' host 'Ganymede' api key '****' machine guid '1cb11548-95a0-11ed-936c-0242e928a3df'
2023-01-16 14:17:25: STREAM: 16731 '[192.168.9.13]:39748' 'DISCONNECTED' host 'Ganymede' api key '****' machine guid '1cb11548-95a0-11ed-936c-0242e928a3df'

Error log (Data Host)

2023-01-16 14:20:31: netdata INFO  : WEB_SERVER[static1] : clients wants to STREAM metrics.
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : thread created with task id 16954
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : set name of thread 16954 to STREAM_RECEIVER
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [192.168.9.13]:40284: receive thread created (task id 16954)
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: initializing communication...
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS COMPRESSION FUNCTIONS REPLICATION BINARY 
2023-01-16 14:20:31: netdata ERROR : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: cannot send ready command. (errno 22, Invalid argument)
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: receive thread ended (task id 16954)
2023-01-16 14:20:31: netdata INFO  : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : thread with task id 16954 finished

Error log (Data Collector)

2023-01-16 14:00:20: netdata ERROR : GLOBAL_STATS : STREAM Ganymede [send]: not ready - collected metrics are not sent to parent.
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : thread created with task id 31997
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : set name of thread 31997 to STREAM_SENDER[G
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send]: thread created (task id 31997)
2023-01-16 14:00:20: netdata ERROR : STREAM_SENDER[Ganymede] : Clearing stream_collected_metrics flag in charts of host Ganymede (errno 9, Bad file descriptor)
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede: attempting to connect to '192.168.9.1:19999' (default port: 19999)...
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: initializing communication...
2023-01-16 14:00:20: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: waiting response from remote netdata...
2023-01-16 14:00:20: netdata ERROR : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: remote node response is not understood, is it Netdata?.

What I expected to happen

Not crashing and sending data

thanks for the info @potatojuicemachine (nice username also :slight_smile: )

@Manolis_Vasilakis does this look like maybe something ACLK + streaming related - i’m not quite sure from looking at above.

Hey @potatojuicemachine ! :slight_smile:

Is it possible please to update to a nightly version of netdata, so we can check from there on?

Hi,
switched to the nightly build on the data collector and data host (1.37.1).
Bad news is that during the uninstallation everything was deleted because of the script deleting the whole /opt/netdata directory (where everything was installed into)
Good news is that after updating only the data host it started working again.
I am not sure if it is because of the nightly build or because of something in the configuration file or databases being wrong.

But since this is working again I will close the topic.

Thank you for helping.

Edit: I am unable to close the topic so i just marked the solution :slight_smile:
Edit 2: read below

I was to quick to judge.

After some minutes the server stopped working again.

I postet the error log onto pastebin: 023-01-20 15:02:58: netdata INFO : MAIN : TIMEZONE: using the contents of /etc/ - Pastebin.com because it was too long for this post. I am not really sure what to look for.

This is the part of the error log, that looks the same to me:

2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : thread created with task id 6767
2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : set name of thread 6767 to STREAM_SENDER[G
2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send]: thread created (task id 6767)
2023-01-20 15:13:22: netdata ERROR : STREAM_SENDER[Ganymede] : Clearing stream_collected_metrics flag in charts of host Ganymede (errno 9, Bad file descriptor)
2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede: attempting to connect to '192.168.9.1:19999' (default port: 19999)...
2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: initializing communication...
2023-01-20 15:13:22: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: waiting response from remote netdata...
2023-01-20 15:13:22: netdata ERROR : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: remote node response is not understood, is it Netdata?.
2023-01-20 15:13:22: netdata ERROR : STREAM_SENDER[Ganymede] : Clearing stream_collected_metrics flag in charts of host Ganymede
2023-01-20 15:13:27: netdata ERROR : STREAM_SENDER[Ganymede] : Clearing stream_collected_metrics flag in charts of host Ganymede
2023-01-20 15:13:27: netdata INFO  : STREAM_SENDER[Ganymede] : STREAM Ganymede: skipping destination '192.168.9.1:19999' (default port: 19999) due to last error (code: -1, remote node response is not understood, is it Netdata?), will retry it in 55 seconds
2023-01-20 15:13:27: netdata ERROR : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: could not connect to parent node at this time.

I am not sure if this is the same error or a different one.
I have two more servers collecting and sending data to the data host with the same configuration where everything is working as intended.

Edit:
error.log of data host:

2023-01-20 15:10:15: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : thread created with task id 15342
2023-01-20 15:10:15: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : set name of thread 15342 to RCVR[Ganymede,[
2023-01-20 15:10:15: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM Ganymede [192.168.9.13]:52984: receive thread created (task id 15342)
2023-01-20 15:10:15: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM Ganymede [receive from [192.168.9.13]:52984]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS COMPRESSION FUNCTIONS REPLICATION BINARY 
2023-01-20 15:10:15: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM 'Ganymede' [receive from [192.168.9.13]:52984]: ready to receive data. STATUS: CONNECTED
2023-01-20 15:10:20: netdata ERROR : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM: read_stream(): EOF while reading data from socket!
2023-01-20 15:10:20: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM 'Ganymede' [receive from [192.168.9.13]:52984]: disconnected (completed 236 updates). STATUS: DISCONNECTED (SOCKET READ ERROR)
2023-01-20 15:10:20: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : STREAM 'Ganymede' [receive from [192.168.9.13]:52984]: receive thread ended (task id 15342)
2023-01-20 15:10:20: netdata INFO  : RCVR[Ganymede,[192.168.9.13]:52984] : thread with task id 15342 finished

I’m facing similar issue with my netdata v1.37.0-158-nightly (upgraded from stable v1.37.1)

@Manolis_Vasilakis is there any progress with the host I’ve shared with you for debugging? Just upgraded it from v1.37.0-158-nightly to v1.37.0-176-nightly, issue still persists.

@Habetdin both parent and child use the nightly version of Netdata? You said “facing similar issue”. What exact issue are you facing? A child can’t connect to a parent and the error is “remote node response is not understood, is it Netdata?”?

@Habetdin @potatojuicemachine is it a direct child-parent connection or there is some proxy in between?

Hello, @ilyam8

Exact issue is:

2023-01-17 16:52:39: netdata INFO  : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node: connecting to 'parent.netdata.node:20000' (default port: 20000)...
2023-01-17 16:52:39: netdata ERROR : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node [send to parent.netdata.node:20000]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-01-17 16:53:39
2023-01-17 16:53:39: netdata INFO  : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node: connecting to 'parent.netdata.node:20000' (default port: 20000)...
2023-01-17 16:53:39: netdata ERROR : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node [send to parent.netdata.node:20000]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-01-17 16:54:39
2023-01-17 16:53:43: netdata INFO  : ACLK_Main : RRDCONTEXT: received checkpoint command for claim id 'c141a88a-01b9-xxxx-9762-5452003ec23b', node id 'f6fea062-2d5b-xxxx-8c28-51061d59566e', while node 'child.netdata.node' has an active context streaming.
2023-01-17 16:53:43: netdata ERROR : ACLK_Main : RRDCONTEXT: received version hash 13409193782 for host 'child.netdata.node', does not match our version hash 344465. Sending snapshot of all contexts.
2023-01-17 16:54:39: netdata INFO  : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node: connecting to 'parent.netdata.node:20000' (default port: 20000)...
2023-01-17 16:54:40: netdata ERROR : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node [send to parent.netdata.node:20000]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-01-17 16:55:40
2023-01-17 16:55:40: netdata INFO  : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node: connecting to 'parent.netdata.node:20000' (default port: 20000)...
2023-01-17 16:55:40: netdata ERROR : STREAM_SENDER[child.netdata.node] : STREAM child.netdata.node [send to parent.netdata.node:20000]: remote node response is not understood, is it Netdata? - will retry in 60 secs, at 2023-01-17 16:56:40

Connection is direct, no proxies. I’ve switched the nodes to nightly version because @Manolis_Vasilakis asked for it to test the potential fixes (more info at issue #14129 @ GH).

As for now, Manolis reported that he has PoC fix for this issue, I guess we should wait for it :slight_smile:

Hi,

I am using a direct connection between host and child node.

If anyone else is reading this, this is the link to the GitHub issue: [Bug]: Dashboard empty/only displays background · Issue #14129 · netdata/netdata · GitHub

I think we will stop writing here, wait for this issue to resolve and see if it also solves this.

Hey @potatojuicemachine a couple of questions:

  1. You are not using SSL, right? (looks like it from your initial post, just to make sure).
  2. Are the parent and the child on a local network, or remote?

Thanks!

Hi,

  1. I don’t think so. I never created ot configured any ssl certificates. So if there is no default ssl then no.
  2. They are remote on two virtual servers and connected via a wireguard VPN network.

I think the issue has to do with the following. The server was hosted on a (bad) virtual server wich cause the collector process to wake up very late. Sometimes I got warnings with a wakeup delay of more than a day. Switching the server provider I did not have this problem again.