Hi everyone
Problem/Question
I installed Netdata on two servers: one that stores the data and one that collects it. Last Thursday the collector stopped working, and restarting the service or the server does not help.
While debugging, I saw that this only happens when streaming is enabled (stream.conf: [stream] enabled = yes). Because of this, I searched for connection-specific errors and filtered the log files down to the sections I think contain the information needed to fix the issue.
Relevant docs you followed/actions you took to solve the issue
I did a clean reinstall with the kickstart script (Install Netdata with kickstart.sh | Learn Netdata).
I purged Netdata (following Uninstalling Netdata and all traces - Ubuntu Server) and did another clean reinstall.
I searched the error log and googled a lot, but I could not find what exactly causes the agent to crash.
Environment/Browser/Agent’s version etc
Version:
$ netdata -v
netdata v1.37.1
netdata.conf
[global]
run as user = netdata
update every = 5
hostname = Ganymede
[db]
mode = none
[web]
mode = none
stream.conf (commented-out lines removed)
[stream]
enabled = yes
destination = 192.168.9.1:19999
api key = "*****"
timeout seconds = 60
default port = 19999
send charts matching = *
buffer size bytes = 10485760
reconnect delay seconds = 5
initial clock resync iterations = 60
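To rule out a plain reachability problem, the destination from the stream.conf above can be read and probed with a TCP connect. This is only a minimal sketch: `parse_destination` and `can_connect` are hypothetical helper names, and it assumes a single plain `host:port` destination as in this post. A successful connect only shows that the port is open; it says nothing about the streaming handshake itself.

```python
import configparser
import socket

# Values copied from the stream.conf in this post (api key omitted).
SAMPLE_CONF = """
[stream]
enabled = yes
destination = 192.168.9.1:19999
default port = 19999
"""


def parse_destination(conf_text, default_port=19999):
    """Return (host, port) from the [stream] section of stream.conf-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    stream = parser["stream"]
    destination = stream.get("destination", "").strip()
    # Assumption: one plain "host:port" or "host" entry, as in the post.
    host, _, port = destination.partition(":")
    if port:
        return host, int(port)
    return host, int(stream.get("default port", default_port))


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = parse_destination(SAMPLE_CONF)
    print(host, port, "reachable" if can_connect(host, port) else "not reachable")
```

Run on the collector, this would report whether 192.168.9.1:19999 accepts TCP connections at all.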
Access log (Data Host)
2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'CONNECTED'
2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'DISCONNECTED'
2023-01-16 14:17:25: 99: 10028 '[192.168.9.13]:39748' 'STREAM' (sent/all = 0/0 bytes -0%, prep/sent/total = 1673875045005.65/1673875045005.84/0.20 ms) 200 'key=****&hostname=Ganymede&registry_hostname=Ganymede&machine_guid=1cb11548-95a0-11ed-936c-0242e928a3df&update_every=5&os=linux&timezone=Europe/Berlin&abbrev_timezone=CET&utc_offset=3600&hops=1&ml_capable=1&ml_enabled=1&mc_version=1&tags=&ver=16376&NETDATA_INSTANCE_CLOUD_TYPE=unknown&NETDATA_INSTANCE_CLOUD_INSTANCE_TYPE=unknown&NETDATA_INSTANCE_CLOUD_INSTANCE_REGION=unknown&NETDATA_SYSTEM_OS_NAME=unknown&NETDATA_SYSTEM_OS_ID=unknown&NETDATA_SYSTEM_OS_ID_LIKE=unknown&NETDATA_SYSTEM_OS_VERSION=unknown&NETDATA_SYSTEM_OS_VERSION_ID=unknown&NETDATA_SYSTEM_OS_DETECTION=unknown&NETDATA_HOST_IS_K8S_NODE=false&NETDATA_SYSTEM_KERNEL_NAME=Linux&NETDATA_SYSTEM_KERNEL_VERSION=5.4.0&NETDATA_SYSTEM_ARCHITECTURE=x86_64&NETDATA_SYSTEM_VIRTUALIZATION=none&NETDATA_SYSTEM_VIRT_DETECTION=systemd-detect-virt&NETDATA_SYSTEM_CONTAINER=openvz&NETDATA_SYSTEM_CONTAINER_DETECTION=systemd-detect-virt&NETDATA_CONTAINER_OS_NAME=Ubuntu&NETDATA_CONTAINER_OS_ID=ubuntu&NETDATA_CONTAINER_OS_ID_LIKE=debian&NETDATA_CONTAINER_OS_VERSION=20.04.5 LTS (Focal Fossa)&NETDATA_CONTAINER_OS_VERSION_ID=20.04&NETDATA_CONTAINER_OS_DETECTION=/etc/os-release&NETDATA_SYSTEM_CPU_LOGICAL_CPU_COUNT=4&NETDATA_SYSTEM_CPU_FREQ=1999000000&NETDATA_SYSTEM_TOTAL_RAM=4294967296&NETDATA_SYSTEM_TOTAL_DISK_SIZE=0&NETDATA_PROTOCOL_VERSION=1.1'
2023-01-16 14:17:25: STREAM: 16731 '[192.168.9.13]:39748' 'CONNECTED' host 'Ganymede' api key '****' machine guid '1cb11548-95a0-11ed-936c-0242e928a3df'
2023-01-16 14:17:25: STREAM: 16731 '[192.168.9.13]:39748' 'DISCONNECTED' host 'Ganymede' api key '****' machine guid '1cb11548-95a0-11ed-936c-0242e928a3df'
Error log (Data Host)
2023-01-16 14:20:31: netdata INFO : WEB_SERVER[static1] : clients wants to STREAM metrics.
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : thread created with task id 16954
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : set name of thread 16954 to STREAM_RECEIVER
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [192.168.9.13]:40284: receive thread created (task id 16954)
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: initializing communication...
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS COMPRESSION FUNCTIONS REPLICATION BINARY
2023-01-16 14:20:31: netdata ERROR : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: cannot send ready command. (errno 22, Invalid argument)
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : STREAM Ganymede [receive from [192.168.9.13]:40284]: receive thread ended (task id 16954)
2023-01-16 14:20:31: netdata INFO : STREAM_RECEIVER[Ganymede,[192.168.9.13]:40284] : thread with task id 16954 finished
Error log (Data Collector)
2023-01-16 14:00:20: netdata ERROR : GLOBAL_STATS : STREAM Ganymede [send]: not ready - collected metrics are not sent to parent.
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : thread created with task id 31997
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : set name of thread 31997 to STREAM_SENDER[G
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : STREAM Ganymede [send]: thread created (task id 31997)
2023-01-16 14:00:20: netdata ERROR : STREAM_SENDER[Ganymede] : Clearing stream_collected_metrics flag in charts of host Ganymede (errno 9, Bad file descriptor)
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : STREAM Ganymede: attempting to connect to '192.168.9.1:19999' (default port: 19999)...
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: initializing communication...
2023-01-16 14:00:20: netdata INFO : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: waiting response from remote netdata...
2023-01-16 14:00:20: netdata ERROR : STREAM_SENDER[Ganymede] : STREAM Ganymede [send to 192.168.9.1:19999]: remote node response is not understood, is it Netdata?.
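For anyone who wants to reproduce the filtering I did on the error logs, it can be sketched roughly like this. The line format is inferred only from the excerpts above (timestamp, `netdata`, level, thread, message), and `stream_errors` is a hypothetical helper name, so treat it as an illustration, not a parser for every Netdata log line.

```python
import re

# Format inferred from the excerpts in this post:
# "<timestamp>: netdata <LEVEL> : <thread> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata "
    r"(?P<level>[A-Z]+) : (?P<thread>.+?) : (?P<message>.*)$"
)
# Trailing "(errno N, text)" suffix, when present.
ERRNO = re.compile(r"\(errno (?P<code>\d+), (?P<text>[^)]*)\)")


def stream_errors(lines):
    """Yield (timestamp, thread, message, errno or None) for streaming ERROR lines."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m or m["level"] != "ERROR":
            continue  # keep only well-formed ERROR lines
        if "STREAM" not in m["thread"] and "STREAM" not in m["message"]:
            continue  # keep only lines related to streaming
        e = ERRNO.search(m["message"])
        yield m["ts"], m["thread"], m["message"], int(e["code"]) if e else None


if __name__ == "__main__":
    import sys

    # Usage: python filter_log.py /path/to/error.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for ts, thread, message, errno_code in stream_errors(fh):
            print(ts, thread, errno_code, message)
```

On the excerpts above this keeps the "cannot send ready command" line from the data host (errno 22) and the "not ready" and "remote node response is not understood" lines from the collector.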
What I expected to happen
The collector should not crash and should send its data to the data host.