Streaming metrics get dropped

![0_1596574556387_missing-metrics.png](Uploading 100%)

This is a screenshot viewing metrics of a slave netdata. As you can see on all 3 charts, there is missing data.

When I check the log on the slave, there are tons of messages complaining about needing to restart the connection to the master.

I restart netdata on the slave and everything starts reporting as expected again. Master and slave are all 1.23.1 on ubuntu 18.04 using packagecloud .debs.

Has anyone seen this before? Is there a way to alert on the master when this condition occurs so I can know to restart netdata on the slave?![0_1596574811805_missing-metrics.png](Uploading 100%)

Thank you for helping out with this, being able to reproduce it easily should help a lot.

Thanks again for the replies. Apologies, I didn’t see them.

I have tracked this down to long statsd statistics names and opened another issue: Long statsd timer names break streaming replication · Issue #9768 · netdata/netdata · GitHub

I got lucky and found a simple way to reproduce: GitHub - vince-lessbits/netdata-test

Sorry for so much github paperwork.

I have added this issue Unexpected disconnects between parent and child · Issue #9734 · netdata/netdata · GitHub on Github since it will be easier for the team track it that way. We’ll also want some more information about the network setup, since the logs don’t seem to have anything helpful.

I can’t anything in either log that shows a problem. Both sides see the connection being closed from the far end. Is there is any difference in the network configuration between the child node with the problem and the ones that work?

Thank you for the reply. Here is a snippet from the master during the same timeframe:

2020-07-06 23:12:59: tc-qos-helper.sh: WARNING: Cannot find file ‘/usr/lib/netdata/conf.d/tc-qos-helper.conf’.
2020-07-06 23:12:59: tc-qos-helper.sh: WARNING: Cannot find file ‘/etc/netdata/tc-qos-helper.conf’.
2020-07-06 23:37:59: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : STREAM netdata-slave [receive from [x.x.x.x]:53138]: disconnected (completed 4881858 updates).
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : STREAM netdata-slave [receive from [x.x.x.x]:53138]: receive thread ended (task id 26818)
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : thread with task id 26818 finished
2020-07-06 23:37:59: netdata INFO : WEB_SERVER[static3] : clients wants to STREAM metrics.
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : thread created with task id 19871
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : set name of thread 19871 to STREAM_RECEIVER
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [x.x.x.x]:56772: receive thread created (task id 19871)
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : SYSTEM_INFO: free 0x55cbb50c7360
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: initializing communication…
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: Netdata is using the stream version 2.
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: receiving metrics…
2020-07-06 23:38:00: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: disconnected (completed 79 updates).
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: receive thread ended (task id 19871)
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : thread with task id 19871 finished
2020-07-06 23:38:00: netdata INFO : WEB_SERVER[static1] : clients wants to STREAM metrics.
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : thread created with task id 19873
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : set name of thread 19873 to STREAM_RECEIVER
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [x.x.x.x]:56776: receive thread created (task id 19873)
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : SYSTEM_INFO: free 0x55cbb51eade0
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: initializing communication…
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: Netdata is using the stream version 2.
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: receiving metrics…
2020-07-06 23:38:01: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: disconnected (completed 122 updates).
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: receive thread ended (task id 19873)
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : thread with task id 19873 finished
2020-07-06 23:38:01: netdata INFO : WEB_SERVER[static3] : clients wants to STREAM metrics.
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : thread created with task id 19887
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : set name of thread 19887 to STREAM_RECEIVER
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [x.x.x.x]:56780: receive thread created (task id 19887)
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : SYSTEM_INFO: free 0x55cbbd3249b0
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: initializing communication…
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: Netdata is using the stream version 2.
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: receiving metrics…
2020-07-06 23:38:02: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: disconnected (completed 79 updates).
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: receive thread ended (task id 19887)
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : thread with task id 19887 finished
2020-07-06 23:38:02: netdata INFO : WEB_SERVER[static2] : clients wants to STREAM metrics.
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : thread created with task id 19889
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : set name of thread 19889 to STREAM_RECEIVER
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [x.x.x.x]:56782: receive thread created (task id 19889)
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : SYSTEM_INFO: free 0x55cbb51e4ef0
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: initializing communication…
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: Netdata is using the stream version 2.
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: receiving metrics…
2020-07-06 23:38:03: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: disconnected (completed 133 updates).
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: receive thread ended (task id 19889)
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : thread with task id 19889 finished
2020-07-06 23:38:03: netdata INFO : WEB_SERVER[static1] : clients wants to STREAM metrics.
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : thread created with task id 19891
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : set name of thread 19891 to STREAM_RECEIVER
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : STREAM netdata-slave [x.x.x.x]:56786: receive thread created (task id 19891)
2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : SYSTEM_INFO: free 0x55cbb51e3c70

netdata-master has 9 other clients that never show this issue. netdata-slave is still having these issues even after upgrading all 11 hosts to 1.23.2.

Looking at the log from the child node it indicates that the connection is being closed from the far side (the parent node). The log messages about POLLERR indicate that the socket was closed when the agent tried to read from it. During the attempt to re-establish the connection I can see that it is repeatedly closed by the far end.

Can we see a similar snippet of the log from the parent node to see what is happening on the far side?

That’s a screenshot of netdata-slave viewed from netdata-master.

!

Image uploads should “just” work, there is an image upload button as long as you don’t use the “quick reply”. The log will probably be more useful either way.

And then the logs continue with more of the same for the next month until a developer comes and asks me why they didn’t get alerted on a synthetic statsd chart I setup…I restart netdata and test out the alerts for the synthetic chart and it works. Ideally I could get notified of this situation before a developer asks my boss and me why alerts aren’t working.

I wish I could upload that screenshot…

I will try 1.23.2 to see if it improves.

2020-07-06 23:22:28: tc-qos-helper.sh: WARNING: Cannot find file ‘/usr/lib/netdata/conf.d/tc-qos-helper.conf’.
2020-07-06 23:22:28: tc-qos-helper.sh: WARNING: Cannot find file ‘/etc/netdata/tc-qos-helper.conf’.
2020-07-06 23:30:52: apps.plugin ERROR : MAIN : Cannot process /proc/21459/status (command ‘postqueue’) (errno 3, No such process)
2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: error during read (-1). Restarting connection (errno 104, Connection reset by peer)
2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 792398447 bytes on this connecti
2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: restart stream because socket reports errors (POLLERR) - 792398447 bytes transmitted.
2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:37:59: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:37:59: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:00: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:38:00: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:38:00: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:01: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
2020-07-06 23:38:01: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.
2020-07-06 23:38:01: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 113197 bytes on this connection.
2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:38:01: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 93011 bytes on this connection.
2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: restart stream because socket reports errors (POLLERR) - 93011 bytes transmitted.
2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:38:02: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.
2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send]: discarding 341 bytes of metrics already in the buffer.
2020-07-06 23:38:02: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:03: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 110210 bytes on this connection.
2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:38:03: netdata ERROR : PLUGINSD[nfacct] : STREAM netdata-slave [send]: not ready - discarding collected metrics. (errno 9, Bad file descriptor)
2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:38:03: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send]: discarding 839 bytes of metrics already in the buffer.
2020-07-06 23:38:03: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:04: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
2020-07-06 23:38:04: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
2020-07-06 23:38:04: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
2020-07-06 23:38:05: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
2020-07-06 23:38:05: netdata ERROR : PLUGIN[proc] : STREAM netdata-slave [send]: not ready - discarding collected metrics. (errno 2, No such file or directory)
2020-07-06 23:38:05: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.

Is there a way you can update to 1.23.2 and see if it’s still happening?

Also can you share the log messages?