Streaming metrics get dropped



  • ![0_1596574556387_missing-metrics.png](Uploading 100%)

    This is a screenshot viewing metrics of a slave netdata. As you can see on all 3 charts, there is missing data.

    When I check the log on the slave, there are tons of messages complaining about needing to restart the connection to the master.

    I restart netdata on the slave and everything starts reporting as expected again. Master and slave are all 1.23.1 on ubuntu 18.04 using packagecloud .debs.

    Has anyone seen this before? Is there a way to alert on the master when this condition occurs so I can know to restart netdata on the slave?![0_1596574811805_missing-metrics.png](Uploading 100%)



  • Is there a way you can update to 1.23.2 and see if it’s still happening?

    Also can you share the log messages?



  • 2020-07-06 23:22:28: tc-qos-helper.sh: WARNING: Cannot find file ‘/usr/lib/netdata/conf.d/tc-qos-helper.conf’.
    2020-07-06 23:22:28: tc-qos-helper.sh: WARNING: Cannot find file ‘/etc/netdata/tc-qos-helper.conf’.
    2020-07-06 23:30:52: apps.plugin ERROR : MAIN : Cannot process /proc/21459/status (command ‘postqueue’) (errno 3, No such process)
    2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: error during read (-1). Restarting connection (errno 104, Connection reset by peer)
    2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 792398447 bytes on this connecti
    2020-07-06 23:37:59: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: restart stream because socket reports errors (POLLERR) - 792398447 bytes transmitted.
    2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:37:59: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
    2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:37:59: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:37:59: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:00: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
    2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:38:00: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
    2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:38:00: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:38:00: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:01: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
    2020-07-06 23:38:01: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.
    2020-07-06 23:38:01: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 113197 bytes on this connection.
    2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:38:01: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:38:01: netdata INFO : PLUGINSD[apps] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 93011 bytes on this connection.
    2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: restart stream because socket reports errors (POLLERR) - 93011 bytes transmitted.
    2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:38:02: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.
    2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:38:02: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:38:02: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send]: discarding 341 bytes of metrics already in the buffer.
    2020-07-06 23:38:02: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:03: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: failed to send metrics - closing connection - we have sent 110210 bytes on this connection.
    2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:38:03: netdata ERROR : PLUGINSD[nfacct] : STREAM netdata-slave [send]: not ready - discarding collected metrics. (errno 9, Bad file descriptor)
    2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:38:03: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:38:03: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send]: discarding 839 bytes of metrics already in the buffer.
    2020-07-06 23:38:03: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:04: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
    2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connecting…
    2020-07-06 23:38:04: netdata ERROR : PLUGIN[diskspace] : STREAM netdata-slave [send]: not ready - discarding collected metrics.
    2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: initializing communication…
    2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: waiting response from remote netdata…
    2020-07-06 23:38:04: netdata INFO : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: established communication with a parent using protocol version 2 - ready to send metrics…
    2020-07-06 23:38:04: netdata INFO : PLUGIN[proc] : STREAM netdata-slave [send]: sending metrics…
    2020-07-06 23:38:05: netdata ERROR : STREAM_SENDER[netdata-slave] : STREAM netdata-slave [send to netdata-master:19999]: connection closed by far end. Restarting connection
    2020-07-06 23:38:05: netdata ERROR : PLUGIN[proc] : STREAM netdata-slave [send]: not ready - discarding collected metrics. (errno 2, No such file or directory)
    2020-07-06 23:38:05: netdata ERROR : STATSD : STREAM netdata-slave [send]: not ready - discarding collected metrics.



  • And then the logs continue with more of the same for the next month until a developer comes and asks me why they didn’t get alerted on a synthetic statsd chart I setup…I restart netdata and test out the alerts for the synthetic chart and it works. Ideally I could get notified of this situation before a developer asks my boss and me why alerts aren’t working.

    I wish I could upload that screenshot…

    I will try 1.23.2 to see if it improves.



  • Image uploads should “just” work, there is an image upload button as long as you don’t use the “quick reply”. The log will probably be more useful either way.



  • missing-metrics.png



  • That’s a screenshot of netdata-slave viewed from netdata-master.



  • Looking at the log from the child node it indicates that the connection is being closed from the far side (the parent node). The log messages about POLLERR indicate that the socket was closed when the agent tried to read from it. During the attempt to re-establish the connection I can see that it is repeatedly closed by the far end.

    Can we see a similar snippet of the log from the parent node to see what is happening on the far side?



  • Thank you for the reply. Here is a snippet from the master during the same timeframe:

    2020-07-06 23:12:59: tc-qos-helper.sh: WARNING: Cannot find file ‘/usr/lib/netdata/conf.d/tc-qos-helper.conf’.
    2020-07-06 23:12:59: tc-qos-helper.sh: WARNING: Cannot find file ‘/etc/netdata/tc-qos-helper.conf’.
    2020-07-06 23:37:59: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : STREAM netdata-slave [receive from [x.x.x.x]:53138]: disconnected (completed 4881858 updates).
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : STREAM netdata-slave [receive from [x.x.x.x]:53138]: receive thread ended (task id 26818)
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:53138] : thread with task id 26818 finished
    2020-07-06 23:37:59: netdata INFO : WEB_SERVER[static3] : clients wants to STREAM metrics.
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : thread created with task id 19871
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : set name of thread 19871 to STREAM_RECEIVER
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [x.x.x.x]:56772: receive thread created (task id 19871)
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : SYSTEM_INFO: free 0x55cbb50c7360
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: initializing communication…
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: Netdata is using the stream version 2.
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
    2020-07-06 23:37:59: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: receiving metrics…
    2020-07-06 23:38:00: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: disconnected (completed 79 updates).
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : STREAM netdata-slave [receive from [x.x.x.x]:56772]: receive thread ended (task id 19871)
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56772] : thread with task id 19871 finished
    2020-07-06 23:38:00: netdata INFO : WEB_SERVER[static1] : clients wants to STREAM metrics.
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : thread created with task id 19873
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : set name of thread 19873 to STREAM_RECEIVER
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [x.x.x.x]:56776: receive thread created (task id 19873)
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : SYSTEM_INFO: free 0x55cbb51eade0
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: initializing communication…
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: Netdata is using the stream version 2.
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
    2020-07-06 23:38:00: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: receiving metrics…
    2020-07-06 23:38:01: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: disconnected (completed 122 updates).
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : STREAM netdata-slave [receive from [x.x.x.x]:56776]: receive thread ended (task id 19873)
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56776] : thread with task id 19873 finished
    2020-07-06 23:38:01: netdata INFO : WEB_SERVER[static3] : clients wants to STREAM metrics.
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : thread created with task id 19887
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : set name of thread 19887 to STREAM_RECEIVER
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [x.x.x.x]:56780: receive thread created (task id 19887)
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : SYSTEM_INFO: free 0x55cbbd3249b0
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: initializing communication…
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: Netdata is using the stream version 2.
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
    2020-07-06 23:38:01: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: receiving metrics…
    2020-07-06 23:38:02: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: disconnected (completed 79 updates).
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : STREAM netdata-slave [receive from [x.x.x.x]:56780]: receive thread ended (task id 19887)
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56780] : thread with task id 19887 finished
    2020-07-06 23:38:02: netdata INFO : WEB_SERVER[static2] : clients wants to STREAM metrics.
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : thread created with task id 19889
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : set name of thread 19889 to STREAM_RECEIVER
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [x.x.x.x]:56782: receive thread created (task id 19889)
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : SYSTEM_INFO: free 0x55cbb51e4ef0
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: initializing communication…
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: Netdata is using the stream version 2.
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : Postponing health checks for 60 seconds, on host ‘netdata-slave’, because it was just connected.
    2020-07-06 23:38:02: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: receiving metrics…
    2020-07-06 23:38:03: netdata ERROR : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: disconnected (completed 133 updates).
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : STREAM netdata-slave [receive from [x.x.x.x]:56782]: receive thread ended (task id 19889)
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56782] : thread with task id 19889 finished
    2020-07-06 23:38:03: netdata INFO : WEB_SERVER[static1] : clients wants to STREAM metrics.
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : thread created with task id 19891
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : set name of thread 19891 to STREAM_RECEIVER
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : STREAM netdata-slave [x.x.x.x]:56786: receive thread created (task id 19891)
    2020-07-06 23:38:03: netdata INFO : STREAM_RECEIVER[netdata-slave,[x.x.x.x]:56786] : SYSTEM_INFO: free 0x55cbb51e3c70

    netdata-master has 9 other clients that never show this issue. netdata-slave is still having these issues even after upgrading all 11 hosts to 1.23.2.



  • I can’t anything in either log that shows a problem. Both sides see the connection being closed from the far end. Is there is any difference in the network configuration between the child node with the problem and the ones that work?



  • I have added this issue https://github.com/netdata/netdata/issues/9734 on Github since it will be easier for the team track it that way. We’ll also want some more information about the network setup, since the logs don’t seem to have anything helpful.



  • Thanks again for the replies. Apologies, I didn’t see them.

    I have tracked this down to long statsd statistics names and opened another issue: https://github.com/netdata/netdata/issues/9768

    I got lucky and found a simple way to reproduce: https://github.com/vince-lessbits/netdata-test

    Sorry for so much github paperwork.



  • Thank you for helping out with this, being able to reproduce it easily should help a lot.


Log in to reply