Streaming not always working properly

Hello,

My monitoring setup consists of 24 headless collectors streaming to 1 parent. SSL is enabled using a self-signed certificate.
Everything is deployed with Ansible.
Yesterday I tried to upgrade all my nodes to v1.37.1, which resulted in a complete failure.
Since I had at least 2 different install types amongst my nodes, I decided to uninstall everything, and then deploy again so I could have deb packages installed everywhere.

This solved all my previous issues, but a new one came up:

I now have 2 nodes unable to stream to the parent, with the following logs:

2022-12-15 14:21:10: netdata INFO  : STREAM_SENDER[child.node] : STREAM child.node: attempting to connect to 'tcp:parent.node:19998' (default port: 19999)...
2022-12-15 14:21:10: netdata INFO  : STREAM_SENDER[child.node] : STREAM child.node [send to tcp:parent.node:19998]: initializing communication...
2022-12-15 14:21:10: netdata INFO  : STREAM_SENDER[child.node] : STREAM child.node [send to tcp:parent.node:19998]: waiting response from remote netdata...
2022-12-15 14:21:10: netdata INFO  : STREAM_SENDER[child.node] : STREAM child.node [send to tcp:parent.node:19998]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS FUNCTIONS REPLICATION BINARY 
2022-12-15 14:21:10: netdata ERROR : STREAM_SENDER[child.node] : Clearing stream_collected_metrics flag in charts of host child.node
2022-12-15 14:21:10: netdata INFO  : STREAM_SENDER[child.node] : STREAM child.node [send to tcp:parent.node:19998]: enabling metrics streaming...
2022-12-15 14:21:10: netdata ERROR : STREAM_SENDER[child.node] : SSL_read() returned -1 bytes, SSL error 2 (errno 11, Resource temporarily unavailable)
2022-12-15 14:21:10: netdata ERROR : STREAM_SENDER[child.node] : Clearing stream_collected_metrics flag in charts of host child.node
2022-12-15 14:21:10: netdata ERROR : STREAM_SENDER[child.node] : Clearing stream_collected_metrics flag in charts of host child.node
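For readers trying to interpret the `SSL_read()` line above, here is a minimal sketch that decodes its numeric codes. It assumes the log line format shown in this post and a Linux child, where errno 11 is EAGAIN. The function name `decode_ssl_read_error` and the lookup tables are hypothetical helpers written for illustration, not part of Netdata.

```python
import re
import ssl

# Assumption: the child runs on Linux, where errno 11 is EAGAIN.
# Other platforms number errno values differently.
LINUX_ERRNO_NAMES = {11: "EAGAIN"}

# OpenSSL SSL_get_error() codes, taken from the constants in Python's ssl module.
SSL_ERROR_NAMES = {
    ssl.SSL_ERROR_SSL: "SSL_ERROR_SSL",
    ssl.SSL_ERROR_WANT_READ: "SSL_ERROR_WANT_READ",
    ssl.SSL_ERROR_WANT_WRITE: "SSL_ERROR_WANT_WRITE",
    ssl.SSL_ERROR_SYSCALL: "SSL_ERROR_SYSCALL",
    ssl.SSL_ERROR_ZERO_RETURN: "SSL_ERROR_ZERO_RETURN",
}

# Matches the error line exactly as it appears in the logs quoted in this post.
PATTERN = re.compile(
    r"SSL_read\(\) returned (-?\d+) bytes, SSL error (\d+) \(errno (\d+), ([^)]*)\)"
)


def decode_ssl_read_error(line):
    """Return a dict describing the SSL_read() error in a log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    returned = int(match.group(1))
    ssl_err = int(match.group(2))
    err_no = int(match.group(3))
    return {
        "returned": returned,
        "ssl_error": SSL_ERROR_NAMES.get(ssl_err, f"unknown ({ssl_err})"),
        "errno": LINUX_ERRNO_NAMES.get(err_no, f"errno {err_no}"),
        # WANT_READ / WANT_WRITE mean "try again later" on a non-blocking socket.
        "retryable": ssl_err in (ssl.SSL_ERROR_WANT_READ, ssl.SSL_ERROR_WANT_WRITE),
    }


if __name__ == "__main__":
    sample = (
        "2022-12-15 14:21:10: netdata ERROR : STREAM_SENDER[child.node] : "
        "SSL_read() returned -1 bytes, SSL error 2 "
        "(errno 11, Resource temporarily unavailable)"
    )
    print(decode_ssl_read_error(sample))
```

Read this way, SSL error 2 with errno 11 corresponds to SSL_ERROR_WANT_READ with EAGAIN, which on a non-blocking socket normally means that no data was available yet, not that the TLS session itself failed. Whether the sender handles that case correctly here is not something the logs alone can confirm.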

$ netdata -W buildinfo
Version: netdata v1.37.1
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

The 2 problematic servers are both part of a replicated group of servers in my infrastructure. Their counterparts are working flawlessly using the very same installation procedure (Ansible playbook), configuration files, OS, hardware…

Any help troubleshooting this is welcome, and I’m happy to provide more information if needed.

Thanks

Hi @gaelteractys thanks for your report.

We are aware of such issues and are really sorry this one has hit you. We are currently in the process of fixing it.

Thanks again, will update as soon as we have a solution.

@Manolis_Vasilakis Good to know you’re working on it! I’m very curious to see the magic behind this issue!

Hi @Manolis_Vasilakis, has there been any progress on this topic?

Hi @gaelteractys

Sorry, we’re still on it. We should have some news in the coming days (sorry, vacations also got in the way!).

Hi @gaelteractys

Really sorry for the long delay.

Can you please check with the latest nightly whether the streaming issues are improved or fixed? It needs to be installed on both the child and the parent.

Thank you.

Hi @Manolis_Vasilakis
Unfortunately, I won’t be able to test it before the next release, since we deployed Netdata on the stable channel.