One node "is currently not connected"

Bare metal system SLES
Linux hana3 5.3.18-150300.59.46-default
A similar system works perfectly, but this one worked for only a few hours after being reclaimed and is now permanently shown as "currently not connected".

# bin/netdata -W buildinfo
Version: netdata v1.38.0-106-g618123f6b
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
    Binary architecture: x86_64
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

A look into the error log:

2023-02-20 09:43:26: netdata INFO  : ACLK_MAIN : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2023-02-20 09:43:26: netdata ERROR : ACLK_MAIN : Connection Error or Dropped
2023-02-20 09:43:26: netdata INFO  : ACLK_MAIN : Wait before attempting to reconnect in 1.615 seconds
2023-02-20 09:43:28: netdata INFO  : ACLK_MAIN : Attempting connection now
2023-02-20 09:43:28: netdata INFO  : ACLK_MAIN : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2023-02-20 09:43:28: netdata INFO  : ACLK_MAIN : Getting Cloud /env successful
2023-02-20 09:43:28: netdata INFO  : ACLK_MAIN : New ACLK protobuf protocol negotiated successfully (/env response).
2023-02-20 09:43:29: netdata INFO  : ACLK_MAIN : HTTPS "GET" request to "api.netdata.cloud" finished with HTTP code: 200
2023-02-20 09:43:29: netdata INFO  : ACLK_MAIN : ACLK_OTP Got Challenge from Cloud
2023-02-20 09:43:29: netdata INFO  : ACLK_MAIN : HTTPS "POST" request to "api.netdata.cloud" finished with HTTP code: 201
2023-02-20 09:43:29: netdata INFO  : ACLK_MAIN : ACLK_OTP Got Password from Cloud
2023-02-20 09:43:29: netdata INFO  : ACLK_MAIN : [mqtt_wss] I: Going to connect using internal MQTT 5 implementation
2023-02-20 09:43:29: netdata ERROR : ACLK_MAIN : [mqtt_wss] W: client_id provided is longer than 23 bytes, server might not allow that [MQTT-3.1.3-5]
2023-02-20 09:43:30: netdata INFO  : ACLK_MAIN : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2023-02-20 09:43:30: netdata INFO  : ACLK_MAIN : [mqtt_wss] I: mqtt_client: MQTT Connection Accepted By Server
2023-02-20 09:43:30: netdata INFO  : ACLK_MAIN : ACLK connection successfully established
2023-02-20 09:43:30: netdata INFO  : ACLK_MAIN : Queuing status update for node=e2e1a6aa-d2c5-427c-8883-ff915435500f, live=1, hops=0

ACLK says:

# ./bin/netdatacli aclk-state
ACLK Available: Yes
ACLK Version: 2
Protocols Supported: Protobuf
Protocol Used: Protobuf
MQTT Version: 5
Claimed: Yes
Claimed Id: acd7ea55-2ddc-XXX-ad00-173c6a31edbc
Cloud URL: https://app.netdata.cloud
Online: Yes
Reconnect count: 187
Banned By Cloud: No
Last Connection Time: 2023-02-20 09:43:30
Last Connection Time + 3 PUBACKs received: 2023-02-20 09:16:30
Last Disconnect Time: 2023-02-20 09:43:26
Received Cloud MQTT Messages: 1
MQTT Messages Confirmed by Remote Broker (PUBACKs): 2

> Node Instance for mGUID: "acd7ea55-2ddc-XXX-ad00-173c6a31edbc" hostname "hana3"
	Claimed ID: acd7ea55-2ddc-XXX-ad00-173c6a31edbc
	Node ID: e2e1a6aa-d2c5-XXX-8883-ff915435500f
	Streaming Hops: 0
	Relationship: self
	Alert Streaming Status:
		Updates: 0
		Batch ID: 0
		Last Acked Seq ID: 0
		Pending Min Seq ID: 1
		Pending Max Seq ID: 61
		Last Submitted Seq ID: 0

Any idea what's going on here?

Hi Bernd,

I’ve found the node in question in our events log, and see that it is being bounced. Looking into it and will update when I know why.

Hi @Bernd! In addition to the investigation by @ralphm (from which it appears that the agent does not send the required messages), could you please do the following:

In the agent’s netdata.conf, please enable the option conversation log = yes in the [cloud] section.

This will create an extra log file called aclk.log. If possible, please let the agent run for a while, until it has gone through what seems to be a connect/disconnect loop a few times, then share this file with manolis@netdata.cloud.

Please disable this option again after the debug session, since it can gather a lot of data!
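
To judge whether the agent has gone through enough connect/disconnect cycles before sending the file, the drops can be counted from the regular error log. Below is a minimal Python sketch, assuming the line format seen in the error log excerpt earlier in this thread; the function name `aclk_drop_times` is hypothetical and not part of Netdata.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt in this thread, e.g.
# "2023-02-20 09:43:26: netdata ERROR : ACLK_MAIN : Connection Error or Dropped"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata (\w+)\s*: ACLK_MAIN : (.*)$"
)


def aclk_drop_times(lines):
    """Return the timestamp of every ACLK 'Connection Error or Dropped' line."""
    drops = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3).startswith("Connection Error or Dropped"):
            drops.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return drops
```

Passing an open file handle for the agent's error log (its location depends on the install) returns one timestamp per drop, so `len(aclk_drop_times(f))` gives the number of cycles seen so far.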

Thanks!

@Bernd It looks like the Agent is not sending the commands we expect it to send, so it never announces the corresponding node as online. A connected Agent that doesn’t announce the local node as online is automatically disconnected after a while. From the logs you provided, it appears that the local clock is several minutes behind the time in Cloud. If readjusting the clock fixes the issue for you, we will have a better idea of what could be going wrong. Could you verify this?
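
One rough way to check for a clock offset of this size is to compare the local clock with the `Date` header of an HTTPS response. The following Python sketch uses only the standard library; the function names are hypothetical, and it assumes that the server at the Cloud URL from the `aclk-state` output answers a HEAD request with a `Date` header. The header has one-second resolution, so this is only suitable for spotting skews of many seconds or minutes, not for precise measurement.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def fetch_date_header(url="https://app.netdata.cloud"):
    """Return the raw Date header of a HEAD response (assumes the server sends one)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers["Date"]


def clock_skew_seconds(date_header, local_now=None):
    """Local clock minus server clock in seconds; negative means local is behind."""
    server_time = parsedate_to_datetime(date_header)
    if local_now is None:
        local_now = datetime.now(timezone.utc)
    return (local_now - server_time).total_seconds()
```

Running `clock_skew_seconds(fetch_date_header())` on the affected host would return a value of roughly minus a few hundred if the local clock is several minutes behind.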

Nice find!!
The server didn't have NTP sync enabled, what a shame :) hana3 is now showing up again!
Thanks, I will disable the extra logging again.