Servers Incorrectly "Offline"

Problem/Question

All my servers just updated to 1.33. Now, 200 of them report offline in Netdata Cloud. 80 still show online.

All servers, both those showing offline and those showing online, are running Netdata v1.33.0.

Despite that, the 200 “offline” servers are really online. I have SSH’d into 20 of them, and every server I’ve checked has the Netdata daemon running, seemingly just fine. I used Ansible to issue a “restart netdata” command to all my servers, but this didn’t cause any of them to show up. I can still see them all in my WireGuard VPN, so I’m assuming the issue lies with Netdata.

Environment/Browser

Netdata Agent Installed on: Ubuntu 18.04
Browser used to access Netdata Cloud: Chrome, Chromium, Firefox, and Brave, all latest versions.

What I expected to happen

For offline servers to actually be offline. For online servers to be reported as online.

Here’s a good example screenshot. I am SSH’d into the host, showing netdata’s daemon and current version, yet Netdata Cloud shows the host offline.

Could you provide the output of grep -i aclk /var/log/netdata/error.log ?

What on earth…as far as I can tell it’s telling me it’s working just fine. As of making this post, no behavior change. Shows offline in cloud, but I can reach it no problem via my wireguard VPN.

2022-02-02 20:14:56: netdata INFO  : MAIN : Starting ACLK sync thread for host 2cbfeaf2-ba75-11eb-8bcc-98fa9b22e610 -- scratch area 786944 bytes
2022-02-02 20:14:56: netdata INFO  : MAIN : SQLite aclk sync initialization
2022-02-02 20:14:56: netdata INFO  : MAIN : SQLite aclk sync initialization completed
2022-02-02 20:14:56: netdata INFO  : ACLK_Main : thread created with task id 21576
2022-02-02 20:14:56: netdata INFO  : ACLK_Main : set name of thread 21576 to ACLK_Main
2022-02-02 20:14:56: netdata INFO  : ACLK_Main : Waiting for Cloud to be enabled
2022-02-02 20:15:01: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-02-02 20:15:01: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-02 20:15:01: netdata INFO  : ACLK_Stats : thread created with task id 21960
2022-02-02 20:15:01: netdata INFO  : ACLK_Stats : set name of thread 21960 to ACLK_Stats
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-02 20:15:02: netdata INFO  : ACLK_Main : Starting 6 query threads.
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_0 : thread created with task id 21964
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_1 : thread created with task id 21965
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_3 : thread created with task id 21967
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_5 : thread created with task id 21969
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_1 : set name of thread 21965 to ACLK_Query_1
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_0 : set name of thread 21964 to ACLK_Query_0
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_4 : thread created with task id 21968
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_3 : set name of thread 21967 to ACLK_Query_3
2022-02-02 20:15:02: netdata INFO  : ACLK_Query_2 : thread created with task id 21966
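With a few hundred hosts, reading each log by eye gets tedious. The following is a minimal sketch of checking the same thing mechanically. It assumes the log line layout seen in the excerpt above and only looks for the two messages that appear there ("Attempting connection now" and "ACLK connection successfully established"); the function name `aclk_status` and the returned keys are made up for illustration and are not part of Netdata.

```python
import re

# Line layout assumed from the excerpt above:
# "<timestamp>: netdata <LEVEL>  : <thread> : <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata (?P<level>\w+)\s*: "
    r"(?P<thread>\S+) : (?P<msg>.*)$"
)


def aclk_status(lines):
    """Return the timestamps of the last connection attempt and, if that
    attempt was followed by a success message, when it was established.

    Hypothetical helper: it only knows the two messages seen in this thread.
    """
    attempted_at = None
    established_at = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like an agent log line
        msg = match.group("msg")
        if "Attempting connection now" in msg:
            attempted_at = match.group("ts")
            established_at = None  # a new attempt supersedes an earlier success
        elif "ACLK connection successfully established" in msg:
            established_at = match.group("ts")
    return {"attempted_at": attempted_at, "established_at": established_at}


# Three lines copied from the log excerpt in this thread.
SAMPLE = [
    "2022-02-02 20:15:01: netdata INFO  : ACLK_Main : Attempting connection now",
    "2022-02-02 20:15:02: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server",
    "2022-02-02 20:15:02: netdata INFO  : ACLK_Main : ACLK connection successfully established",
]

if __name__ == "__main__":
    print(aclk_status(SAMPLE))
```

On a real host, the same function could be fed an open file handle for /var/log/netdata/error.log instead of `SAMPLE`. An `established_at` value only says what the agent believes about its own connection, which is exactly the situation here: the agent reports a working connection while Cloud shows the node offline.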

Thank you,

From the log, it seems the agent is indeed connected to the cloud. I will ping the cloud guys to investigate further tomorrow.

In the meantime, it would be helpful if you could provide the claim_id (maybe as a PM).

Hello :wave:

Thanks a lot for reaching out, @Mbrantley.

I’m taking a look on the Cloud side, starting with the logs provided.
If something additional is needed I’ll let you know here in this thread.

Yes, same problem after update.

Hey @br0m, thanks a lot for your input. We’re looking into this issue and hopefully we’ll soon have something concrete to share.

Could you provide (in a DM) a node_id from your space, or even your space_id, so we can take a look?

Thanks in advance

Just wanted to mention that we identified the issue and that we will be pushing out a fix soon @Mbrantley, @br0m

We will be sharing updates in this thread.

Hello @Mbrantley and @br0m :wave:

We applied some fixes over the past few hours, and we expect those reachability inconsistencies to be resolved now.

Could you please check and let us know if everything is as expected?

Thanks in advance! :raised_hands:

@papazach

I just got back from vacation, but the device online/offline count already looks correct. I’ll go through and verify more thoroughly, but I do believe it is now resolved.
