I have recently switched to the new Netdata 1.30.0 version with ACLK-NG support. While doing so, I ran into a bunch of issues with agents not connecting to the cloud. This led me to reclaim my agents multiple times and finally to delete and recreate my Netdata Cloud account to get rid of all the dead agents in the UI.
Now all of my nodes fail to connect to the cloud. After enabling debug/tracing for ACLK, I found that the Netdata Agent is stuck in a loop, failing to get the challenge from Netdata Cloud. Trying to download the challenge using curl results in a potentially interesting response. (I realize that this might just be caused by missing headers/body, but the error seems potentially relevant.)
This is the case for all three of my Netdata agents.
Environment/Browser
I would have liked to include buildinfo and logs here, but the forum software flags them as link spam and won't let me post them.
Before upgrading, I was on 1.29.3; the bug occurs on 1.30.0 with ACLK-NG.
I think what I said about the broken challenge download was a red herring; I resolved that by reclaiming my agent again. The relevant part of the debug log seems to be the following:
Apr 05 21:24:40 nixos-laptop netdata[28626]: Attempting connection now
Apr 05 21:24:40 nixos-laptop netdata[28626]: Setting ACLK target host=app.netdata.cloud port=443 from https://app.netdata.cloud
Apr 05 21:24:40 nixos-laptop netdata[28626]: Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/NODE_ID/challenge
Apr 05 21:25:41 nixos-laptop netdata[28626]: No response available - SSL_read()=0
Apr 05 21:25:41 nixos-laptop netdata[28626]: Challenge failed:
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: MQTT Connection refused "The data in the user name or password is malformed"
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: Error mqtt_sync
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: Error connecting to MQTT WSS server "app.netdata.cloud", port 443.
Apr 05 21:25:42 nixos-laptop netdata[28626]: Connect failed
Apr 05 21:25:42 nixos-laptop netdata[28626]: Wait before attempting to reconnect in 1.837 seconds
I built Netdata without the provided installer script, so could this be an issue with OpenSSL or another library? (I have also tried building with an older version of OpenSSL, and with LibreSSL.) Should I open an issue about this?
Just to clarify: the problem here seems to be the failed challenge (i.e. an issue with the HTTP client used for the challenge/response exchange before the MQTT connection is made).
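To check whether the challenge request fails only inside the agent's HTTP client, the same request could be made with an independent TLS client. The Python sketch below is hypothetical: the host, port and path come from the debug log above, while the GET method, the minimal headers and the helper names (`build_challenge_request`, `summarize_response`, `fetch_challenge`) are assumptions for illustration. Without whatever headers the agent normally sends, an HTTP error status is expected; the point is only to see whether any response arrives at all. Reading zero bytes corresponds to `SSL_read()=0` in the log, i.e. the server closed the connection without answering. The log shows about a minute between the request and the failure, so the read timeout is set above that.

```python
import socket
import ssl

# Path copied from the ACLK debug log; replace NODE_ID with the real node ID.
# Using GET with only these headers is an assumption, not the agent's exact request.
CHALLENGE_PATH = "/api/v1/auth/node/{node_id}/challenge"


def build_challenge_request(host: str, node_id: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request for the (assumed) challenge endpoint."""
    path = CHALLENGE_PATH.format(node_id=node_id)
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close after responding
        "",
        "",  # blank line terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def summarize_response(raw: bytes) -> str:
    """Describe what came back: nothing (peer closed) or an HTTP status code."""
    if not raw:
        # Same situation as SSL_read() returning 0 in the agent log:
        # the peer closed the connection before sending any data.
        return "closed without response"
    status_line = raw.split(b"\r\n", 1)[0].decode("latin-1")
    parts = status_line.split(" ", 2)
    if len(parts) >= 2 and parts[0].startswith("HTTP/"):
        return f"HTTP {parts[1]}"
    return f"unexpected data: {status_line!r}"


def fetch_challenge(host: str, node_id: str, port: int = 443,
                    timeout: float = 90.0) -> bytes:
    """Send the request over TLS and read until the server closes the connection."""
    context = ssl.create_default_context()  # verifies the server certificate
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(build_challenge_request(host, node_id))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:  # EOF: server closed the connection
                    break
                chunks.append(data)
    return b"".join(chunks)
```

For example, `print(summarize_response(fetch_challenge("app.netdata.cloud", "NODE_ID")))` with the real node ID substituted. If this prints an HTTP status while the agent keeps logging `SSL_read()=0`, that would point at the agent's HTTP client or its TLS library rather than the network or the cloud side.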