Netdata 1.30.0 Agents fail to connect to Cloud

Problem/Question

I have recently switched to the new Netdata 1.30.0 version with ACLK-NG support. While doing so, I ran into a number of issues with agents not connecting to the cloud, which led me to reclaim my agents multiple times and finally to delete and reopen my Netdata Cloud account to get rid of all the dead agents in the UI.

Now all of my nodes fail to connect to the cloud. After enabling debug/tracing for ACLK, I have figured out that the Netdata Agent is stuck in a loop, failing to get the challenge from Netdata Cloud. Trying to download the challenge using curl results in a potentially interesting response. (I realize that this might just be caused by missing headers/body, but the error seems potentially relevant.)
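For anyone who wants to repeat the curl check from a script, here is a minimal Python sketch. The path is the one the agent prints in its ACLK debug log; `NODE_ID` is a placeholder for the ID of a claimed node. The helper names are made up for illustration, and the assumption that a plain GET without extra headers resembles what the agent sends may well be wrong.

```python
import urllib.request


def challenge_url(host: str, node_id: str, port: int = 443) -> str:
    """Build the challenge URL in the shape shown in the agent's debug log."""
    return f"https://{host}:{port}/api/v1/auth/node/{node_id}/challenge"


def fetch_challenge(host: str, node_id: str, timeout: float = 10.0):
    """Send a plain GET and return (status, body).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError (or a timeout error) for network failures.
    """
    url = challenge_url(host, node_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # NODE_ID is a placeholder; substitute the ID of a claimed node
    # before calling fetch_challenge().
    print(challenge_url("app.netdata.cloud", "NODE_ID"))
```

The explicit timeout matters here: without it, a request that never gets an answer would simply hang instead of reporting a failure.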

This is the case for all three of my Netdata agents.

Environment/Browser

I would have liked to include buildinfo and logs here, but the forum software recognizes them as link spam and disallows me from doing so.

What I expected to happen

My agents should connect to the Netdata cloud.

Hi @leotaku :wave:

I have recently switched to the new Netdata 1.30.0 version

Which version did you use?

I would have liked to include buildinfo and logs here, but the forum software recognizes them as link spam and disallows me from doing so.

This looks like a bug to me; I suggest opening an issue in the netdata repo.

Thanks @ilyam8!

Which version did you use?

Before upgrading, I used 1.29.3; the bug occurs on 1.30.0 with ACLK-NG.

I think my talk about the broken challenge was a red herring. I resolved it by reclaiming my agent again. The relevant part of the debug log seems to be the following:


Apr 05 21:24:40 nixos-laptop netdata[28626]: Attempting connection now
Apr 05 21:24:40 nixos-laptop netdata[28626]: Setting ACLK target host=app.netdata.cloud port=443 from https://app.netdata.cloud
Apr 05 21:24:40 nixos-laptop netdata[28626]: Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/NODE_ID/challenge
Apr 05 21:25:41 nixos-laptop netdata[28626]: No response available - SSL_read()=0
Apr 05 21:25:41 nixos-laptop netdata[28626]: Challenge failed:
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: MQTT Connection refused "The data in the user name or password is malformed"
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: Error mqtt_sync
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: Error connecting to MQTT WSS server "app.netdata.cloud", port 443.
Apr 05 21:25:42 nixos-laptop netdata[28626]: Connect failed
Apr 05 21:25:42 nixos-laptop netdata[28626]: Wait before attempting to reconnect in 1.837 seconds
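Two things stand out in this log: the challenge request waits about a minute before it is reported as failed, and the MQTT refusal only shows up after that failure. A small Python sketch that pulls both facts out of such a log is below. The log lines are copied from the excerpt above; the function names and the regular expression are illustrative assumptions, not part of Netdata.

```python
import re
from datetime import datetime

# Lines copied from the debug log above (trimmed to the relevant ones).
LOG = """\
Apr 05 21:24:40 nixos-laptop netdata[28626]: Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/NODE_ID/challenge
Apr 05 21:25:41 nixos-laptop netdata[28626]: No response available - SSL_read()=0
Apr 05 21:25:41 nixos-laptop netdata[28626]: Challenge failed:
Apr 05 21:25:42 nixos-laptop netdata[28626]: [mqtt_wss] E: MQTT Connection refused "The data in the user name or password is malformed"
"""

# Matches the prefix seen above: "Apr 05 21:24:40 host netdata[pid]: message".
LINE = re.compile(r"^\w{3} \d{2} (\d{2}:\d{2}:\d{2}) \S+ netdata\[\d+\]: (.*)$")


def parse(log):
    """Return (time, message) pairs. Only the time of day is parsed,
    so all lines are assumed to fall on the same day."""
    events = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            events.append((datetime.strptime(m.group(1), "%H:%M:%S"), m.group(2)))
    return events


def challenge_wait_seconds(events):
    """Seconds between 'Retrieving challenge' and the next 'Challenge failed'."""
    start = None
    for ts, msg in events:
        if msg.startswith("Retrieving challenge"):
            start = ts
        elif msg.startswith("Challenge failed") and start is not None:
            return (ts - start).total_seconds()
    return None


def mqtt_refused_after_failed_challenge(events):
    """True if an MQTT refusal is logged after a failed challenge."""
    failed = False
    for _, msg in events:
        if msg.startswith("Challenge failed"):
            failed = True
        elif failed and "MQTT Connection refused" in msg:
            return True
    return False


if __name__ == "__main__":
    events = parse(LOG)
    print(challenge_wait_seconds(events))
    print(mqtt_refused_after_failed_challenge(events))
```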

I have built Netdata without using the provided installer script, so maybe this is an issue with OpenSSL or another library? (Although I have also tried building with an older version of OpenSSL and also LibreSSL). Should I open an issue about this?

@underhood please take a look

Hi, I have fixed an issue like this (one that causes “The data in the user name or password is malformed”) in an incoming PR that adds a new HTTP client: “implements new https client for ACLK”, Pull Request #10805 in netdata/netdata on GitHub.

Just to clarify: the problem here seems to be the failed challenge, i.e. an issue with the HTTP client used for the challenge/response before the MQTT connection is made.
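To illustrate that ordering, here is a rough Python sketch of a two-stage connect. It is not Netdata's actual code: `staged_connect`, `fetch_credentials` and `mqtt_connect` are hypothetical names, and the idea that the MQTT credentials come out of the challenge step is an assumption based on the description above. The point is only that when the first stage fails, reporting it directly is more informative than a later MQTT error about malformed credentials.

```python
from typing import Callable, Optional


def staged_connect(
    fetch_credentials: Callable[[], Optional[str]],
    mqtt_connect: Callable[[str], bool],
) -> str:
    """Run the challenge step first and attempt MQTT only if it succeeded.

    Both callables are hypothetical stand-ins:
    - fetch_credentials() performs the HTTP challenge/response and returns
      credentials, or None if no usable response arrived.
    - mqtt_connect(credentials) returns True if the broker accepted them.
    """
    credentials = fetch_credentials()
    if credentials is None:
        # Stop here so the failure is attributed to the challenge step.
        return "challenge_failed"
    return "connected" if mqtt_connect(credentials) else "mqtt_refused"


if __name__ == "__main__":
    # Simulate the situation in the log: the challenge yields nothing.
    print(staged_connect(lambda: None, lambda creds: True))
```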

Thank you @underhood, running Netdata built from a checkout of the PR you linked indeed fixes my issue!

@OdysLam maybe you can take a look at this?

Environment/Browser

I would have liked to include buildinfo and logs here, but the forum software recognizes them as link spam and disallows me from doing so.

I think the issue happens when a user tries to post code (a long list of multiple lines) without a code block.

Let me see how to adjust this, since it seems to be a false positive most of the time.

@leotaku the PR fixing this has just been merged, so it should be fixed in the next nightly and the next full release.

Thanks for reporting and confirming this indeed fixes your issue.