ErrAlreadyConnected when agent logging in to cloud

Problem/Question

I found a few nodes showing as unreachable on Netdata Cloud and, after a quick restart of Netdata didn’t fix it, looked at their /var/log/netdata/error.log to see why.

2022-01-25 08:31:38: netdata INFO  : ACLK_Main : Setting ACLK target host=app.netdata.cloud port=443 from https://app.netdata.cloud
2022-01-25 08:31:38: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2022-01-25 08:31:38: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/945ebf08-1565-11ec-80ad-56bffb5449fd/challenge
2022-01-25 08:31:38: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2022-01-25 08:31:39: netdata ERROR : ACLK_Main : Challenge failed: {"errorCode":"TODO trace-id","errorMsgKey":"ErrAlreadyConnected","errorMessage":"node already connected","errorNonRetryable":false,"errorRetryDelaySeconds":0}

This is on 4 different nodes, the other IDs are b3e40ef0-1565-11ec-98fe-3e218f5f8a8a, d547bb00-1565-11ec-bb9f-e6b3a9af6db2, and db6141fa-1565-11ec-8978-62cebe3c9fe1. Oddly a 5th server from the same batch, c7984d7c-5399-11ec-81ce-425b9a6e13b2, is working.

I think I’m just going to re-register these nodes, but would like to track down the bug to prevent it from happening again.

Environment/Browser

Ubuntu 16.04, Netdata 1.31.0 from packagecloud.io Xenial repository

Hi @derobert_work !

Most likely, the cloud is detecting either the same machine_guid or the same claim_id for these nodes. These IDs should, in principle, be unique among nodes.

Are the IDs you mention machine_guids?

Can you check each node’s /api/v1/info page to see what their claim IDs are?
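For example, something along these lines should print them (this assumes the agent’s API is listening on the default port 19999 and that jq is installed; the exact field layout can vary a bit between versions):

# Print the uid and the claim id(s) the local agent reports
# (default API port 19999 assumed; jq is only used for readability)
curl -s http://localhost:19999/api/v1/info | jq '{uid: .uid, mirrored_hosts_status: [.mirrored_hosts_status[] | {guid, claim_id}]}'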

Are those nodes by any chance Docker or VM nodes that were copied (and are thus retaining the same machine_guids or claim IDs)?

Thanks!

Those UUIDs I posted were from /var/lib/netdata/cloud.d/claimed_id, so I think those were the claim IDs.

That’s the API on the individual Netdata instance, not app.netdata.cloud, right? If so, unfortunately, I already reset them (rm -Rf /var/lib/netdata/cloud.d/ /var/lib/netdata/registry/netdata.public.unique.id, restart Netdata, re-run the cloud claim)… so I’m not sure it’ll be useful.
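For the record, the reset on each node was roughly this (the claim token and room below are placeholders, not our real values):

# Wipe the claim state and the agent's unique id, then restart and re-claim
# (placeholder token/room values)
rm -Rf /var/lib/netdata/cloud.d/ /var/lib/netdata/registry/netdata.public.unique.id
systemctl restart netdata
netdata-claim.sh -token=<claim token> -rooms=<room id> -url=https://app.netdata.cloud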

Anyway, I checked #1 via the API, and its uid, mirrored_hosts_status.guid, and mirrored_hosts_status.claim_id are all 2e669cc8-7dfe-11ec-9cc5-56bffb5449fd, which is the same as what’s (now) in that file.

These are VMs, but they’re not cloned (nor are they on the same physical hardware). The way they were built is that Proxmox was installed, a VM was created, and then the Ubuntu installer was run, all manually. Finally, knife bootstrap was run to install Chef, and one of the Chef cookbooks downloads and installs Netdata, starts it, waits for the netdata.public.unique.id file to appear, then runs netdata-claim.sh. For the servers it broke on, I think that was done on 2021-09-14. For the one server in the group that was working, I think it was more recent (probably 2021-12-02).
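The claiming part of that cookbook is roughly equivalent to this shell sketch (the token and room values are placeholders here; the real recipe pulls them from Chef attributes):

# Rough shell equivalent of the cookbook's claim step
# (placeholder token/room; the real values come from Chef attributes)
systemctl start netdata
until [ -s /var/lib/netdata/registry/netdata.public.unique.id ]; do
    sleep 1
done
netdata-claim.sh -token=<claim token> -rooms=<room id> -url=https://app.netdata.cloud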

If we need more info… I can try to look around to see if we have any other servers with this log message. Unfortunately the Netdata log isn’t fed into Sumologic, so I have to search manually :frowning:
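Worst case I’d do something crude like this across our host list (the host names here are placeholders):

# Crude manual sweep for the same error across a list of hosts
# (placeholder host names; substitute the actual inventory)
for h in host1 host2 host3; do
    ssh "$h" grep -q ErrAlreadyConnected /var/log/netdata/error.log && echo "$h has the error"
done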

Hi @derobert_work, sorry for the delay. So the errors you are getting are after the deletion of those files, right?

@Manolis_Vasilakis Before deleting them.

Deleting those files and re-registering the node gets it to work (under a new node ID, of course).