Netdata Community

Nodes shown as unreachable

Problem/Question

Hi,

app.netdata.cloud shows my nodes as unreachable, although they were working fine in the past.

I have tried to remove the nodes and re-add them without success.
I have also tried to delete the war room and re-add the nodes again without success.

In both cases, since the node name had already been used in the past, I had to add the parameter “-id=$(uuidgen)” to the claim commands.

On the web I am seeing the error:

"Agent needs update

Please update your agent to the version 1.26 to use the new Cloud Overview!"

but my agents are updated and have the version :

"Your netdata version: v1.31.0-1-nightly"

as reported when I connect to them directly on port 19999.

So it seems to me that my local installation is working fine but Netdata Cloud is not.

The same behavior occurs even when the firewall is down.

This is what I see in the error.log :

2021-05-23 23:02:18: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2021-05-23 23:02:18: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/ec7dd3cb-457b-408f-b632-78b5c26f409f/challenge
2021-05-23 23:02:18: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2021-05-23 23:02:18: netdata ERROR : ACLK_Main : Challenge failed:  (errno 99, Cannot assign requested address)
2021-05-23 23:02:18: netdata INFO  : ACLK_Main : Retrying to establish the ACLK connection in 594.769 seconds

Environment/Browser

I am having the same problem with the latest versions of both Firefox (88.0.1) and Chrome (90.0.4430.212) under a fully updated Windows 10.

EDIT: My agents are installed on the nodes that I would like to monitor, which run a fully updated CentOS Linux release 8.3.2011.

What I expected to happen

I was expecting to be able to see statistics and node reports directly from Netdata Cloud.

Any ideas what I should do to make them available again?

Regards

Can you please provide info from netdata -W buildinfo ?

Hi @underhood. Here is the output that you have requested:

Output of `> netdata -W buildinfo`
Version: netdata v1.31.0-1-nightly
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws=externaldeps/libwebsockets' '--with-libJudy=externaldeps/libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
    dbengine:                YES
    Native HTTPS:            YES
    Netdata Cloud:           YES
    Cloud Implementation:    Legacy
    TLS Host Verification:   YES
Libraries:
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

I would really like to make this work.

Any ideas or things that I could try?

Well, the error comes from libwebsockets, which we use to get the OTP/credentials so the agent can connect to the cloud.

It seems that while we try to make the HTTP GET call we get a system error (errno 99, Cannot assign requested address), but I am not sure why we would get this on an outbound TCP connection.

I will check whether I can reproduce the issue on CentOS 8. Another thing we could try is compiling Netdata with the system libwebsockets, if you have one available on the system, just to see whether a newer LWS helps here.

How many TCP connections does the box have? Could you be running out of ephemeral ports?

Not really…The number is approximately 250-300 depending on the server.
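For anyone who wants to rule out ephemeral port exhaustion on their own node, here is a minimal sketch, assuming a Linux box with `ss` and `awk` available. The helper names `ephemeral_range_size` and `tcp_socket_count` are made up for illustration; they only compare the size of the kernel's local port range with the number of TCP sockets currently open.

```shell
# Size of the ephemeral port range, read as "LOW HIGH" on stdin
# (the format of /proc/sys/net/ipv4/ip_local_port_range on Linux).
ephemeral_range_size() {
    awk '{ print $2 - $1 + 1 }'
}

# Count TCP sockets in any state; `ss -tan` prints one header line first.
tcp_socket_count() {
    ss -tan | tail -n +2 | wc -l
}

# Usage on the affected node:
#   ephemeral_range_size < /proc/sys/net/ipv4/ip_local_port_range
#   tcp_socket_count
```

If the socket count is a small fraction of the range size, as the 250-300 connections reported here would be against a typical range, port exhaustion is an unlikely explanation.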

I have upgraded to the latest version without any change.

Output of netdata -W buildinfo
Version: netdata v1.31.0-22-nightly
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws=externaldeps/libwebsockets' '--with-libJudy=externaldeps/libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
    dbengine:                YES
    Native HTTPS:            YES
    Netdata Cloud:           YES
    Cloud Implementation:    Legacy
    TLS Host Verification:   YES
Libraries:
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

I have also tried deleting the “cloud.d” directory, again without any success in connecting to the cloud. I did that in conjunction with deleting the nodes and the War Room, reclaiming the nodes and creating a new War Room, but still no success.

The weird thing is that the cloud app still thinks that I am on v1.26. Is there any possibility that it has somehow cached the server name or any other info? But then again, this doesn’t explain the connection failure…

In terms of connectivity, DNS for “app.netdata.cloud” resolves correctly and “telnet” to port 443 is also successful.

And once again the local runners on port 19999 are working properly on all nodes :slight_smile:

Any hints before I try recompiling with the system’s libwebsockets? Should I remove the existing installation first?

The cloud connection is not made in your case (it fails at the HTTP calls that precede the actual connection to the cloud). Therefore the cloud has no way of knowing what your current version is. It shows the last cached information.

No ideas other than:
a) we are getting an OS error (errno 99, Cannot assign requested address), which I found online can happen in case of ephemeral port exhaustion.
b) trying a newer libwebsockets, if the system provides one, by passing --use-system-lws to netdata-installer.sh
c) trying ACLK-NG by passing --aclk-ng to the netdata installer. This will enable the brand new Cloud connection stack, bypassing libwebsockets completely.

Both b) and c) will, however, help only if libwebsockets is the culprit and is not just reporting an underlying system issue.

In general, to narrow down the root cause, we could provide the info required to construct the initial HTTPS call (GET/POST, URL, params and where the values can be found), so that @idet2 can run curl or wget on one of the machines to see if it works that way. These commands have a lot of debugging options that can help.

But some research shows errno 99 is a very specific type of error that tells us we’re trying to bind to an IP/port that can’t be used, usually because of an invalid assumption regarding the host IP (e.g. localhost/127.0.0.1 in containers) or because there’s something already using that IP/port combination. I don’t know enough about the exact process to be able to help more, but I would focus on what the agent needs in order to establish that connection, and verify, perhaps by checking active connections or interfaces, that it can really be used on the particular systems.

Very interesting case!

Hi @Christopher_Akritid1 ,

providing this info would be very useful to see what’s going on and identify the reason why this is happening.

I’m reporting the same problem as the original poster. I have installed netdata across my fleet of 63 RHEL 6 & 7 servers. Sometime overnight between May 18 and May 20 (I can’t recall exactly which night), 55 of my servers became unreachable in Netdata Cloud. These servers all used the one-liner installation script with the argument to auto-update the agent to stable builds. The remaining 8 servers that are still functional installed Netdata via RPM and are not configured to auto-update. One night I went to bed and everything was fine. The next morning 55 servers were unreachable, leading me to believe the agent updated overnight and a bug was introduced in this version.

A similar issue was reported recently here: https://community.netdata.cloud/t/claimed-nodes-not-reporting-to-cloud/1282/5 but I don’t know if that alleged solution is really accurate. Additionally, I had reported a similar issue last fall when first trying to get onboard with Netdata Cloud: https://github.com/netdata/netdata/issues/9624. Others suspected it was an issue with CA root certs but there was no definitive answer and a subsequent update to the agent seemed to fix the problem. A few other similar reports as well from the past: https://github.com/netdata/netdata/issues/9206 and https://github.com/netdata/netdata/issues/8966.

# /opt/netdata/bin/netdata -W buildinfo
Output of: /opt/netdata/bin/netdata -W buildinfo
Version: netdata v1.31.0
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--with-bundled-lws=externaldeps/libwebsockets' '--with-libJudy=externaldeps/libJudy' 'CFLAGS=-static -O3 -I/openssl-static/include' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Features:
    dbengine:                YES
    Native HTTPS:            YES
    Netdata Cloud:           YES
    Cloud Implementation:    Legacy
    TLS Host Verification:   YES
Libraries:
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    NO
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Output of `error.log` filtered for aclk
# error.log
2021-05-27 20:47:34: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2021-05-27 20:47:34: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/e8b2efa0-b9d7-11eb-a1df-b8ca3a6562a0/challenge
2021-05-27 20:47:34: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2021-05-27 20:47:34: netdata ERROR : ACLK_Main : Libwebsockets: Unable to open socket
 (errno 97, Address family not supported by protocol)
2021-05-27 20:48:05: netdata ERROR : ACLK_Main : Servicing LWS took too long.
2021-05-27 20:48:05: netdata ERROR : ACLK_Main : Challenge failed:  (errno 22, Invalid argument)
2021-05-27 20:48:05: netdata INFO  : ACLK_Main : Retrying to establish the ACLK connection in 1024.000 seconds
2021-05-27 21:05:09: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2021-05-27 21:05:09: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/e8b2efa0-b9d7-11eb-a1df-b8ca3a6562a0/challenge
2021-05-27 21:05:09: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2021-05-27 21:05:09: netdata ERROR : ACLK_Main : Libwebsockets: Unable to open socket
 (errno 97, Address family not supported by protocol)
2021-05-27 21:05:40: netdata ERROR : ACLK_Main : Servicing LWS took too long.
2021-05-27 21:05:40: netdata ERROR : ACLK_Main : Challenge failed:  (errno 22, Invalid argument)
2021-05-27 21:05:40: netdata INFO  : ACLK_Main : Retrying to establish the ACLK connection in 1024.000 seconds
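
To compare logs like the one above across several nodes, a small filter can help. This is a minimal sketch, assuming Linux errno numbering (22 is EINVAL, 97 is EAFNOSUPPORT, 99 is EADDRNOTAVAIL). The helper names `errno_name` and `summarize_errnos` are hypothetical, and the log path in the usage comment is an assumption that depends on the install prefix.

```shell
# Map the errno values seen in this thread to their Linux symbolic names.
errno_name() {
    case "$1" in
        22) echo EINVAL ;;
        97) echo EAFNOSUPPORT ;;
        99) echo EADDRNOTAVAIL ;;
        *)  echo UNKNOWN ;;
    esac
}

# Read log text on stdin and print each distinct errno with its name.
summarize_errnos() {
    grep -o 'errno [0-9]*' | awk '{ print $2 }' | sort -n | uniq |
    while read -r code; do
        echo "$code $(errno_name "$code")"
    done
}

# Usage (the log path is an assumption; adjust it to your install prefix):
#   grep ACLK_Main /var/log/netdata/error.log | summarize_errnos
```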

Hope this additional info can help.

@Ryan_S_Di_Francesco thank you for reporting. This seems to be a different problem from both of the issues you mentioned. It is not related to certificates. This is the relevant part of the log:

Whilst @idet2 has:

: ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)

That of course doesn’t mean they can’t have the same root cause.

I will test on CentOS 8 ASAP to see if I can reproduce the problems. I am not sure RHEL is available without a license though.

@idet2 did you try any of the options I mentioned previously? (newer LWS)

The first call made in the case of Legacy ACLK would be:

  • GET to /api/v1/auth/node/%s/challenge (%s being your claimed ID, usually found in /var/lib/netdata/cloud.d/claimed_id)

followed by:

  • POST to /api/v1/auth/node/%s/password

You both seem to be unable to make the first GET call, for different reasons.

And where do they find the claim ID to complete the first URL?

In /var/lib/netdata/cloud.d/claimed_id or, if Netdata was installed with a prefix, in another directory, for example /opt/netdata/var/lib/netdata/cloud.d/claimed_id
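
The first call can be reproduced by hand with curl. The following is a minimal sketch, assuming curl is installed and the claimed ID file is readable by the current user (it may need root or the netdata user). The function names `challenge_url` and `fetch_challenge` are made up for illustration; the host and path are the ones shown in the agent log.

```shell
# Build the challenge URL for a claimed ID (host and path as seen in error.log).
challenge_url() {
    printf 'https://app.netdata.cloud/api/v1/auth/node/%s/challenge\n' "$1"
}

# Read the claimed ID from a file and request a challenge with verbose output.
# The default path is the one mentioned above; pass another path as $1 if
# Netdata was installed with a prefix.
fetch_challenge() {
    id_file="${1:-/var/lib/netdata/cloud.d/claimed_id}"
    claimed_id=$(tr -d '[:space:]' < "$id_file") || return 1
    curl --silent --show-error --verbose "$(challenge_url "$claimed_id")"
}

# Usage:
#   fetch_challenge
#   fetch_challenge /opt/netdata/var/lib/netdata/cloud.d/claimed_id
```

The verbose output shows which address curl connects to, which can be compared with the address family errors in the agent log.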

Hello! This is not the case in my situation…

Here is the result of GET which seems successful (for “%s” I have used the “claimed_id” as suggested):

GET Result :

curl --request GET https://app.netdata.cloud/api/v1/auth/node/%s/challenge
{"challenge":"CRhzWLxbFZnlCDon08wDEHS6OTZ1povfIKqwUtVmIT26PNaNOyX4Z3JXN7eXv900qX0pxpY0kQO4QX5aeNeQBLYEdlhESuiKXyf2SClV08UtwZeA9M+YlC1w1appI4/sb3el+ylmVFIG63wDZqNydE8SBejDhrwq+Ogdsnf0J03EbFeLte2dv7ygw8IA4/4/LjufyA6/vc7qkBEU183M0LxbtCJqCHdOzCnkCqHuYy91LMiI6d6977v+KwYtB3gNOvabaHCeIgINUn9QfjVGL9BUjfMGaY/p/qACgDJTpnDuJGACjB4yJyc78zWgj10nbWHAYDsoLH+A3i3TwtdRiQ=="}

As for the POST, should I include something more for it to succeed?

POST Result :

curl --request POST https://app.netdata.cloud/api/v1/auth/node/%s/password
{"errorCode":"TODO trace-id","errorMsgKey":"ErrBadRequest","errorMessage":"EOF","errorNonRetryable":false,"errorRetryDelaySeconds":0}

As for your previous question: unfortunately I didn’t have time to try it. If it is still necessary after the above results, I will find a time slot to do it.

Best regards

@idet2 what was your install method?
I have tried installing from source on CentOS Linux release 8.3.2011, AMD64, 4.18.0-240.22.1.el8_3.x86_64, and it seems to be working without an issue so far. There must therefore be something specific that triggers the issue…

`netdata -W buildinfo`
Version: netdata v1.31.0-31-gc9c9d92
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws=externaldeps/libwebsockets' 'CFLAGS=-O2' 'LDFLAGS='
Features:
    dbengine:                NO
    Native HTTPS:            YES
    Netdata Cloud:           YES 
    Cloud Implementation:    Legacy
    TLS Host Verification:   YES
Libraries:
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    NO
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

I am asking in order to check whether it is broken only with a specific install method.

@underhood: I have used the “one line” installer and its relevant updates.

Can you give me more information regarding the POST request? Do I need a specific payload and what? Since the GET worked I would also like to make sure and try to verify with the next step (POST).

Thanks!

@idet2 well for POST it is a bit more complicated.

We base64-decode it (the result from GET), decrypt it using the key from claiming (/var/lib/netdata/cloud.d/private.pem), base64-encode the result, and send it as JSON in the POST payload:

{ "response":"[decrypted challenge here]" }

The result is the OTP used by the agent to connect to the cloud.
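
The steps described above can be sketched with standard tools. This is only an illustration, assuming GNU `base64` and OpenSSL's `pkeyutl` are available. The RSA padding mode (OAEP) is an assumption that is not stated in this thread, so check the agent source if decryption fails. The function names `password_payload` and `solve_challenge` are hypothetical.

```shell
# Wrap an already base64-encoded response in the JSON body described above.
password_payload() {
    printf '{"response":"%s"}\n' "$1"
}

# Read the base64 challenge on stdin, decrypt it with the claiming key,
# and print the result base64-encoded on a single line.
# ASSUMPTION: RSA with OAEP padding; the thread does not state the padding.
solve_challenge() {
    key="${1:-/var/lib/netdata/cloud.d/private.pem}"
    base64 -d |
        openssl pkeyutl -decrypt -inkey "$key" -pkeyopt rsa_padding_mode:oaep |
        base64 | tr -d '\n'
}

# Usage, with $challenge holding the "challenge" value from the GET result
# and $claimed_id holding the contents of claimed_id:
#   response=$(printf '%s' "$challenge" | solve_challenge)
#   curl --request POST --data "$(password_payload "$response")" \
#       "https://app.netdata.cloud/api/v1/auth/node/$claimed_id/password"
```

Each challenge is presumably meant for a single attempt, so a fresh GET before every POST is the safer approach.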

My guess, however, is that since the GET works, the POST will as well.

I will create a new CentOS VM and try to install using the one-line method you used.

The install method doesn’t seem to affect it. Do you use an HTTP proxy? Is your network connection direct?