Nodes unreachable (errno 99, Cannot assign requested address)

By the way…
During all these builds and installations I have noticed that your “install-required-packages.sh” script fails to install “Judy-devel” on its own, because it has the following URL hardcoded:
http://mirror.centos.org/centos/8/PowerTools/x86_64/os/Packages/Judy-devel-1.0.5-18.module_el8.1.0+217+4d875839.x86_64.rpm

which is no longer available at that repository (it produces a 404 error), since it has been replaced by newer builds (module_el8.3.0):

http://mirror.centos.org/centos/8/PowerTools/x86_64/os/Packages/Judy-devel-1.0.5-18.module_el8.3.0+599+c587b2e7.x86_64.rpm

OR

http://mirror.centos.org/centos/8/PowerTools/x86_64/os/Packages/Judy-devel-1.0.5-18.module_el8.3.0+757+d382997d.x86_64.rpm

Just an update…

After many hours the web page is now correctly showing the node and its data…

I still haven’t figured out why it was showing as “unreachable” for quite some time while the logs and alerts indicated that it was working without a problem.

Switching everything to “Next Generation” ACLK (aclk-ng) to verify that they also work…

In log Excerpt 1, HTTP code 409 means you have been temporarily banned by the cloud.

This happens when there have been too many auth requests in too short a time (or, my favorite annoyance during testing → having 2 agents running and kicking each other out).
Normally, waiting 2 minutes before another connection attempt solves the issue (it is a temporary ban which is lifted automatically once the connection attempts slow down).

However, it looks like communication is working with ACLK-NG, which is also confirmed by log Excerpt 2. So it seems the agent can connect to the cloud regardless of the state of /etc/hosts.

The line “2021-06-12 10:46:31: netdata INFO : ACLK_Main : MQTTWSS connection succeeded” means the agent connected to the cloud. As to why the cloud shows it as unreachable, we would have to check together with the cloud guys. It seems like an unrelated issue, e.g. a cloud outage?

@ferroin maybe we should look at this as separate issue

FYI “agents kicking each other out” happens when you clone a VM after netdata has run and the node has been claimed (read: connected). A couple of identifiers are stored on disk. If those identifiers are copied over to another VM by cloning the original one, both agents appear as one.

Of course, even if this were to happen, it has nothing to do with the effect of the /etc/hosts entry or with the version. @underhood have we identified any commits between 1.30.1 and 1.31 that could have anything to do with these symptoms? There are a lot of workarounds in this thread, but this mystery really bugs me. I was actually wondering if there’s anything the kernel could tell us (via eBPF) about what’s going on with that VM with the existential issues.

Well, the picture in my mind currently is as follows:

  • @idet2 reported that a different VM with the same OS works regardless of the cloud provider used (the same provider as the offending node or a different one).
  • Also, it looks like ACLK-NG can get through the authentication calls, although we have to be careful here, as ACLK-NG already makes some extra new calls (as per the new cloud architecture). From what @idet2 reported, it also seems ACLK-NG can connect to the cloud regardless of the state of /etc/hosts.

Therefore my current thinking is that the issue is:

  • limited to ACLK Legacy → libwebsockets or how we use it (as libwebsockets is used to do HTTP here). My custom implementation of an HTTP client in ACLK-NG seems to connect fine.
  • limited to something specific to this particular user’s system (it can still be an error on the Netdata side, e.g. some system configuration breaks netdata etc.)
  • when we switch Legacy->NG, apart from switching to different HTTP client code, we also use a slightly different authentication flow with the cloud (updated to the new cloud arch), so it could be that only the old auth is somehow failing

As for commits, the ideal would be to do a git bisect here, but that would help only if I could reproduce the issue. From browsing through the commits, only the one enabling IPv6 in libwebsockets stands out. There are others, but they are all ACLK-NG only.

Just to let you all know, I have verified that ACLK-NG works whether or not the /etc/hosts entry is used.

The problem seems to be with the old ACLK implementation.

In my opinion it looks like a combination of the systems’ configuration and the NetData version, since it cannot be reproduced on a brand new system (regardless of the cloud host), but at the same time a previous NetData version works on all of my systems.

@Christopher_Akritid1 : you mentioned something about eBPF… Do you want me to do something to further help to debug it?

@underhood : Did you manage to do a git bisect?

@all : My affected systems do not have any strange/bizarre configurations or libraries set. The only thing I can think of that differentiates them from a brand new system is that a few containers are running (around 3-5 per system) with extra isolation networks, bridge interfaces and their corresponding firewall rules, if that makes any difference at all.

@all : Is there a fast way to revert to the “old” ACLK and at the same time enable “debug” mode? Would debug provide more information on why this is happening (like which interface it is trying to bind to, etc.)?

My thinking here is that it’s clearly a networking issue, so perhaps the combination of the IPv6 enablement, your special network setup with the containers you mentioned, and perhaps the fact that IP reverse lookup returns the Google hostname are the most relevant hints. One thing I would try is to disable IPv6 on the legacy ACLK and see if it no longer needs the host entry. If that is proved to be related, I would then dive deeper on how the combination of the specific feature with the network setup on the specific VM could result in the behavior we saw. I’ve personally never traced such issues in such detail before, so I can’t be of more help than that.

Hi @Christopher_Akritid1 and thanks for your input.

As we have already said in previous posts, IPv6 is available but fully disabled on the host, so IPv6 does not seem to play a role here. At that point (when we were discussing IPv6) I did some tests and enabled it, but without routing any traffic over it… Just the loopback and the main interface could also be seen on their IPv6 addresses. The result was the same.

I mentioned docker and its network stack because I don’t really know how LWS expects to work… Does it expect to find one interface? How does it handle bridging? Where does it try to bind? On the lowest network, or is it specific to an interface? I mean, all of these have been manipulated by the docker containers, but then again this should be an issue everywhere, so it’s rather odd.

I don’t really like leaving things the way they are because, as said before, it could occur again or I could face something similar. And since it works on the same hosts with a previous NetData version, all things point to a change in your code, no matter what the underlying host configuration is…

On the other hand, I understand that you guys have spent a lot of time on this, and since it works with the “Next Generation” it may be pointless to look for a solution. That’s why I am trying to help and volunteering to do things to eliminate the differences and reproduce it, so that you can nail it. Hence, if you have any ideas that might be useful, or if I could be of any assistance, let me know.

As if we didn’t have enough issues so far…

ACLK-NG is failing as well… I didn’t do anything to mess with it in the last couple of days, after it was confirmed to be working.

Nothing has changed on any of the systems.

Now I get:

error.log

# cat error.log | grep ACLK
2021-06-16 04:44:44: netdata INFO : MAIN : EXIT: Stopping main thread: ACLK_Main
2021-06-16 04:44:45: netdata ERROR : ACLK_Main : Preparing to Gracefully Shutdown the ACLK
2021-06-16 04:44:45: netdata INFO : ACLK_Query_0 : thread with task id 1153723 finished
2021-06-16 04:44:45: netdata ERROR : ACLK_Main : Got PUBACK for shutdown message. Can exit gracefully.
2021-06-16 04:44:45: netdata ERROR : ACLK_Main : MQTT App Layer disconnect message sent successfully
2021-06-16 04:44:45: netdata ERROR : ACLK_Main : Attempting to Gracefully Shutdown MQTT/WSS connection
2021-06-16 04:44:45: netdata INFO : ACLK_Stats : thread with task id 1153722 finished
2021-06-16 04:44:45: netdata INFO : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2021-06-16 04:44:45: netdata INFO : ACLK_Query_1 : thread with task id 1153724 finished
2021-06-16 04:44:45: netdata INFO : ACLK_Main : thread with task id 1153439 finished
2021-06-16 04:44:53: netdata INFO : ACLK_Main : thread created with task id 1607873
2021-06-16 04:44:53: netdata INFO : ACLK_Main : set name of thread 1607873 to ACLK_Main
2021-06-16 04:44:53: netdata INFO : ACLK_Main : Waiting for netdata to be ready
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Waiting for Cloud to be enabled
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Starting ACLK popcorn timer for host “REMOVED” with GUID “REMOVED”
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Waiting for netdata to be claimed
2021-06-16 04:44:54: netdata INFO : ACLK_Stats : thread created with task id 1608096
2021-06-16 04:44:54: netdata INFO : ACLK_Stats : set name of thread 1608096 to ACLK_Stats
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Setting ACLK target host=app.netdata.cloud port=443 from https://app.netdata.cloud
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 04:44:54: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 04:44:54: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 0.000 seconds
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 04:44:54: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 04:44:54: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 04:44:54: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 1.253 seconds
2021-06-16 04:44:54: netdata INFO : GLOBAL_STATS : Restarting ACLK popcorn timer for host “REMOVED” with GUID “REMOVED”
2021-06-16 04:44:55: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 04:44:55: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 04:44:55: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 04:44:55: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 04:44:55: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 2.081 seconds
2021-06-16 04:44:57: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 04:44:57: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 04:44:57: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 04:44:57: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 04:44:58: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 5.302 seconds
2021-06-16 05:05:32: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 05:05:32: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 05:05:32: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 05:05:32: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 05:05:33: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 1024.000 seconds
2021-06-16 05:22:37: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 05:22:37: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 05:22:37: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 05:22:37: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)
2021-06-16 05:22:37: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 1024.000 seconds
2021-06-16 05:39:41: netdata INFO : ACLK_Main : Attempting to establish the agent cloud link
2021-06-16 05:39:41: netdata INFO : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/REMOVED/challenge
2021-06-16 05:39:41: netdata INFO : ACLK_Main : aclk_send_https_request GET
2021-06-16 05:39:41: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)

Of course hosts are once again unreachable.
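
For anyone who wants to summarize a log like the one above instead of reading it line by line, here is a minimal Python sketch. It assumes the message formats seen in this excerpt ("Challenge failed: (errno N, text)" and "Retrying to establish the ACLK connection in X seconds"); the function name summarize_aclk_log is made up for illustration and is not part of Netdata.

```python
import re
from collections import Counter

# Patterns assume the message formats seen in the error.log excerpt above.
FAILURE = re.compile(r"Challenge failed: \(errno (\d+), ([^)]*)\)")
RETRY = re.compile(r"Retrying to establish the ACLK connection in ([\d.]+) seconds")


def summarize_aclk_log(log_text):
    """Return (Counter keyed by (errno, message), list of retry delays in seconds)."""
    failures = Counter()
    delays = []
    for line in log_text.splitlines():
        if "ACLK_Main" not in line:
            continue  # only look at the ACLK main thread
        failure = FAILURE.search(line)
        if failure:
            failures[(int(failure.group(1)), failure.group(2))] += 1
            continue
        retry = RETRY.search(line)
        if retry:
            delays.append(float(retry.group(1)))
    return failures, delays


if __name__ == "__main__":
    sample = "\n".join([
        "2021-06-16 04:44:54: netdata ERROR : ACLK_Main : Challenge failed: (errno 99, Cannot assign requested address)",
        "2021-06-16 04:44:54: netdata INFO : ACLK_Main : Retrying to establish the ACLK connection in 1.253 seconds",
    ])
    failures, delays = summarize_aclk_log(sample)
    for (code, message), count in failures.items():
        print(f"errno {code} ({message}): {count} time(s)")
    print("retry delays (s):", delays)
```

To run it against a real file, read the log text first (for example with pathlib.Path(...).read_text()) and pass it to the function; the log location depends on the installation.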

Restarting the service does not fix the issue.

Could this have been triggered by NetData’s auto-update procedure? I cannot think of anything else that has changed since fixing it with NG.

EDIT1: I think I have found the cause… The auto-update, as it was installed, probably messed with all of my systems, and now the API reports _aclk_impl “Legacy”.
Could that be the case? Does the auto-update function not know about the “ACLK-NG” implementation, so that it is not set accordingly on installation? Where do I have to set it specifically to keep this from happening again?

EDIT2: /etc/netdata/.environment has all flags correctly set for ACLK-NG

Environment File

cat /etc/netdata/.environment
# Created by installer
PATH="/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"
CFLAGS="-O2"
LDFLAGS=""
NETDATA_TMPDIR="/tmp"
NETDATA_PREFIX=""
NETDATA_CONFIGURE_OPTIONS=" --with-bundled-lws --with-bundled-libJudy"
NETDATA_ADDED_TO_GROUPS=" docker adm nobody"
INSTALL_UID="0"
NETDATA_GROUP="netdata"
REINSTALL_OPTIONS="--auto-update --aclk-ng "
RELEASE_CHANNEL="nightly"
IS_NETDATA_STATIC_BINARY="no"
NETDATA_LIB_DIR="/var/lib/netdata"

and it should have been read successfully by netdata-updater -> /usr/libexec/netdata/netdata-updater.sh, located in /etc/cron.daily

@idet2 your last log shows that the ACLK-NG HTTP client and the libwebsockets (external lib) client indeed have the same issue in this case.

They cannot open an outgoing TCP/IP socket to the cloud because Linux returns error code 99.

When I look online for why we could get this error code on an outgoing TCP connection, the answer I keep finding over and over is ephemeral port exhaustion. Could it be that under certain conditions you run out of ephemeral ports? It is strange for this to work for 2 days and then stop all of a sudden.
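
A rough way to check the ephemeral port theory on the affected host is sketched below in Python. It assumes a Linux system where the range is exposed in /proc/sys/net/ipv4/ip_local_port_range and IPv4 TCP sockets are listed in /proc/net/tcp; the helper names are made up for illustration. It only counts distinct local IPv4 TCP ports inside the range, so treat the result as an indicator, not as proof either way.

```python
from pathlib import Path

PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")
TCP_TABLE_FILE = Path("/proc/net/tcp")  # IPv4 TCP sockets only


def ephemeral_port_range(range_text):
    """Parse the 'low high' pair from ip_local_port_range."""
    low, high = (int(part) for part in range_text.split())
    return low, high


def local_ports(tcp_table_text):
    """Extract local port numbers from /proc/net/tcp style text."""
    ports = []
    for line in tcp_table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[1] looks like "0100007F:0035" (hex address:hex port)
        ports.append(int(fields[1].rsplit(":", 1)[1], 16))
    return ports


def ephemeral_ports_in_use(range_text, tcp_table_text):
    """Return (distinct local ports inside the range, size of the range)."""
    low, high = ephemeral_port_range(range_text)
    used = {port for port in local_ports(tcp_table_text) if low <= port <= high}
    return len(used), high - low + 1


if __name__ == "__main__":
    if PORT_RANGE_FILE.exists() and TCP_TABLE_FILE.exists():
        used, total = ephemeral_ports_in_use(
            PORT_RANGE_FILE.read_text(), TCP_TABLE_FILE.read_text()
        )
        print(f"{used} of {total} ephemeral ports appear in the IPv4 TCP table")
    else:
        print("procfs files not found; this sketch only works on Linux")
```

If the first number is nowhere near the second, port exhaustion becomes a less likely explanation for errno 99.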

You mentioned the following in a previous post (and it sounds like something to look into here):

The only thing that I can think of that differentiates them from a brand new system is that a few containers are running (around 3-5 per system) with extra isolation networks, bridge interfaces and their corresponding firewall rules, if that makes any difference at all.

The fact that 2 completely separate HTTP clients fail:
- one written by me from scratch (in ACLK-NG), using only OpenSSL as a dependency
- a second one written by somebody else (in ACLK Legacy), in an external library called libwebsockets

and that both fail at opening an outgoing socket with the same error message points to some common underlying issue here.

EDIT: @idet2 noticed there was an accidental revert to ACLK Legacy after the auto-updater ran. Therefore the comments above are invalid (as they were based on wrong assumptions).

@underhood: It seems that the auto-update reverted back to ACLK Legacy.

See “EDITs” in previous post!

@idet2 thanks for spotting that. It would have led us in the wrong direction…

To answer your question: we changed Netdata so that both ACLKs can be compiled in at the same time. Which one gets used is now a configuration setting.

Please post the result of netdata -W buildinfo, which will tell you which ACLKs are available in your agents.
Here is a piece that updates the documentation with the new info:

Basically, if netdata -W buildinfo tells you that you have aclk-ng available, you can choose it by editing the aclk implementation key in the [cloud] section of netdata.conf.

In future we will be removing ACLK Legacy completely.

Definitely!

Here is the result:

Build Info

netdata -W buildinfo

Version: netdata v1.31.0-61-nightly
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws' '--with-bundled-libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK Legacy: YES
TLS Host Verification: YES
Libraries:
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
LWS: YES static v3.2.2
mosquitto: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO

Indeed it seems that I have both ACLK versions available.
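
If you need to run the same check on several hosts, a small Python sketch like the following can pull the ACLK lines out of the buildinfo text. It assumes the "Name: YES/NO" layout shown in the output above; aclk_features is a made-up helper name, and which stream the agent prints buildinfo to is an assumption, so the example reads both.

```python
import shutil
import subprocess


def aclk_features(buildinfo_text):
    """Map each 'ACLK ...' feature line to True (YES) or False (anything else)."""
    features = {}
    for line in buildinfo_text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name.startswith("ACLK"):
            features[name] = value.strip().upper().startswith("YES")
    return features


if __name__ == "__main__":
    if shutil.which("netdata"):
        result = subprocess.run(
            ["netdata", "-W", "buildinfo"],
            capture_output=True,
            text=True,
            check=False,
        )
        # Read stdout and stderr together, since the output stream is assumed.
        print(aclk_features(result.stdout + result.stderr))
    else:
        print("netdata binary not found in PATH")
```

With the output posted above, both "ACLK Next Generation" and "ACLK Legacy" would come back as available.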

So what should I set in netdata.conf?

EDIT: I can’t seem to find anything about it in the Config Documentation

In the [cloud] section of netdata.conf, set the following key (or add it):

  aclk implementation = ng
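
To double-check that the key ended up in the right section after an update, a minimal Python sketch such as this one can read it back. The path /etc/netdata/netdata.conf is an assumption (it depends on the install prefix), and conf_value is a made-up helper, not a Netdata tool. It parses the file by hand and skips lines starting with #.

```python
from pathlib import Path

CONF_PATH = Path("/etc/netdata/netdata.conf")  # assumed location, adjust to your install


def conf_value(conf_text, section, key):
    """Return the value of `key` inside `[section]`, or None if it is not set."""
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out defaults
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        name, sep, value = line.partition("=")
        if sep and current == section and name.strip() == key:
            return value.strip()
    return None


if __name__ == "__main__":
    if CONF_PATH.exists():
        value = conf_value(CONF_PATH.read_text(), "cloud", "aclk implementation")
        print("aclk implementation =", value)
    else:
        print(f"{CONF_PATH} not found")
```

A result of None means the key is absent or commented out in the [cloud] section.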

OK! That did the trick! My hosts are once again up and running with ACLK-NG.

Thanks a lot @underhood

@underhood I use the kickstart-static64.sh script on my systems, but it seems the --aclk-ng flag is not an install option. Are there plans to add this to the static64.sh script soon? I’m currently running v1.31.0.

@Ryan_S_Di_Francesco
I am afraid the current release binary packages do not provide ACLK-NG just yet.
Current nightlies already make it available for binary packages as well. In the next release we plan to make ACLK-NG the default implementation. ACLK Legacy is to be obsoleted and removed as soon as ACLK-NG is a bit more battle tested.

The best way to see what your netdata does and doesn’t support is to run netdata -W buildinfo.

@underhood Our installations are currently configured for stable releases. Is there a way I can execute the static64.sh script with the --reinstall flag and have it switch to nightlies and also include the --aclk-ng flag?