Nodes unreachable (errno 99, Cannot assign requested address)

Hi again.

No proxy, direct connection!

The strange thing is that it used to work, and I can’t really figure out what the problem is now.

Baffled here as well. Currently, I do not have any leads as to what might be the issue.
There are still the options mentioned previously:

  • using the system LWS (the CentOS EPEL repo seems to be at LWS 4.0.1) to see if you might be hitting some bug in LWS that has since been fixed (we use LWS as the HTTPS client for ACLK Legacy)
  • trying ACLK-NG, which is a new and complete implementation of ACLK that works without LWS altogether
  • downgrading to an older version → proof that the update actually broke it and not something else that happened around the same time (system update etc.)

@idet2, @Ryan_S_Di_Francesco are you using IPv6? Noticed this (netdata-installer.sh: Enable IPv6 support in libwebsockets by pjakuszew · Pull Request #11080 · netdata/netdata · GitHub) got merged into the new release.

Hi! Not really. Although IPv6 is available at both the kernel and network levels, all NICs are configured with and use IPv4.

I also have to apologise for not replying earlier to your previous messages; unfortunately another, more important matter came up and I had to deal with it over the last few days. I will spend some time on this issue in the coming days.

Meanwhile, if you can share any links/hints for the “downgrade” procedure and the usage of ACLK-NG, that would be great!

Thanks!

@underhood like @idet2 IPv6 is available but NICs use only IPv4.

@underhood : Think I have found the problem… but would like confirmation from @Ryan_S_Di_Francesco as well.

So, doing some packet inspection, I have seen that the service tries to connect to the IP 34.117.103.250, which in turn doesn’t give me back the “challenge” with a “curl GET” but instead returns: “response 404 (backend NotFound), service rules for the path non-existent”.

An “nslookup” from my hosts, as well as from different providers, consistently showed that “app.netdata.cloud” resolves to “35.196.244.138”.

I don’t know why it tries to connect to the wrong IP, whether that is a legitimate Netdata address and not something that my provider (Hetzner) routes the traffic to, or whether there is indeed a problem with that IP…
Would like some info from @underhood and your team that you may contact internally (unfortunately as a new user cannot mention more than 2 users in this post so that all are aware of this, like “contributors”, “agent-maintainers” etc.). Hence, is this correct behavior? Is it the correct IP? Is it up and running, if legitimate?

Therefore, I went to “/etc/hosts” and manually added the “35.196.244.138” as “app.netdata.cloud”. Restarted the agent and voila!!! Netdata is back online immediately :-).
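
For anyone who wants to repeat this check, here is a minimal Python sketch that asks the system resolver (the same path that honours “/etc/hosts”) for the IPv4 addresses of a hostname and compares them with the address you expect. The helper names `resolve_ipv4`, `classify` and `report` are made up for illustration, and 35.196.244.138 is only the address reported by nslookup in this thread, not a guaranteed or permanent Netdata address.

```python
import socket


def resolve_ipv4(hostname, port=443):
    """Return the sorted IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def classify(resolved, expected):
    """Split resolved addresses into (matching, unexpected) relative to expected."""
    expected = set(expected)
    matching = [ip for ip in resolved if ip in expected]
    unexpected = [ip for ip in resolved if ip not in expected]
    return matching, unexpected


def report(hostname, expected):
    """Print what the resolver returns and flag addresses outside `expected`."""
    matching, unexpected = classify(resolve_ipv4(hostname), expected)
    print(f"{hostname}: expected={matching} unexpected={unexpected}")
    return not unexpected
```

Running `report("app.netdata.cloud", ["35.196.244.138"])` with and without the “/etc/hosts” entry should show whether the resolver itself ever hands out a different address. Note that this only covers the system resolver; it says nothing about how the agent or LWS picks its target.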

Can @Ryan_S_Di_Francesco verify the above solution from your end? Are you using the same provider (Hetzner) or are you somewhere else?

Please let me know what you think about this, since it is obviously not the “prettiest” solution to have a static IP in “/etc/hosts” for “app.netdata.cloud”.

Looking forward to your confirmations and answers!

EDIT: After removing the “/etc/hosts” entry, the host is once again unreachable!!!

Moreover here are two traceroutes:

  • Set in “/etc/hosts”
Traceroute with IP set in /etc/hosts

$ traceroute app.netdata.cloud
traceroute to app.netdata.cloud (35.196.244.138), 30 hops max, 60 byte packets
1 _gateway (172.31.1.1) 8.399 ms 8.371 ms 8.252 ms
2 10928_your-cloud_host (95.216.129.168) 2.133 ms 1.995 ms 1.794 ms
3 * * *
4 * spine2_cloud1_hel1_hetzner_com (88.198.252.57) 0.956 ms *
5 static_88-198-242-249_clients_your-server_de (88.198.242.249) 1.221 ms static_88-198-245-253_clients_your-server_de (88.198.245.253) 1.027 ms static_88-198-242-249_clients_your-server_de (88.198.242.249) 0.881 ms
6 core3_sto_hetzner_com (213.239.252.102) 6.087 ms juniper3_dc2_nbg1_hetzner_com (213.239.245.70) 6.906 ms 6.783 ms
7 142.250.161.204 (142.250.161.204) 8.443 ms 7.281 ms 7.700 ms
8 * * *
9 108.170.253.161 (108.170.253.161) 9.578 ms 209.85.242.98 (209.85.242.98) 7.420 ms 108.170.254.49 (108.170.254.49) 8.908 ms
10 108.170.254.38 (108.170.254.38) 7.678 ms 108.170.253.165 (108.170.253.165) 8.678 ms 108.170.253.166 (108.170.253.166) 8.135 ms
11 142.250.235.221 (142.250.235.221) 8.676 ms 209.85.241.45 (209.85.241.45) 8.839 ms 72.14.234.107 (72.14.234.107) 9.878 ms
12 142.251.51.185 (142.251.51.185) 23.594 ms 23.988 ms 22.597 ms
13 216.239.46.240 (216.239.46.240) 43.128 ms 42.760 ms 216.239.47.206 (216.239.47.206) 33.749 ms
14 * * 72.14.237.252 (72.14.237.252) 37.402 ms
15 142.250.238.95 (142.250.238.95) 106.394 ms 142.250.238.101 (142.250.238.10 142.250.238.95 (142.250.238.95) 105.977 ms
16 209.85.254.106 (209.85.254.106) 117.444 ms 116.386 ms 117.210 ms
17 209.85.247.210 (209.85.247.210) 116.420 ms 72.14.234.70 (72.14.234.70) 115.642 ms 216.239.42.63 (216.239.42.63) 115.265 ms
18 216.239.59.91 (216.239.59.91) 117.328 ms 142.250.57.49 (142.250.57.49) 114.901 ms 216.239.47.27 (216.239.47.27) 115.415 ms
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 app.netdata.cloud (35.196.244.138) 115.278 ms * 114.702 ms

  • NOT set in “/etc/hosts”
Traceroute with IP **NOT** set in /etc/hosts

$ traceroute app.netdata.cloud
traceroute to app_netdata_cloud (35.196.244.138), 30 hops max, 60 byte packets
1 _gateway (172.31.1.1) 14.635 ms 14.653 ms 14.587 ms
2 10928_your-cloud_host (95.216.129.168) 1.583 ms 1.506 ms 1.402 ms
3 * * *
4 spine2_cloud1_hel1_hetzner_com (88.198.252.57) 2.371 ms spine1_cloud1_hel1_hetzner_com (88.198.244.249) 1.261 ms 1.145 ms
5 core31_hel1_hetzner_com (88.198.249.89) 0.483 ms core32_hel1_hetzner_com (88.198.249.93) 0.932 ms static_88-198-245-253_clients_your-server_de (88.198.245.253) 0.840 ms
6 juniper3_dc2_nbg1_hetzner_com (213.239.245.70) 7.245 ms core3_sto_hetzner_com (213.239.252.102) 6.089 ms core3_sto_hetzner_com (213.239.224.17) 7.036 ms
7 142.250.161.204 (142.250.161.204) 7.435 ms 8.307 ms 8.381 ms
8 * * *
9 108.170.254.49 (108.170.254.49) 9.472 ms 209.85.242.98 (209.85.242.98) 8.303 ms 108.170.254.33 (108.170.254.33) 8.302 ms
10 108.170.254.54 (108.170.254.54) 8.588 ms 108.170.254.50 (108.170.254.50) 8.610 ms 108.170.254.54 (108.170.254.54) 7.165 ms
11 108.170.234.91 (108.170.234.91) 7.401 ms 209.85.241.45 (209.85.241.45) 8.610 ms 8.425 ms
12 142.251.51.185 (142.251.51.185) 24.400 ms 24.102 ms 23.923 ms
13 216.239.47.198 (216.239.47.198) 34.600 ms 216.239.47.206 (216.239.47.206) 34.587 ms 216.239.47.198 (216.239.47.198) 33.430 ms
14 72.14.233.189 (72.14.233.189) 37.244 ms 72.14.237.252 (72.14.237.252) 36.919 ms 72.14.233.189 (72.14.233.189) 35.686 ms
15 142.250.238.95 (142.250.238.95) 105.228 ms 209.85.245.57 (209.85.245.57) 104.780 ms 142.250.238.101 (142.250.238.101) 105.761 ms
16 209.85.254.106 (209.85.254.106) 117.627 ms 142.250.209.72 (142.250.209.72) 116.052 ms 115.832 ms
17 216.239.42.63 (216.239.42.63) 115.442 ms 209.85.244.14 (209.85.244.14) 116.146 ms 74.125.252.140 (74.125.252.140) 114.921 ms
18 142.250.60.43 (142.250.60.43) 115.223 ms 172.253.50.11 (172.253.50.11) 115.548 ms 172.253.66.177 (172.253.66.177) 117.000 ms
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 138.244.196.35_bc_googleusercontent_com (35.196.244.138) 115.331 ms 115.100 ms 115.720 ms

EDIT: In the above “traceroute” results I have replaced all dots “.” in domain names with underscores “_” in order to work around the “5 link restriction” for new users.

@idet2 @underhood unfortunately for me, adding the entry to /etc/hosts does not restore connectivity to netdata cloud. I’ve attempted this workaround on both RHEL 6 & 7 without luck. Oddly, with that entry in the hosts file, I’m still seeing connectivity to the alleged “incorrect” IP:

netstat -an | grep 34.117.103.250
tcp        0      0 x.x.x.x:53786        34.117.103.250:443          TIME_WAIT

Yet I see no connectivity to the correct IP:

netstat -an | grep 35.196.244.138
<empty>
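
A side note on the two greps above: `grep 34.117.103.250` treats each dot as “any character” and matches any column, so it can in principle hit unrelated rows. A small Python sketch that filters only on the remote-address column is shown below. The function name `connections_to` is hypothetical, and it assumes the six-column TCP layout of `netstat -an` seen in this post.

```python
def connections_to(netstat_output, remote_ip):
    """Return (local, remote, state) for TCP rows whose remote host is remote_ip."""
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # Drop the trailing ":port" and compare the host part exactly.
        host = remote.rsplit(":", 1)[0]
        if host == remote_ip:
            matches.append((local, remote, state))
    return matches
```

Feeding it the captured output of `netstat -an` for each of the two addresses gives the same picture as the greps, just without false positives.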

@Ryan_S_Di_Francesco : Did you restart the agents after adding the entry to “/etc/hosts”?
Can you do a “traceroute” with the entry and without to make sure it has been correctly set?

EDIT: Maybe you should also flush any DNS records etc.

Are you on the same provider as I am (Hetzner)?

@idet2 sorry for the late reply, I am a bit ill this week. Thank you for your analysis. I will try to ask around about what the issue might be. @Austin_Hemmelgarn do you have any ideas? It seems that LWS on the user’s system resolves Netdata Cloud to a different IP than nslookup does?

@OdysLam Can you please lift the following restrictions for @idet2? He is very helpful in trying to debug the issue and has to do workarounds due to those restrictions.

  • “5 link restriction” as a new user (I think it considers IPs in the log Traceroute with IP **NOT** set in /etc/hosts as links or something)
  • Maybe also: Would like some info from @underhood and your team that you may contact internally (unfortunately as a new user cannot mention more than 2 user

I will contact people here internally (edited for clarity) so maybe the 2nd one is not that important

Thank you @underhood … No worries !

I am not sure if the problem comes from LWS, since “traceroute” seems to go there as well (when no “hosts” entry is specified). I think the problem might be either with your service at the specific host or some other kind of network failure. I assume that you are using GCP and have some kind of balancer or something else to properly redirect traffic to the closest locations, but something seems broken (at least from my point of view).

If this is the case, can you check internally that all of this works as it is supposed to?

Thanks!

EDIT: Wishing you a speedy recovery

I asked for help from the Cloud team internally on Slack (with a link to this issue). We will see if they have any ideas :slight_smile:
Hopefully we can get to the bottom of this

Excellent! Let me know if there is anything else I can do…

@idet2 yes, I restarted the agents after adding the entry. I did the traceroute, and the end results with and without the /etc/hosts entry are the same as yours:

without the hosts entry:

# traceroute app.netdata.cloud
traceroute to app.netdata.cloud (35.196.244.138), 30 hops max, 60 byte packets
...
26  138.244.196.35.bc.googleusercontent.com (35.196.244.138)  18.198 ms * *

with the hosts entry:

# traceroute app.netdata.cloud
...
28  app.netdata.cloud (35.196.244.138)  18.510 ms  18.602 ms  18.561 ms

DNS should not need to be flushed since, by default, the hosts file is consulted first before a request gets referred to DNS.
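
To double-check that the entry is actually picked up, here is a rough Python sketch of a hosts-file lookup. It assumes the common `hosts: files dns` order in /etc/nsswitch.conf (if that order is different on a system, the hosts file is not necessarily consulted first), and `lookup_hosts` is just an illustrative name, not part of any tool.

```python
def lookup_hosts(hosts_text, name):
    """Return the first address that a hosts-file text maps `name` to, or None."""
    wanted = name.lower()
    for line in hosts_text.splitlines():
        # Drop comments, then split into: address, canonical name, aliases...
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and wanted in (alias.lower() for alias in entry[1:]):
            return entry[0]
    return None


def lookup_hosts_file(name, path="/etc/hosts"):
    """Same lookup, reading the hosts file from disk."""
    with open(path, encoding="utf-8") as handle:
        return lookup_hosts(handle.read(), name)
```

If `lookup_hosts_file("app.netdata.cloud")` returns the address that was added but the agent still connects elsewhere, the hosts entry itself is fine and the difference has to come from somewhere else.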

I’m at a university in New York City with Nysernet as our ISP.

@Ryan_S_Di_Francesco : I am really sorry that the “/etc/hosts” solution didn’t solve your problem as it did mine.
Do you still get the same ACLK error?
Another thing that you could try is to remove the nodes and re-claim them, since I did that multiple times in my attempts to solve the problem.
By the way, have you tried the “curl GET” command to check that you actually get results when doing it manually? For me this worked even without the “/etc/hosts” addition.
One more thing I did before the “/etc/hosts” addition was to update the agent to:

NetData Info

netdata -W buildinfo

Version: netdata v1.31.0-41-nightly
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws' '--with-bundled-libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
    dbengine:                YES
    Native HTTPS:            YES
    Netdata Cloud:           YES
    Cloud Implementation:    Legacy
    TLS Host Verification:   YES
Libraries:
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

I am putting all my actions here in case you need them to reproduce the problem exactly.

In any case, the way I see it, this has to do with Netdata’s infrastructure not functioning correctly. Let’s see what news @underhood brings from their team.

@idet2 can you provide the contents of /var/lib/netdata/cloud.d/cloud.conf ?

Thanks

@Ryan_S_Di_Francesco can you share the output of strace -e network nc app.netdata.cloud 443

Thanks

Sure yes @Konstantinos_Natsaki ! Although there isn’t much as you can see :slight_smile:

# cat cloud.conf
[global]
  enabled = yes
  cloud base url = https://app.netdata.cloud
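
In case it helps others compare their setup, the hostname the agent is configured to contact can be pulled out of this file with a few lines of Python. This is only a sketch: `cloud_hostname` is a made-up helper, and it assumes the file keeps the simple INI layout shown above with a `[global]` section and a `cloud base url` key.

```python
import configparser
from urllib.parse import urlsplit


def cloud_hostname(conf_text):
    """Return the hostname from 'cloud base url' in the [global] section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    url = parser.get("global", "cloud base url")
    return urlsplit(url).hostname


def cloud_hostname_from_file(path="/var/lib/netdata/cloud.d/cloud.conf"):
    """Same as cloud_hostname, reading the configuration from disk."""
    with open(path, encoding="utf-8") as handle:
        return cloud_hostname(handle.read())
```

The returned name is the one worth checking with nslookup, traceroute and “/etc/hosts”, since that is what the configuration points at.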

@Konstantinos_Natsaki here’s the requested output:

$ strace -e network nc app.netdata.cloud 443
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("35.196.244.138")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
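
For readers following along: the last two lines of the trace are the usual non-blocking connect pattern. `connect()` returns EINPROGRESS, and a later `getsockopt(SO_ERROR)` returning 0 means the TCP connection to 35.196.244.138:443 was established. A minimal Python sketch of the same pattern is below; `nonblocking_connect` is a hypothetical helper, and it only tests TCP reachability, not TLS or anything ACLK-specific.

```python
import errno
import select
import socket


def nonblocking_connect(ip, port, timeout=5.0):
    """Try a non-blocking TCP connect; return 0 on success or an errno value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((ip, port))
        if rc not in (0, errno.EINPROGRESS):
            return rc  # failed immediately, e.g. ECONNREFUSED
        # Wait until the socket becomes writable, i.e. the connect finished.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # SO_ERROR holds the final result of the connect: 0 means connected.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

Calling `nonblocking_connect("35.196.244.138", 443)` and `nonblocking_connect("34.117.103.250", 443)` from an affected host would show whether both addresses at least accept TCP connections.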