Online Nodes showing as offline (IPv6 issue)

Some of our nodes are dropping from Netdata Cloud. It looks like it might be an issue with the Amazon entry point:

 tail -f /var/log/netdata/access.log
time=2024-09-17T10:02:46.083+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:08:26.309+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 26 entries, queued 0"
time=2024-09-17T10:08:36.324+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:09:21.347+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
time=2024-09-17T10:09:46.371+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:37:15.255+00:00 comm=netdata source=access level=warning tid=3870312 thread=ACLK_MAIN msg="ACLK DISCONNECTED"
time=2024-09-17T10:40:00.592+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
time=2024-09-17T10:40:10.679+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 158 entries, queued 0"
time=2024-09-17T10:40:15.681+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 4 entries, queued 0"
time=2024-09-17T10:40:20.686+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
Sep 17 10:42:54 storageFSN01 netdata[3900930]: Failed to Get ACLK environment (cannot contact ENV endpoint)
Sep 17 10:43:58 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e02::80', port '443'.
Sep 17 10:44:59 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e00::80', port '443'.
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e01::80', port '443'.
Sep 17 10:46:01 storageFSN01 netdata[3900930]: request timed out
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Poll timed out
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Couldn't write HTTP request header into SSL connection
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Couldn't process request
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Error trying to contact env endpoint
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Failed to Get ACLK environment (cannot contact ENV endpoint)
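
For reference, a quick way to check what the host resolves for the Cloud endpoint and whether it has an IPv6 route to one of the failing addresses (an added suggestion, assuming getent and iproute2 are available; the address is taken from the log above):

$ getent ahosts app.netdata.cloud            # lists the IPv4/IPv6 addresses the host resolves
$ ip -6 route get 2600:1f18:428d:5e02::80    # shows whether an IPv6 route to the failing address exists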

We also currently support IPv6 connectivity to Netdata Cloud. Our internal metrics do not show any connectivity issues from before or after IPv6 was enabled.

Do you have any issues connecting over IPv6 to any other services?

We haven’t had issues with other services, but we disabled IPv6 to resolve this.
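
For context, a common way to disable IPv6 system-wide on Linux is via sysctl (this is an assumption about how it was done here, shown only for reference):

$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1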

We are looking to prevent this from happening without user intervention. Do you indeed have IPv6 where the Agent is running? Exactly what steps did you take to make it work for you?

I can confirm the same problem, currently only with Netdata in Docker.

Can you (both) confirm the Agent versions you are seeing this with? In 1.47 and later we landed a fix for a bug where the MQTT connection could not be established over IPv6 after authentication. If you are on an older version, you may be hitting that bug.

If not, could you please provide me your Space ID in a DM?

$ docker compose exec -T netdata netdata -V
netdata v1.47.1
netdata  | time=2024-09-19T15:34:59.070+00:00 comm=netdata source=daemon level=error errno="110, Connection timed out" tid=2265763 thread=ACLK_MAIN dst_ip=2600:1f18:428d:5e02::80 dst_port=443 msg="Timed out while connecting to '2600:1f18:428d:5e02::80', port '443'."

Sending the Space ID via DM.

Hello,

Can you make an HTTP request using e.g. curl inside the Docker container and send us the output?

$ curl -Iv https://app.netdata.cloud
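
To isolate the address family, it may also help to force IPv6 and IPv4 separately with curl's -6 and -4 flags (just a suggestion, using the same container as above):

$ docker compose exec -T netdata curl -6 -Iv https://app.netdata.cloud   # IPv6 only
$ docker compose exec -T netdata curl -4 -Iv https://app.netdata.cloud   # IPv4 only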

I think I’ve found the reason: it is the Hetzner Firewall. If it is enabled for IPv6, it blocks that traffic when no rules are specified for IPv6.
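
For anyone hitting the same thing, a minimal sketch of adding an explicit IPv6 rule, assuming a Hetzner Cloud firewall managed with the hcloud CLI (the firewall name is a placeholder; on dedicated servers the Robot firewall is configured through the web UI instead, and the exact flags may differ by CLI version):

$ hcloud firewall add-rule my-firewall --direction out --protocol tcp --port 443 --destination-ips ::/0   # allow outbound HTTPS to any IPv6 destination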

$ docker compose exec -T netdata curl -Iv https://app.netdata.cloud
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying [2600:1f18:428d:5e01::80]:443...
*   Trying 44.196.50.41:443...
* Connected to app.netdata.cloud (44.196.50.41) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2680 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=app.netdata.cloud
*  start date: Sep 13 21:54:11 2024 GMT
*  expire date: Dec 12 21:54:10 2024 GMT
*  subjectAltName: host "app.netdata.cloud" matched cert's "app.netdata.cloud"
*  issuer: C=US; O=Let's Encrypt; CN=R10
*  SSL certificate verify ok.
} [5 bytes data]
* using HTTP/2
* h2h3 [:method: HEAD]
* h2h3 [:path: /]
* h2h3 [:scheme: https]
* h2h3 [:authority: app.netdata.cloud]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x55714fffcce0)
} [5 bytes data]
> HEAD / HTTP/2
> Host: app.netdata.cloud
> user-agent: curl/7.88.1
> accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/2 200
< accept-ranges: bytes
< access-control-allow-credentials: true
< cache-control: no-cache
< content-length: 2912
< content-type: text/html
< date: Fri, 20 Sep 2024 10:55:27 GMT
< etag: "66eb06a8-b60"
< expires: Thu, 01 Jan 1970 00:00:01 GMT
< last-modified: Wed, 18 Sep 2024 16:58:16 GMT
< server: nginx
< vary: Accept-Encoding
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
<
  0  2912    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host app.netdata.cloud left intact
HTTP/2 200
accept-ranges: bytes
access-control-allow-credentials: true
cache-control: no-cache
content-length: 2912
content-type: text/html
date: Fri, 20 Sep 2024 10:55:27 GMT
etag: "66eb06a8-b60"
expires: Thu, 01 Jan 1970 00:00:01 GMT
last-modified: Wed, 18 Sep 2024 16:58:16 GMT
server: nginx
vary: Accept-Encoding
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN

The other question is: why doesn’t Netdata try to fall back to IPv4 when IPv6 is not accessible?

There are some improvements:

I’m using the stable branch of Netdata, so if these changes are not in stable yet, they haven’t reached my nodes. Great to know you’ve already fixed this for future releases. Thank you!