Slind14
September 17, 2024, 10:49am
Some of our nodes are dropping from Netdata Cloud. It looks like it might be an issue with the Amazon entry point:
tail -f /var/log/netdata/access.log
time=2024-09-17T10:02:46.083+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:08:26.309+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 26 entries, queued 0"
time=2024-09-17T10:08:36.324+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:09:21.347+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
time=2024-09-17T10:09:46.371+00:00 comm=netdata source=access level=notice tid=3870306 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 1"
time=2024-09-17T10:37:15.255+00:00 comm=netdata source=access level=warning tid=3870312 thread=ACLK_MAIN msg="ACLK DISCONNECTED"
time=2024-09-17T10:40:00.592+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
time=2024-09-17T10:40:10.679+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 158 entries, queued 0"
time=2024-09-17T10:40:15.681+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 4 entries, queued 0"
time=2024-09-17T10:40:20.686+00:00 comm=netdata source=access level=notice tid=3901263 thread=HEALTH msg="ACLK STA [*************** (N/A)]: Processed 1 entries, queued 0"
Sep 17 10:42:54 storageFSN01 netdata[3900930]: Failed to Get ACLK environment (cannot contact ENV endpoint)
Sep 17 10:43:58 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e02::80', port '443'.
Sep 17 10:44:59 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e00::80', port '443'.
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Timed out while connecting to '2600:1f18:428d:5e01::80', port '443'.
Sep 17 10:46:01 storageFSN01 netdata[3900930]: request timed out
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Poll timed out
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Couldn't write HTTP request header into SSL connection
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Couldn't process request
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Error trying to contact env endpoint
Sep 17 10:46:01 storageFSN01 netdata[3900930]: Failed to Get ACLK environment (cannot contact ENV endpoint)
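The addresses that time out here are IPv6, which suggests an IPv6 egress problem rather than a problem with the endpoint itself. A quick way to confirm this from the affected host is to force each address family separately with curl (a rough sketch; assumes curl is installed on the host):
$ # force IPv6: expected to time out if IPv6 egress is broken
$ curl -6 --connect-timeout 10 -sI https://app.netdata.cloud
$ # force IPv4: expected to succeed if only IPv6 is affected
$ curl -4 --connect-timeout 10 -sI https://app.netdata.cloud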
We currently also support IPv6 connectivity to Netdata Cloud. Our internal metrics did not show any connectivity issues before or after enabling IPv6.
Do you have any issues connecting to any other services over IPv6?
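(A simple way to check general IPv6 egress, as a sketch: force IPv6 to a couple of well-known dual-stack services and compare DNS resolution per address family. The target hosts here are just examples:)
$ curl -6 --connect-timeout 10 -sI https://www.google.com
$ curl -6 --connect-timeout 10 -sI https://one.one.one.one
$ # compare what app.netdata.cloud resolves to for each address family
$ getent ahostsv6 app.netdata.cloud
$ getent ahostsv4 app.netdata.cloud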
Slind14
September 17, 2024, 8:29pm
We haven’t had issues with other services, but we disabled IPv6 to resolve this.
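(For reference, one common way to disable IPv6 on a Linux host is via sysctl; this is an assumption about the mechanism, not a record of what was actually done here:)
$ # disable IPv6 on all interfaces at runtime (not persistent across reboots)
$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1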
ralphm
September 18, 2024, 7:51am
We are looking to prevent this from happening without user intervention. Do you indeed have IPv6 where the Agent is running? What exact steps did you take to make it work?
I can confirm the same problem, currently with Netdata in Docker only.
ralphm
September 18, 2024, 2:05pm
Can you (both) confirm the Agent versions you are seeing this with? We landed a fix in 1.47 and up for the inability to establish the MQTT connection over IPv6, after authentication. If you have an older version, you may be hitting that bug.
If not, could you please provide me your Space ID in a DM?
$ docker compose exec -T netdata netdata -V
netdata v1.47.1
netdata | time=2024-09-19T15:34:59.070+00:00 comm=netdata source=daemon level=error errno="110, Connection timed out" tid=2265763 thread=ACLK_MAIN dst_ip=2600:1f18:428d:5e02::80 dst_port=443 msg="Timed out while connecting to '2600:1f18:428d:5e02::80', port '443'."
Sending the Space ID via DM.
Hello,
Can you make an HTTP request using e.g. curl inside the Docker container and send us the output?
$ curl -Iv https://app.netdata.cloud
I think I’ve found the reason: it is the Hetzner Firewall. If it is enabled for IPv6, it blocks IPv6 traffic when no rules are specified for IPv6.
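(One way to confirm a firewall drop from the host side, as a hedged sketch: watch for outgoing IPv6 SYNs that never get an answer. Assumes tcpdump is installed and eth0 is the external interface:)
$ sudo tcpdump -ni eth0 'ip6 and tcp port 443'
$ # in a second shell, trigger a connection attempt over IPv6:
$ curl -6 --connect-timeout 10 -sI https://app.netdata.cloud
$ # outgoing [S] packets with no matching [S.] replies indicate the traffic is dropped in transit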
$ docker compose exec -T netdata curl -Iv https://app.netdata.cloud
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying [2600:1f18:428d:5e01::80]:443...
* Trying 44.196.50.41:443...
* Connected to app.netdata.cloud (44.196.50.41) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2680 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
* subject: CN=app.netdata.cloud
* start date: Sep 13 21:54:11 2024 GMT
* expire date: Dec 12 21:54:10 2024 GMT
* subjectAltName: host "app.netdata.cloud" matched cert's "app.netdata.cloud"
* issuer: C=US; O=Let's Encrypt; CN=R10
* SSL certificate verify ok.
} [5 bytes data]
* using HTTP/2
* h2h3 [:method: HEAD]
* h2h3 [:path: /]
* h2h3 [:scheme: https]
* h2h3 [:authority: app.netdata.cloud]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x55714fffcce0)
} [5 bytes data]
> HEAD / HTTP/2
> Host: app.netdata.cloud
> user-agent: curl/7.88.1
> accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/2 200
< accept-ranges: bytes
< access-control-allow-credentials: true
< cache-control: no-cache
< content-length: 2912
< content-type: text/html
< date: Fri, 20 Sep 2024 10:55:27 GMT
< etag: "66eb06a8-b60"
< expires: Thu, 01 Jan 1970 00:00:01 GMT
< last-modified: Wed, 18 Sep 2024 16:58:16 GMT
< server: nginx
< vary: Accept-Encoding
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
<
0 2912 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host app.netdata.cloud left intact
HTTP/2 200
accept-ranges: bytes
access-control-allow-credentials: true
cache-control: no-cache
content-length: 2912
content-type: text/html
date: Fri, 20 Sep 2024 10:55:27 GMT
etag: "66eb06a8-b60"
expires: Thu, 01 Jan 1970 00:00:01 GMT
last-modified: Wed, 18 Sep 2024 16:58:16 GMT
server: nginx
vary: Accept-Encoding
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
Note that in the output above curl itself fell back from the IPv6 attempt to IPv4. The other question is: why doesn't Netdata try IPv4 when IPv6 is not accessible?
There are some improvements in this upstream pull request:
netdata:master ← stelfrag:aclk_reduce_timeout (opened 02:47PM - 17 Sep 24 UTC)
Summary (when establishing an ACLK connection):
- Reduce the initial connection timeout to 10 seconds (down from 30)
- Fall back to an IPv4 connection if the IPv6 connection attempt times out
- On a successful connection, reset the fallback flag so that IPv6 will be retried if the connection needs to be re-established
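(To approximate the behavior this PR describes from the shell, as an illustrative sketch assuming curl is available: try IPv6 first with a 10-second timeout, then fall back to IPv4 on failure:)
$ curl -6 --connect-timeout 10 -sI https://app.netdata.cloud >/dev/null \
    || curl -4 --connect-timeout 10 -sI https://app.netdata.cloud >/dev/null
$ echo $?  # 0 means one of the two attempts succeeded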
I’m using the stable branch of Netdata, so if these changes aren't in stable yet, they haven't reached my nodes. Great to know you've already fixed this for future releases. Thank you!