A realization that should help avoid confusion, gents: the IP 34.117.103.250 seen in the outgoing packets has nothing to do with the cloud connectivity issues we’re trying to debug. That IP is only used for anonymous telemetry via PostHog, and the traceroutes don’t show it anyway.
@idet2 can you also try running strace -e network nc app.netdata.cloud 443?
Thanks!
Also @idet2 @Ryan_S_Di_Francesco, are you running the agent as a standalone process, or within/via the Docker image?
Perhaps running the whole agent under strace -e network can help us understand the issue better. It is going to produce an enormous trace file but
@Christopher_Akritid1 : I understand and you are most probably right!
It was an oversight: both “traceroute” commands ended up at the same host, so there is no real difference! I am sorry if this caused more trouble than it should have.
Anyway, even today, with the “/etc/hosts” entry in place I can reach the service and it is available, but when the entry is removed I cannot!
So, since we end up at the same host either way, why doesn’t it work without the “/etc/hosts” entry?
I can also confirm that “tcpdump” shows multiple entries for the IP “35.196.244.138” when the “/etc/hosts” entry is in place, but none when it is removed.
@Konstantinos_Natsaki : Here is the result:
# strace -e network nc app.netdata.cloud 443
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("213.133.98.98")}, 16) = 0
sendmmsg(3, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="2W\1\0\0\1\0\0\0\0\0\0\3app\7netdata\5cloud\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\242o\1\0\0\1\0\0\0\0\0\0\3app\7netdata\5cloud\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}], 2, MSG_NOSIGNAL) = 2
recvfrom(3, "2W\201\200\0\1\0\1\0\0\0\0\3app\7netdata\5cloud\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("213.133.98.98")}, [28->16]) = 51
recvfrom(3, "\242o\201\200\0\1\0\0\0\1\0\0\3app\7netdata\5cloud\0\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("213.133.98.98")}, [28->16]) = 100
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("35.196.244.138")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
We are still baffled by this @idet2
All the hosts file is supposed to do is override DNS for name resolution. The only real difference between the two traceroutes, with and without the hosts entry, is in the last line. Without an entry, traceroute does a reverse IP lookup and shows the PTR record, which Google owns for that IP. With the entry in the hosts file, it figures out it doesn’t need to do that and just shows app.netdata.cloud, but it’s the same machine either way.
I replicated your results with the Windows tracert command and verified my suspicion that it skips the reverse IP lookup (see also the Server Fault question “Where does tracert get its FQDN? And why can they differ?”).
Let’s see if Konstantinos or anyone else can make sense of this.
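To make the point about the hosts file concrete, here is a minimal Python sketch of that lookup order. It assumes the common resolver setup where the hosts file is consulted before DNS. The names parse_hosts, resolve and the dns_lookup callback are made up for this illustration and are not part of any real resolver API.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to its first listed address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)
    return table


def resolve(name, hosts_text, dns_lookup):
    """Return (address, source); a hosts entry wins over the DNS lookup."""
    table = parse_hosts(hosts_text)
    if name.lower() in table:
        return table[name.lower()], "hosts"
    return dns_lookup(name), "dns"


def fake_dns(name):
    # Hypothetical stand-in for a real DNS query, returning the IP from the thread.
    return "35.196.244.138"


print(resolve("app.netdata.cloud", "35.196.244.138 app.netdata.cloud\n", fake_dns))
print(resolve("app.netdata.cloud", "127.0.0.1 localhost\n", fake_dns))
```

Either way the caller ends up with the same address; only the source of the answer differs, which is why the hosts entry alone should not change which machine is contacted.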
All my agents run as a standalone process. Can you elaborate on how to run the agent under strace? I’m not familiar.
Ryan DiFrancesco
NYU Division of Libraries
Manager, IT Systems
212.998.2493
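On running the agent under strace: one possible approach, sketched here as an assumption rather than an official procedure, is to attach strace to the already running netdata process instead of starting the agent through it. The small Python helper below only builds the command line; the function name, the output path and the PID are placeholders chosen for this example.

```python
import shlex


def strace_attach_command(pid, out_path="/tmp/netdata.strace"):
    """Build an strace command line that attaches to a running process."""
    return [
        "strace",
        "-f",                   # follow threads and child processes
        "-e", "trace=network",  # only network-related system calls
        "-o", out_path,         # write the trace to a file, not the terminal
        "-p", str(pid),         # attach to an already running process
    ]


# 12345 is a placeholder PID; the real one can be found with e.g. pidof netdata.
print(shlex.join(strace_attach_command(12345)))
```

The printed command needs root (or the same user as the agent) to attach, and the trace file will grow quickly, so it is best left running only long enough to capture one failed connection attempt.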
Regarding the “downgrade” procedure, I don’t see us having anything in the docs. I did find a workaround using kickstart.sh though:
- Download kickstart.sh but don’t run it
- Open the file and look for “latest=”. You will find it under a check for the “stable” version option.
- Add another line underneath it, to specify the version you want. e.g. “latest=v1.30.1”
- chmod the script and run it with “./kickstart.sh --stable-channel --reinstall”.
Trying the new ACLK implementation can’t be combined with downgrading, so it’s one or the other. To use the new ACLK, just call the unedited kickstart.sh with “--aclk-ng”. If it installs successfully, localhost:19999/api/v1/info will show a parameter called _aclk_impl with the value “Next Generation”.
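The check on /api/v1/info can be scripted. The Python sketch below is only an illustration: it searches the whole JSON document for the _aclk_impl key, because where exactly that key is nested in the response is not stated here and is treated as unknown. The function names are made up for this example.

```python
import json
from urllib.request import urlopen


def find_key(node, key):
    """Depth-first search for key in decoded JSON; returns None if absent."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def aclk_implementation(info_json_text):
    """Return the reported _aclk_impl value, or None if it is not present."""
    return find_key(json.loads(info_json_text), "_aclk_impl")


def fetch_aclk_implementation(url="http://localhost:19999/api/v1/info"):
    """Fetch the info endpoint from a local agent and extract _aclk_impl."""
    with urlopen(url, timeout=5) as response:
        return aclk_implementation(response.read().decode("utf-8"))
```

Calling fetch_aclk_implementation() on the node itself should return “Next Generation” after a successful install with the new ACLK, and something else (or None) otherwise.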
@Christopher_Akritid1 : Right now I am at “netdata v1.31.0-41-nightly”.
How many versions back should I go? Any preferable? Unfortunately I don’t have any idea when this was broken
We are not aware of any change that could be related to this, so it’s hard to say which version to go back to. But I may be able to see a history of the versions your agents connected with in the past; just send to chris@netdata.cloud the URL you see when you are logged in. Part of it contains a unique identifier for your space or room.
Good morning!
Since the problem appeared I have deleted my War Rooms, Spaces etc. so it would be a mess to try and find it.
Therefore I took action on my side and tried previous versions as you have suggested.
Failure
Version: netdata v1.31.0
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws=externaldeps/libwebsockets' '--with-libJudy=externaldeps/libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
Cloud Implementation: Legacy
TLS Host Verification: YES
Libraries:
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
LWS: YES static v3.2.2
mosquitto: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO
Success
Version: netdata v1.30.1
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws=externaldeps/libwebsockets' '--with-libJudy=externaldeps/libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
Cloud Implementation: Legacy
TLS Host Verification: YES
Libraries:
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
LWS: YES static v3.2.2
mosquitto: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO
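Comparing the two build reports by eye is error-prone, so here is a small Python sketch that diffs them field by field. It assumes the flattened “Key: value” layout shown above; the function names are made up for this example, and the sample strings are shortened copies of the two reports.

```python
def parse_buildinfo(text):
    """Turn 'Key: value' lines into a dict; section headers are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(failing, working):
    """Return {field: (failing value, working value)} for fields that differ."""
    a, b = parse_buildinfo(failing), parse_buildinfo(working)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


failing = """Version: netdata v1.31.0
Features:
Cloud Implementation: Legacy
Libraries:
LWS: YES static v3.2.2
"""
working = """Version: netdata v1.30.1
Features:
Cloud Implementation: Legacy
Libraries:
LWS: YES static v3.2.2
"""
print(differing_fields(failing, working))
```

On these shortened samples the only differing field is Version, which matches what the full reports above show: same features, same libraries, same Legacy cloud implementation in both builds.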
So it seems that v1.30.1 is working… I will go through the commits to see if anything makes sense but probably the one mentioned earlier by @underhood regarding IPv6 in LibWebSockets looks like a “suspect”.
EDIT: Forget about that… I see that between v1.30.1 and v1.31.0 there are multiple commits affecting ACLK. Apart from the typo fixes, could any of the others be to blame? I mean, given that the “/etc/hosts” entry makes it work, could any of those changes be affected by the reverse name lookup?
@Ryan_S_Di_Francesco : Could you please try as @Christopher_Akritid1 suggested and see if going back to v1.30.1 works for you also?
Well, git log v1.30.0...v1.30.1 --oneline shows this (I added stars to anything that I think can affect ACLK; there is not much):
git log v1.30.0...v1.30.1 --oneline
d5ea3cc5c (tag: v1.30.1) [ci skip] release v1.30.1
ed541dd4e [netdata patch release] v1.30.1
8bc32ad6e [ci skip] create nightly packages and update changelog
53b51bc80 Don’t use glob expansion in argument to `cd` in updater. (#10936)
e8f3c58be [ci skip] create nightly packages and update changelog
f800a4f3a Add a CRASH event when the agent fails to properly shutdown (#10893)
ee64ef04f Fix memory corruption issue when executing context queries in RAM/SAVE memory mode (#10933)
81391f0b6 Update CODEOWNERS (#10928)
0ca007837 Fix incorrect health log entries (#10822)
f85c28edf [ci skip] create nightly packages and update changelog
f6ff3ec9e Update README.md (#10898)
47a4ac71e Update news and GIF in README, fix typo (#10900)
25be8b833 Add libzstd-dev (#10925)
a389f1a17 [ci skip] create nightly packages and update changelog
***fbc69e721 Bump version of OpenSSL bundled in static builds to 1.1.1k (#10884)
f39406c9b Spelling build (#10428)
bc1ff185b Properly handle binary package reuplods. (#10878)
0068b7c11 [ci skip] create nightly packages and update changelog
***0df192eb7 Fixed bundling of ACLK-NG components in dist tarballs. (#10894)
bc9ce7adc [ci skip] create nightly packages and update changelog
Hi @underhood !
I guess you made a typo… v1.30.1 is the one that works (you have that right), but you should compare it against v1.31.0 (which does not work) and NOT v1.30.0.
A quick check on the releases page revealed the following:
- Implement ACLK env endpoint #10833
- Implement new HTTPS client for ACLK #10805
- Update ACLK passwd endpoint to match specifications of the new architecture #10859
- Implement ACLK new backoff (TBEB) architecture #10941
- Some others related to spelling, unit conversion and charts (but I think they are irrelevant)
The good thing here is that you, @underhood, are probably the most appropriate person to help us with this, since you appear to be the one who implemented/pushed the changes.
Let me know what you think.
Hi idet2,
thanks for pointing that out… need to be more careful…
hmmm… #10833, #10805, #10859, #10941 should normally touch ACLK-NG only, your buildinfo seems to indicate you use ACLK Legacy (which uses libwebsockets for HTTPS communication instead of my code (seen in those PRs you mentioned)).
I will check those PRs to make sure they don’t accidentally change anything in aclk/legacy (which is the code that you use).
ACLK-NG removes libwebsockets (which has proven problematic for us multiple times) and uses our own HTTPS client implementation which is one of the reasons I wanted to try to see what happens if you use NG.
However if you use aclk legacy you are not using any of that code.
I see what you are saying.
For installation I have just used the one line installation script as described in the docs which is:
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
I wasn’t aware of ACLK-NG until now, nor of the problems that you are facing with LWS.
I can try switching to ACLK-NG if you guys think it is more stable and better, no problem here.
Additionally, if LWS is causing so much trouble, then maybe you should consider dropping it completely.
But the question still remains for me, and I would like to see a solid solution (since it can happen again in the future, or somebody else may face it):
What changed between v1.30.1 and v1.31.0 so that the latter stopped working (out of the blue) without the “/etc/hosts” entry?
Absolutely! I will go through the commits… I would do a git bisect to find the exact breaking commit, but I can’t reproduce it here.
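For readers not familiar with git bisect, the idea behind it is a binary search over the commit history. The Python sketch below shows only that idea, under the assumptions spelled out in the docstring; the commit names and the is_bad test are hypothetical placeholders, not real Netdata commits.

```python
def first_bad(commits, is_bad):
    """Binary search for the first commit where is_bad() returns True.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is known good, commits[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breaking commit is at mid or earlier
        else:
            lo = mid  # the breaking commit is after mid
    return commits[hi]


# Hypothetical history between the working and the failing release.
history = ["v1.30.1", "c1", "c2", "c3", "c4", "v1.31.0"]
print(first_bad(history, lambda commit: history.index(commit) >= 3))
```

Each call to is_bad stands for building that commit and checking whether the agent connects, so it has to be evaluated on a machine where the problem actually reproduces.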
I know… that’s weird though… I mean, so far the only difference we have between a working and a non-working setup is the static entry for “app.netdata.cloud” in “/etc/hosts”. Maybe your network team already has something in place so there aren’t any DNS lookups… I am afraid I cannot help you further with that, but if you think of something I can do, please let me know…
Maybe your network team has already something in place so there aren’t any DNS lookups
That is an interesting idea, but unfortunately it can’t be that, as we are a fully distributed team… My work PC is not managed by any IT department or anything like that. The DNS I use is that of my DSL provider (which is different for pretty much everyone else on the team, as we are in different countries), and there is no netdata IP in /etc/hosts.
Thanks for all the help. Will try to figure this out.
Based on the strace in post #43 by idet2 in this thread (Nodes unreachable (errno 99, Cannot assign requested address)), it looks like DNS resolution is working all right.
You can see the connection attempt to 35.196.244.138 (which is the correct IP for app.netdata.cloud), after a DNS resolution.
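One way to double-check that reading of the trace is to decode the DNS headers of the two replies. The Python sketch below unpacks the fixed 12-byte header defined in RFC 1035; the byte strings are the first 12 bytes of the two recvfrom buffers copied from the strace output, and the function name is made up for this example.

```python
import struct


def dns_header(packet):
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", packet[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),
        "rcode": flags & 0x000F,  # 0 means no error
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


# First 12 bytes of the two recvfrom() buffers, copied from the strace output.
first = b"2W\201\200\0\1\0\1\0\0\0\0"
second = b"\242o\201\200\0\1\0\0\0\1\0\0"

for reply in (first, second):
    print(dns_header(reply))
```

Both replies carry rcode 0. The first reports one answer record and the second reports none, only one authority record. The 51-byte length of the first reply is consistent with a single A record, and the second looks like the empty reply to a parallel AAAA query, although the trace truncates the question section, so the record types are an inference.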
Coming back into this post after a few more days of testing and trial and error.
Here are my new findings and comments for you to review:
- Created a new VM from scratch on a different cloud provider: it connected successfully.
- Created a new VM from scratch on the same cloud provider, in the same datacenter (different IP range, though): it also connected successfully.
- Completely removed Netdata (using the uninstaller script) and reinstalled it from scratch with the one-line “kickstart” installation script: once again I could not connect, with the same errors as before. Adding the “/etc/hosts” entry fixed the issue.
- Completely removed Netdata (using the uninstaller script) and reinstalled it from scratch with the one-line “kickstart” installation script, this time including the “--aclk-ng” parameter: I could not see the node online, but see below for more information, since it is weird. I verified that “_aclk_impl” was set to “Next Generation” in the output of “/api/v1/info”.
What I could see in the error.log was:
Error Log Excerpt1 (without /etc/hosts)
2021-06-12 10:43:33: netdata INFO : ACLK_Main : Attempting connection now
2021-06-12 10:43:33: netdata INFO : ACLK_Stats : thread created with task id 1148716
2021-06-12 10:43:33: netdata INFO : ACLK_Stats : set name of thread 1148716 to ACLK_Stats
2021-06-12 10:43:33: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
2021-06-12 10:43:33: netdata INFO : ACLK_Main : Getting Cloud /env successful
2021-06-12 10:43:34: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 409
2021-06-12 10:43:34: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409)
2021-06-12 10:43:34: netdata ERROR : ACLK_Main : Cloud returned EC=“TODO trace-id”, Msg-Key:“ErrDuplicatedChallenge”, Msg:“delay retry 1m0s: duplicated challenge”, BlockRetry:false, Backoff:60s (-1 unset by cloud)
2021-06-12 10:43:34: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP
2021-06-12 10:43:34: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 60.000 seconds
After restarting the service, the logs become:
Error Log Excerpt2 (without /etc/hosts)
2021-06-12 10:59:25: netdata INFO : ACLK_Main : Attempting connection now
2021-06-12 10:59:25: netdata INFO : ACLK_Stats : thread created with task id 1153722
2021-06-12 10:59:25: netdata INFO : ACLK_Stats : set name of thread 1153722 to ACLK_Stats
2021-06-12 10:59:25: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
2021-06-12 10:59:25: netdata INFO : ACLK_Main : Getting Cloud /env successful
2021-06-12 10:59:26: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
2021-06-12 10:59:26: netdata INFO : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2021-06-12 10:59:26: netdata INFO : ACLK_Main : HTTPS “POST” request to “app.netdata.cloud” finished with HTTP code: 201
2021-06-12 10:59:26: netdata INFO : ACLK_Main : ACLK_OTP Got Password from Cloud
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “agent-connection”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “create-node-instance”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “node-instance-connection”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “inbox-cmd-v1”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “chart-and-dims-updated”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “reset-charts”
2021-06-12 10:59:26: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “chart-configs-updated”
2021-06-12 10:59:27: netdata INFO : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2021-06-12 10:59:27: netdata INFO : ACLK_Main : MQTTWSS connection succeeded
2021-06-12 10:59:27: netdata ERROR : ACLK_Main : ACLK localhost popocorn wait 1 seconds longer
2021-06-12 10:59:28: netdata ERROR : ACLK_Main : ACLK localhost popocorn finished
2021-06-12 10:59:28: netdata INFO : ACLK_Main : Starting 2 query threads.
Which means that the host should be available.
However, the Netdata web page showed the host only as “unreachable”, which is weird (to say the least), since at the same time I was getting e-mail alerts about “10s_ipv4_tcp_resets_received” entering warning and clearing.
Adding the “/etc/hosts” entry produces the same weird behavior: the agent appears to be connected, but my browser still shows it as unreachable after waiting for many minutes.
Error Log Excerpt (with /etc/hosts)
2021-06-12 10:46:29: netdata INFO : ACLK_Main : Attempting connection now
2021-06-12 10:46:29: netdata INFO : ACLK_Stats : thread created with task id 1150226
2021-06-12 10:46:29: netdata INFO : ACLK_Stats : set name of thread 1150226 to ACLK_Stats
2021-06-12 10:46:29: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
2021-06-12 10:46:29: netdata INFO : ACLK_Main : Getting Cloud /env successful
2021-06-12 10:46:30: netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
2021-06-12 10:46:30: netdata INFO : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2021-06-12 10:46:30: netdata INFO : ACLK_Main : HTTPS “POST” request to “app.netdata.cloud” finished with HTTP code: 201
2021-06-12 10:46:30: netdata INFO : ACLK_Main : ACLK_OTP Got Password from Cloud
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “agent-connection”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “create-node-instance”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “node-instance-connection”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “inbox-cmd-v1”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “chart-and-dims-updated”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “reset-charts”
2021-06-12 10:46:30: netdata INFO : ACLK_Main : topic dictionary has unknown topic name “chart-configs-updated”
2021-06-12 10:46:31: netdata INFO : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2021-06-12 10:46:31: netdata INFO : ACLK_Main : MQTTWSS connection succeeded
2021-06-12 10:46:31: netdata ERROR : ACLK_Main : ACLK localhost popocorn wait 1 seconds longer
2021-06-12 10:46:32: netdata ERROR : ACLK_Main : ACLK localhost popocorn finished
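To compare the three excerpts without reading every line, a small script can list the HTTP status codes and the connection milestones each one reached. This Python sketch matches on message text copied from the excerpts above; the milestone labels and the function name are my own, and the sample is the first excerpt with timestamps trimmed.

```python
import re

HTTP_CODE = re.compile(r"finished with HTTP code: (\d+)")
MILESTONES = [
    ("env", "Getting Cloud /env successful"),
    ("challenge", "ACLK_OTP Got Challenge from Cloud"),
    ("password", "ACLK_OTP Got Password from Cloud"),
    ("mqtt", "MQTTWSS connection succeeded"),
]


def summarize(log_text):
    """Return (HTTP status codes in order, milestones that appear in the log)."""
    codes = [int(m.group(1)) for m in HTTP_CODE.finditer(log_text)]
    reached = [name for name, marker in MILESTONES if marker in log_text]
    return codes, reached


excerpt_1 = """\
netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 200
netdata INFO : ACLK_Main : Getting Cloud /env successful
netdata INFO : ACLK_Main : HTTPS “GET” request to “app.netdata.cloud” finished with HTTP code: 409
netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409)
"""
print(summarize(excerpt_1))
```

For the first excerpt this gives the codes 200 and 409 and only the env milestone, while the other two excerpts contain all four milestone messages, so according to the agent logs the connection itself succeeded in both of those runs.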
I am busting my head here people… Any other thoughts or suggestions would be awesome.
I have changed to the “Next Generation” ACLK as suggested. This time the logs say I am connected and I am getting e-mail alerts, yet at the same time the web UI shows the host as “unreachable”.
What are the chances of something being broken with previous versions for my account in the backend?
I mean I am really tempted to downgrade with “aclk-ng” enabled to see what the hell will happen but I would like your comments first.
So, do you have any ideas @underhood , @Christopher_Akritid1 , @Konstantinos_Natsaki ?
EDIT : Forgot to say that the “unavailable” on the web could be seen from different browsers (Firefox and Chrome) using both normal and private browsing.