For one of our spaces, Netdata Cloud now just presents a “500, oops something unexpected happened” error when loading it. That happens on the “overview” page or any of the individual node pages; the “Nodes” tab shows the node list (but no data).
I’ve been working on upgrading our Netdata agents (we currently use the ones from packagecloud, but we’re still on 16.04 so that’s no longer an option; I’m building a .deb on our CI instead). Not sure if that has anything to do with this.
There are some errors in the Chromium console (I’ve tried this with several browsers). There are two agents in this space; one is from the new package I’ve built and one is the from-source build; both show connecting with the new protobuf API in their error.log:
2022-03-07 15:20:17: netdata INFO : MAIN : Starting ACLK sync thread for host bbc44530-ac34-11eb-aaf6-0ad3686b6741 -- scratch area 786952 bytes
2022-03-07 15:20:23: netdata INFO : ACLK_Main : thread created with task id 30512
2022-03-07 15:20:23: netdata INFO : ACLK_Main : set name of thread 30512 to ACLK_Main
2022-03-07 15:20:23: netdata INFO : ACLK_Main : Waiting for Cloud to be enabled
2022-03-07 15:20:28: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-03-07 15:20:28: netdata INFO : ACLK_Main : Attempting connection now
2022-03-07 15:20:28: netdata INFO : ACLK_Stats : thread created with task id 30761
2022-03-07 15:20:28: netdata INFO : ACLK_Stats : set name of thread 30761 to ACLK_Stats
2022-03-07 15:20:33: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-03-07 15:20:33: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-03-07 15:20:33: netdata INFO : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-03-07 15:20:34: netdata INFO : ACLK_Main : HTTPS "GET" request to "api.netdata.cloud" finished with HTTP code: 200
2022-03-07 15:20:34: netdata INFO : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-03-07 15:20:34: netdata INFO : ACLK_Main : HTTPS "POST" request to "api.netdata.cloud" finished with HTTP code: 201
2022-03-07 15:20:34: netdata INFO : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-03-07 15:20:34: netdata INFO : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-03-07 15:20:35: netdata INFO : ACLK_Main : ACLK connection successfully established
2022-03-07 15:20:35: netdata INFO : ACLK_Main : Starting 2 query threads.
2022-03-07 15:20:35: netdata INFO : ACLK_Query_0 : thread created with task id 30773
2022-03-07 15:20:35: netdata INFO : ACLK_Query_0 : set name of thread 30773 to ACLK_Query_0
2022-03-07 15:20:35: netdata INFO : ACLK_Query_1 : thread created with task id 30774
2022-03-07 15:20:35: netdata INFO : ACLK_Query_1 : set name of thread 30774 to ACLK_Query_1
2022-03-07 15:20:35: netdata INFO : ACLK_Main : Queuing status update for node=60f04e26-267f-414d-a195-2be94e7cea49, live=1, hops=0
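For anyone wanting to check the same thing on their own agents, here is a rough sketch of a script that scans error.log for the ACLK handshake messages shown above. The line layout and message strings are inferred only from this excerpt, and the names `LINE_RE`, `HTTP_RE`, `SAMPLE`, and `summarize_aclk` are made up for illustration; they are not part of Netdata.

```python
import re

# Assumed layout, inferred from the excerpt above (not an official format):
#   "<date> <time>: netdata <LEVEL> : <THREAD> : <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata "
    r"(?P<level>\w+)\s*:\s*(?P<thread>\S+)\s*:\s*(?P<msg>.*)$"
)
# Matches the 'HTTPS "GET" request to "host" finished with HTTP code: 200' messages.
HTTP_RE = re.compile(
    r'HTTPS "(\w+)" request to "([^"]+)" finished with HTTP code: (\d+)'
)


def summarize_aclk(lines):
    """Collect the ACLK milestones seen in an iterable of log lines."""
    summary = {"protobuf_switch_at": None, "connected_at": None, "http": []}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a log line
        msg = m["msg"]
        if "Switching ACLK to new protobuf protocol" in msg:
            summary["protobuf_switch_at"] = m["ts"]
        elif "ACLK connection successfully established" in msg:
            summary["connected_at"] = m["ts"]
        else:
            http = HTTP_RE.search(msg)
            if http:
                summary["http"].append((http[1], http[2], int(http[3])))
    return summary


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
2022-03-07 15:20:33: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-03-07 15:20:33: netdata INFO : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-03-07 15:20:34: netdata INFO : ACLK_Main : HTTPS "POST" request to "api.netdata.cloud" finished with HTTP code: 201
2022-03-07 15:20:35: netdata INFO : ACLK_Main : ACLK connection successfully established
"""

result = summarize_aclk(SAMPLE.splitlines())
print(result)
```

To run it against a real log, pass an open file object instead of `SAMPLE.splitlines()`; the path to error.log depends on how the agent was installed.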
The space (or I guess its General war room) is now showing no nodes. I’m guessing that means someone is working on it? Going to leave it alone so I don’t get in your way.
Several spaces & war rooms are giving 500 errors on the overview page; in some war rooms the dashboards work, in others they don’t (giving 500 errors); I can’t even create new ones in most war rooms (even a new, blank dashboard gives a 500).
Creating new war rooms doesn’t help; they give 500s too.
I’m seeing this on my personal (non-work) account too…
Is there something I can do to help track down why this is broken?
Thank you for sharing all these updates and trying out different approaches to work around the issue.
I will follow up on this internally with the team to try to understand what the issue could be.
If any additional information is needed from your end, I will make sure to reach out.
Hi @derobert_work,
Could you please provide the output of the commands (below) for the agent of your own build?
commands: netdata -W buildinfo OR netdata -v
Thanks,
odynik
First of all, thank you for using Netdata and for bringing this to our attention.
Your Cloud issue should be resolved by now; there was a slight validation issue, which we will soon address with a fix.
Please let us know if indeed your space now works as expected.
The action we performed on your QA space has also been done for your Staging and Production spaces.
This action is related to our roll-out to a New Architecture - you can check some details on what that entails here.
Your spaces have all been migrated to the New Architecture and everything should be working for you.
I also wanted to mention that we noticed about 7 nodes on your Production space that are on v.13.1 and not proto-capable, so please upgrade them as soon as you can.
You should see a banner at the top of your Netdata Cloud page highlighting which nodes these are.
If you need help identifying those nodes, or you face any issue, let us know.
Thanks again for using Netdata and bringing this to our attention!
@hugo Thank you, that seems to have fixed everything. The one exception is anything in the Production space’s “General” room. The “Overview” tab doesn’t work there (and really, I don’t expect it to; I’d turn it off, if that were an option, and only have the “Nodes” tab). Somewhat surprisingly, but not really a problem for us, the dashboards don’t work there either; looks like a request to https://app.netdata.cloud/api/v2/spaces/57be3c29-d3e0-4ee6-a2e3-1cf74faa8183/rooms/2750cbb2-8cfb-48ed-8ecf-bf9cd71ed8f3/charts is timing out (getting Bad Gateway after ≈24s), because it’s trying to get data on all 518 nodes…
To be clear—I think it’d be perfectly reasonable to just turn the Overview tab off, and maybe even Dashboards, for war rooms with hundreds of nodes. Sure, it was cool to see aggregate traffic across our entire network, but it wasn’t actually useful (and loading that page always had the browser peg the CPU when it was working).
We’ve been through our un-upgraded nodes. They’re ones where the automation is either broken or turned off. If we still have one or two broken as of April 4th, I’m fine with them no longer working until we get them fixed.
Let us take a closer look at what’s happening with that request on the “Nodes” tab.
You are totally right, the load it takes to get the data for the Overview page on big War Rooms is really too much. We are on top of this and planning to introduce a different Overview page with some summary statistics of the War Room around: Nodes, Data Collection, Data Replication, etc.
Since you are a user with such a use case, with hundreds of Nodes, we would love to connect with you to gather some feedback, suggestions, pain points, etc.
I got some emails from Christopher Akritidis asking about a meeting too. I’d like to gather up some more folks here, but they’re involved in a data center move at the moment, so time is scarcer than normal. I’m going to try to get it scheduled eventually.