500 error loading one space (but not others)

For one of our spaces, Netdata Cloud now just presents a “500, oops something unexpected happened” error when loading it. That happens on the “Overview” page and on any of the individual node pages; the “Nodes” tab shows the node list (but no data).

I’ve been working on upgrading our Netdata agents (we currently use the ones from packagecloud, but we’re still on 16.04 so that’s no longer an option; I’m building a .deb on our CI instead). Not sure if that has anything to do with this.

There are some errors in the Chromium console (I’ve tried this with several browsers). There are two agents in this space; one is from the new package I’ve built and one is the from-source build; both show connecting with the new protobuf API in their error.log.

Console:
Too long to paste here (it exceeds the length limit), so I put it on Pastebin: Netdata Chromium console - Pastebin.com

grep ACLK error.log:

2022-03-07 15:20:17: netdata INFO  : MAIN : Starting ACLK sync thread for host bbc44530-ac34-11eb-aaf6-0ad3686b6741 -- scratch area 786952 bytes
2022-03-07 15:20:23: netdata INFO  : ACLK_Main : thread created with task id 30512
2022-03-07 15:20:23: netdata INFO  : ACLK_Main : set name of thread 30512 to ACLK_Main
2022-03-07 15:20:23: netdata INFO  : ACLK_Main : Waiting for Cloud to be enabled
2022-03-07 15:20:28: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-03-07 15:20:28: netdata INFO  : ACLK_Main : Attempting connection now
2022-03-07 15:20:28: netdata INFO  : ACLK_Stats : thread created with task id 30761
2022-03-07 15:20:28: netdata INFO  : ACLK_Stats : set name of thread 30761 to ACLK_Stats
2022-03-07 15:20:33: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-03-07 15:20:33: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-03-07 15:20:33: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-03-07 15:20:34: netdata INFO  : ACLK_Main : HTTPS "GET" request to "api.netdata.cloud" finished with HTTP code: 200
2022-03-07 15:20:34: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-03-07 15:20:34: netdata INFO  : ACLK_Main : HTTPS "POST" request to "api.netdata.cloud" finished with HTTP code: 201
2022-03-07 15:20:34: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-03-07 15:20:34: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-03-07 15:20:35: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-03-07 15:20:35: netdata INFO  : ACLK_Main : Starting 2 query threads.
2022-03-07 15:20:35: netdata INFO  : ACLK_Query_0 : thread created with task id 30773
2022-03-07 15:20:35: netdata INFO  : ACLK_Query_0 : set name of thread 30773 to ACLK_Query_0
2022-03-07 15:20:35: netdata INFO  : ACLK_Query_1 : thread created with task id 30774
2022-03-07 15:20:35: netdata INFO  : ACLK_Query_1 : set name of thread 30774 to ACLK_Query_1
2022-03-07 15:20:35: netdata INFO  : ACLK_Main : Queuing status update for node=60f04e26-267f-414d-a195-2be94e7cea49, live=1, hops=0
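For anyone checking the same thing on their own agents, here is a minimal sketch that scans error.log lines for the two ACLK_Main messages seen above (the protobuf switch and the established connection). The line format is assumed from this excerpt only, and `aclk_summary` is a hypothetical helper name, not part of Netdata.

```python
import re

# Line format assumed from the log excerpt above:
# "YYYY-MM-DD HH:MM:SS: netdata LEVEL  : THREAD : message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata (?P<level>\w+)\s*: "
    r"(?P<thread>\S+) : (?P<msg>.*)$"
)


def aclk_summary(lines):
    """Return (uses_protobuf, connected) based on ACLK_Main messages."""
    uses_protobuf = False
    connected = False
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("thread") != "ACLK_Main":
            continue  # skip other threads and lines that do not match the format
        msg = m.group("msg")
        if "Switching ACLK to new protobuf protocol" in msg:
            uses_protobuf = True
        elif "ACLK connection successfully established" in msg:
            connected = True
    return uses_protobuf, connected


sample = [
    "2022-03-07 15:20:33: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.",
    "2022-03-07 15:20:35: netdata INFO  : ACLK_Main : ACLK connection successfully established",
]
print(aclk_summary(sample))
```

In practice the lines would come from the agent's error.log rather than the two-line sample used here.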

The space (or I guess its General war room) is now showing no nodes. I’m guessing that means someone is working on it? Going to leave it alone so I don’t get in your way.

It was still like that this afternoon, so I re-added the two nodes to the “General” war room. Looks good.

Several spaces and war rooms are giving 500 errors on the overview page. In some war rooms the dashboards work; in others they don’t (giving 500 errors). I can’t even create new ones in most war rooms (even a new, blank dashboard gives a 500).

Creating new war rooms doesn’t help, they give 500s too.

I’m seeing this on my personal (non-work) account too…

Is there something I can do to help track down why this is broken?

Using the browser inspector, the only failure I see is that it’s getting an unsupported agent error:

POST /api/v1/spaces/51175f99-c379-4a47-a963-176e5c52e1bd/rooms/fef2661e-2e16-46ce-8c8f-d9a5ce192e75/charts HTTP/1.1
Host: app.netdata.cloud
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Content-Type: application/json
Content-Length: 87
Origin: https://app.netdata.cloud
Connection: keep-alive
Referer: https://app.netdata.cloud/spaces/antz-qa/rooms/test2/overview
⋮
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
DNT: 1
Sec-GPC: 1

gets:

HTTP/2 200 OK
content-length: 195
content-type: application/json; charset=utf-8
date: Wed, 16 Mar 2022 18:32:48 GMT
netdata-request-id: uohY73Xcw4-10661918
vary: Accept-Encoding
X-Firefox-Spdy: h2

{"results":{},"errors":[{"errorMsgKey":"ErrUnsupportedAgentVersion","errorMessage":"unsupported agent version","errorCode":"uohY73Xcw4-10661918","nodeID":"40fa7c68-faad-4801-bbc0-0a67dd7cd9a9"}]}

But the node is running 1.33.1 (our build), and pulling up the node from the Nodes page works fine.

I tried re-adding the node and it didn’t help.
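Note that the response above comes back as HTTP 200 with the failure reported inside the JSON body, so a status-code check alone would miss it. A minimal sketch of pulling the per-node errors out of such a body, assuming only the field names visible in the response pasted above (`errors_by_key` is a hypothetical helper, not a Netdata API):

```python
import json

# Response body copied from the request shown above.
body = (
    '{"results":{},"errors":[{"errorMsgKey":"ErrUnsupportedAgentVersion",'
    '"errorMessage":"unsupported agent version",'
    '"errorCode":"uohY73Xcw4-10661918",'
    '"nodeID":"40fa7c68-faad-4801-bbc0-0a67dd7cd9a9"}]}'
)


def errors_by_key(raw):
    """Group node IDs by errorMsgKey from a charts response body."""
    grouped = {}
    for err in json.loads(raw).get("errors", []):
        key = err.get("errorMsgKey", "unknown")
        grouped.setdefault(key, []).append(err.get("nodeID"))
    return grouped


print(errors_by_key(body))
```

Grouping by `errorMsgKey` makes it easy to list which node IDs the Cloud considers unsupported and compare them against the versions the agents actually report.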

Hi @derobert_work ,

Thank you for sharing all these updates and trying out different approaches to work around the issue.

I will follow up on this internally with the team to try to understand what the issue could be.
If any additional information is needed from your end, I will make sure to reach out.

Regards,
Hugo

Hi @derobert_work ,
Could you please provide the output of one of the commands below for the agent from your own build?
Commands: netdata -W buildinfo OR netdata -v
Thanks,
odynik

Hello @derobert_work,

First of all, thank you for using Netdata and for bringing this to our attention.

Your Cloud issue should have been resolved by now; there was a slight validation issue, which we will soon address with a fix.
Please let us know if indeed your space now works as expected.

Thanks!

Sure:

Version: netdata gtl-v1.33.1
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'
Install type: custom
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         NO
Libraries:
    protobuf:                YES (bundled)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    NO
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
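
A minimal sketch for checking a buildinfo dump like the one above for protobuf support without reading it by eye. It assumes the layout shown here (indented "name: value" lines under unindented section headers); `parse_buildinfo` is a hypothetical helper, and the sample text is a shortened copy of the output above.

```python
def parse_buildinfo(text):
    """Map each indented 'name: value' line to True when the value starts with YES."""
    flags = {}
    for line in text.splitlines():
        if not line.startswith(" "):
            continue  # section headers, Version and Configure lines are not indented
        name, sep, value = line.strip().partition(":")
        if sep and value.strip():
            flags[name] = value.split()[0] == "YES"
    return flags


sample = """Features:
    Netdata Cloud:              YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
Libraries:
    protobuf:                YES (bundled)
"""

flags = parse_buildinfo(sample)
print(flags["ACLK-NG New Cloud Protocol"], flags["ACLK Legacy"])
```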

Our QA space is working again now, thank you! The staging and production spaces are still having issues.

Hi @derobert_work ,

Thanks for confirming that all is good with your QA space, we will also look into your staging and production spaces and share any updates here.

Regards,
Hugo

Hi @derobert_work ,

The action we performed on your QA space has also been done for your Staging and Production spaces.
This action is related to our rollout of a New Architecture; you can check some details on what that entails here.

Your spaces have all been migrated to the New Architecture and everything should be working for you.

I also wanted to mention that we noticed your Production space has about 7 nodes that are on v.13.1 and are not proto-capable, so please upgrade them as soon as you can.
You should see a banner at the top of your Netdata Cloud page highlighting which nodes these are.

If you need help identifying those nodes, or you face any issue, let us know.

Thanks again for using Netdata and bringing this to our attention!

Regards,
Hugo

@hugo Thank you, that seems to have fixed everything. The one exception is anything in the Production space’s “General” room. The “Overview” tab doesn’t work there (and really, I don’t expect it to; I’d turn it off, if that were an option, and only have the “Nodes” tab). Somewhat surprisingly, but not really a problem for us, the dashboards don’t work there either; looks like a request to https://app.netdata.cloud/api/v2/spaces/57be3c29-d3e0-4ee6-a2e3-1cf74faa8183/rooms/2750cbb2-8cfb-48ed-8ecf-bf9cd71ed8f3/charts is timing out (getting Bad Gateway after ≈24s), because it’s trying to get data on all 518 nodes…

To be clear—I think it’d be perfectly reasonable to just turn the Overview tab off, and maybe even Dashboards, for war rooms with hundreds of nodes. Sure, it was cool to see aggregate traffic across our entire network, but it wasn’t actually useful (and loading that page always had the browser peg the CPU when it was working).

We’ve been through our un-upgraded nodes. They’re ones where the automation is either broken or turned off. If we still have one or two broken as of April 4th, I’m fine with them no longer working until we get them fixed.

Let us take a closer look at what’s happening with that request on the “Nodes” tab.

You are totally right, the load it takes to get the data for the Overview page on big War Rooms is really too much. We are on top of this and planning to introduce a different Overview page with some summary statistics of the War Room around: Nodes, Data Collection, Data Replication, etc.

Since you are a user with such a use case, with hundreds of nodes, we would love to connect with you to gather some feedback, suggestions, pain points, etc.

Let me know if you would be interested.

Regards,
Hugo

I got some emails from Christopher Akritidis asking about a meeting too. I’d like to gather up some more folks here, but they’re involved in a data center move at the moment, so time is scarcer than normal. I’m going to try to get it scheduled eventually.
