My dashboards keep "dying" and becoming useless so I have to recreate them. How can I stop this happening?

I use NetData to monitor my production app but don’t check it that often (couple of times a month) outside of when we get issues reported. I’ve been using it over a year but I’ve found in the last six months the dashboards that I create become broken after a short time and I have to recreate them. It’s very annoying and to be honest is spoiling the whole tool for me at the moment.

Here is an example of what one of these dashboards looks like:

The node is fine, I can see it in the nodes list and it’s reporting metrics. The dashboard just shows nothing. In the console there are a bunch of 404 errors like this:

Seems like the dashboard entries have gone?

Cheers
Steve

Hi,

It’s good that you don’t check Netdata that often, it means that your systems are working optimally :v:

Just to verify my understanding: you mean that the custom dashboards you create through the cloud become “unusable”, as shown in your screenshot, after a random amount of time?

Haha yeah :]

When I first set up ND I created a bunch of dashboards. They worked fine for ages (months) until a recent(ish) update. It was around the time that the non-white UI came in, I think. Anyway, I’ll create the custom dashboards, usually only connected to a single node, and they work fine. To confirm, they are on the ND Cloud product. Fast forward an unknown amount of time (maybe a couple of weeks) and they will stop working.

I’ve just gone and checked and currently it’s only affecting 2 of my nodes, but again I haven’t changed anything and the nodes are both accessible via the Node tab.

Screenshot of all my dashboards. Those marked X are completely unusable and contain only a single node. Those marked \ are partially unusable and show no data for the affected nodes.

Example of one that is half working:

Let me know if you need any more information.

Just checked and ALL my nodes are running netdata v1.26.0-333-nightly.

Is it possible to update 1 node and see if it fixes the issue?

In any case, this is troubling.

Thank you for being so thorough with your analysis.

Hello @swhitf.

I looked into this and the agent you have in the dashboard has been re-claimed. I can see multiple ids for the same hostname.

The last time we heard from that particular id was 2020-10-03 15:59:20.660894+00:00 (UTC).

Re-claiming can happen automatically in some environments.


So, if you reclaim a node, it’s basically a new agent, thus the dashboards will no longer work, right?

That is, a new agent from the PoV of the cloud.

Yes, @OdysLam. It is something we need to improve. Soon we will be able to better detect that it is the same node, even if it was claimed from scratch (using the machine_guid).
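For anyone following along, here is a minimal sketch of the machine_guid idea, assuming each claim produces a new cloud-side id while the machine_guid stays the same. The `ClaimedNode` type, its fields, and both helper functions are hypothetical names made up for illustration; this is not how Netdata Cloud is actually implemented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ClaimedNode:
    node_id: str        # hypothetical cloud-side id, assumed new on every claim
    machine_guid: str   # assumed stable across re-claims of the same agent
    hostname: str
    last_seen: datetime


def latest_node_per_machine(nodes):
    """Map each machine_guid to the most recently seen claimed node."""
    latest = {}
    for node in nodes:
        current = latest.get(node.machine_guid)
        if current is None or node.last_seen > current.last_seen:
            latest[node.machine_guid] = node
    return latest


def remap_node_id(stale_node_id, nodes):
    """Return the id a dashboard could point at after a re-claim.

    Falls back to the original id when that id is unknown.
    """
    by_id = {node.node_id: node for node in nodes}
    stale = by_id.get(stale_node_id)
    if stale is None:
        return stale_node_id
    return latest_node_per_machine(nodes)[stale.machine_guid].node_id
```

With something like this, a dashboard that still references the old id could be pointed at the newest id sharing the same machine_guid instead of showing nothing.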

@swhitf Is it possible that you run the claiming script on machine startup, or as part of another script?

Now that I think about it, I might have had an issue with this node being “not reachable” and tried reclaiming to get it working again; I have a vague memory of something like this. Given how fast time appears to be moving in my life at the moment, it’s also possible I’m way off with my timelines.

So basically, if I reclaim, I have to recreate the dashboards for the respective node? If that’s the case at least I know why it’s happening and will be more mindful next time. If it happens again and I definitely didn’t reclaim myself I’ll be back in touch. I’m not running it automatically anywhere.

Anyway, thanks for your prompt responses, great support. Love the product also, waiting for the day I can pay for it!

Cheers
Steve

Thanks for the kind words.

So basically, if I reclaim, I have to recreate the dashboards for the respective node?

Yes, this is the case for now. We already plan to improve this behavior so that, even after a reclaim, the cloud will detect it is the same node.

Also, we will communicate better that the particular node is actually offline, instead of this misleading loading graphic. Thanks for bringing this to our attention.


Back again. I just tried changing the affected dashboard and had issues saving. I think it’s possible that this bug actually happened to me a few times without me realising, and it’s why I kept thinking it was re-happening.

Steps to reproduce are in a video here:

TL/DW though is that if you have a dashboard made up of only charts from a dead node, you can’t ever get it to save and have to delete and re-create.

Let me know if you have any questions, though I’m literally going on holiday in 5 minutes so I won’t reply until next week. Laters!

Steve

Hey Steve,

Could you please verify from the network tab that the PATCH request is getting a 409 Conflict error?

Thanks,
Johnny

From our logs, I found 14 warnings from when you tried to update your dashboard, around “2021-05-27T18:47:16.925136065Z” (UTC). The reason given for those is “dashboard version conflict”.

It can happen when you try to edit a dashboard that is already being edited in another session. A simple page refresh should fix it.

We probably don’t handle it correctly.
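To make the “dashboard version conflict” concrete, here is a rough sketch of version-checked saving with a client that reloads and retries on conflict. It assumes the backend keeps a version number per dashboard and rejects updates based on a stale version. `DashboardStore`, `VersionConflict`, and `save_with_refetch` are hypothetical names; the exception stands in for the HTTP 409, and none of this is the actual Netdata Cloud code.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class DashboardStore:
    """Hypothetical in-memory stand-in for the dashboard backend."""

    def __init__(self):
        self._data = {}  # dashboard_id -> (version, content)

    def create(self, dashboard_id, content):
        self._data[dashboard_id] = (1, content)

    def get(self, dashboard_id):
        return self._data[dashboard_id]

    def update(self, dashboard_id, expected_version, content):
        version, _ = self._data[dashboard_id]
        if version != expected_version:
            # Another session saved first; the caller holds a stale copy.
            raise VersionConflict(f"expected {expected_version}, stored {version}")
        self._data[dashboard_id] = (version + 1, content)
        return version + 1


def save_with_refetch(store, dashboard_id, version, content, edit, max_attempts=2):
    """Apply `edit` and save; on a conflict, reload the latest copy and retry."""
    for _ in range(max_attempts):
        try:
            return store.update(dashboard_id, version, edit(content))
        except VersionConflict:
            version, content = store.get(dashboard_id)
    raise VersionConflict("still conflicting after reloading")
```

The point of the retry is that the reload has to return the genuinely latest version; if it returns a stale copy, every save keeps conflicting.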

@novykh I’ve replicated it; we are not handling the 409 correctly.

Also, we are missing a notification there. It will be added in the next release.
@swhitf thanks

It seems we have another issue, as the dashboard version cannot be updated even after a page refresh.

We will check it out.

@novykh We have a caching issue as well: when I refresh the page (after I modified the dashboard in another tab), the request returned a 304.

Cache related headers:

cache-control: max-age=0
if-modified-since: Wed, 19 May 2021 11:45:16 GMT
if-none-match: "60a4fa4c-a93"

We need to review our caching policies and middlewares.

If I disable the cache on chrome it works ok.
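As a rough illustration of why a 304 can leave the page stale, here is a small simulation of a conditional GET. It assumes, purely for the sake of the example, that the validator (ETag) did not change when the content did; whether that is what happened here has not been established. `conditional_get`, `Server`, and `CachingClient` are hypothetical names, and `bypass_cache=True` only approximates what a hard refresh or “disable cache” does.

```python
def conditional_get(request_headers, current_etag, body):
    """Answer a GET the way a validator-aware server would.

    Returns (status, body); body is None for a 304 because the client
    is expected to reuse its cached copy.
    """
    if request_headers.get("if-none-match") == current_etag:
        return 304, None
    return 200, body


class Server:
    """Hypothetical server holding one resource and its ETag."""

    def __init__(self, body, etag):
        self.body = body
        self.etag = etag

    def handle(self, headers):
        status, body = conditional_get(headers, self.etag, self.body)
        return status, body, self.etag


class CachingClient:
    """Hypothetical browser-like client with a one-entry cache."""

    def __init__(self):
        self.etag = None
        self.body = None

    def fetch(self, server, bypass_cache=False):
        headers = {"cache-control": "max-age=0"}
        if self.etag is not None and not bypass_cache:
            headers["if-none-match"] = self.etag  # revalidate the cached copy
        status, body, etag = server.handle(headers)
        if status == 200:
            self.etag, self.body = etag, body
        return status, self.body
```

In this sketch, if the server’s body changes while its ETag stays the same, a normal fetch gets a 304 and keeps showing the old body, while a fetch that skips the validator gets the new one.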

@swhitf A hard refresh (Shift+F5) should be enough as a quick fix until we fix it on our end.

We haven’t paid much attention to dashboards lately, but your feedback was really valuable for pinpointing all those issues. We will keep you posted on the fixes.