Monitoring glusterfs

Kasper_Lund · October 28, 2020, 12:33pm

Hi all,

I’m using Netdata to monitor an HPC setup with Gluster as the storage backend. It seems that Netdata is not able to monitor Gluster mountpoints. Hopefully I have misunderstood this?

BR Kasper

OdysLam · November 6, 2020, 7:29am

Yes absolutely,

I will try to come up with something one of these days.

Cheers!

Christopher_Akritid1 · November 5, 2020, 7:57pm

That can work too, but we need to move the existing bug reports from the internal “product” repo, because we can’t make that public.

OdysLam · November 5, 2020, 6:59pm

I think I would prefer a public Netdata Cloud repo, since we are moving to a paradigm where bugs/PRs about the agent are tracked on Github while the rest of the discussion happens on the forum.
What do you think @christopher-akritidis?

Christopher_Akritid1 · November 5, 2020, 6:49pm

Category in our community forum. Ideally we’d have a two way integration with the internal GitHub “product” repository for issues labeled as bugs, but I haven’t looked into whether something like this exists.

OdysLam · November 4, 2020, 5:35pm

Thanks @kasper-lund and @leonidas-vrachnis for the cue, it’s a very interesting topic indeed.

I feel that the best course would be a GitHub repository; it makes sense, as it would be the same platform we use for the Agent. What do you think @manos-saratsis? We will need to set up a common understanding/process so that bugs identified in the Product repo (internal private repo for Cloud) are “cloned” in the public repo and then linked here. This cloning could be done manually, or automatically using a special label “public” and the GitHub API.

Another way would be to use a specific Discourse Category, where each issue would be a topic, but I don’t love it.

What does everybody think?
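
Editor's sketch of the label-based cloning idea above, not an agreed process. It assumes internal issues arrive in the shape the GitHub REST API returns (a dict with `title`, `body`, and a `labels` list of `{"name": ...}` objects) and builds the JSON payload that the "create an issue" endpoint (`POST /repos/{owner}/{repo}/issues`) accepts. The label name `public` comes from the proposal; the sample issue is made up for illustration, and sending the request is left out.

```python
from typing import Any, Dict, Optional

PUBLIC_LABEL = "public"  # label name taken from the proposal above


def build_public_issue(internal_issue: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a payload for POST /repos/{owner}/{repo}/issues.

    Returns None when the internal issue is not labeled as public,
    so that nothing from the private repo is copied by accident.
    """
    labels = [label["name"] for label in internal_issue.get("labels", [])]
    if PUBLIC_LABEL not in labels:
        return None
    return {
        "title": internal_issue["title"],
        "body": internal_issue.get("body") or "",
        # Drop the marker label itself; keep the rest (e.g. "bug").
        "labels": [name for name in labels if name != PUBLIC_LABEL],
    }


# Hypothetical internal issue, for illustration only.
SAMPLE_ISSUE = {
    "title": "Active alarms get stuck in Cloud",
    "body": "Alarms stay raised in Cloud after they clear on the agent.",
    "labels": [{"name": "bug"}, {"name": "public"}],
}

payload = build_public_issue(SAMPLE_ISSUE)
print(payload)
```

Filtering on the label before building anything keeps the default safe: an issue without the label produces no payload at all.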

Leonidas_Vrachnis · November 4, 2020, 5:27pm

@Kasper-Lund said in Monitoring glusterfs:

I have now rebooted the host, as we had a service window for patching this morning. Everything looks to be ok in the cloud view now.

Happy to hear that the mitigation works

Is there any place I can follow current issues like this? I tried the “unsolved” page, but this does not have any topics related to cloud sync issues.

Indeed the cloud team doesn’t maintain a public issue tracking page. We need to figure out the best way to do this. @OdysLam any suggestions?

Kasper_Lund · November 4, 2020, 12:56pm

Hi Leonidas,

Thanks for replying. I have now rebooted the host, as we had a service window for patching this morning. Everything looks to be ok in the cloud view now.
Is there any place I can follow current issues like this? I tried the “unsolved” page, but this does not have any topics related to cloud sync issues.

BR Kasper

Leonidas_Vrachnis · November 4, 2020, 10:47am

@Kasper-Lund We are aware that under some conditions active alarms can get stuck in Cloud. It is not a problem with your local browser cache, but rather with the way we sync agent alarms to our cloud backend.

I would suggest restarting your agent. When an agent disconnects we clear all the active alarms associated with it, so by restarting it you are effectively doing a “hard reset”. If you try this and it doesn’t work please ping us, so we can investigate further.

I am deeply sorry about this issue. We are on it!

OdysLam · November 4, 2020, 10:34am

Thanks @manos-saratsis for replying so swiftly

Manos_Saratsis · November 4, 2020, 10:15am

@Kasper-Lund Really sorry for this. We are aware and working to make the alarms more reliable in Cloud. Thank you for your patience.

Kasper_Lund · November 4, 2020, 9:42am

Hi OdysLam,

I am using Chrome and have tried deleting the cache and doing a “CTRL + F5”, but I still get a lot of alarms on the cloud version.
Also, I have not been accessing the site (or using my browser at all) since last week. So the alarms should have no way of being old cached data.

/Kasper

OdysLam · November 4, 2020, 8:40am

Hey @kasper-lund,

Indeed, we are having a very hard-to-pinpoint bug where cloud and agent become out-of-sync in terms of alarms. This is an intermittent issue, thus we can’t reliably reproduce it, but we are working towards a fix. Please bear with us and thank you for your feedback!

Could you try to empty your cache and do a hard reload?

Kasper_Lund · November 4, 2020, 8:11am

Hi, a short update.
One week later the cloud version has got the gluster mount point included, without me doing anything.
But, now the cloud version has 42 critical alarms, stating that data collection from our postgres DB has not happened for 31 hours.
Looking at the local version, I have no postgres alarms, and everything is OK.
My impression is that the cloud version is very unreliable, and I must admit that seeing 42 critical alarms on our main production database made my heart take an extra beat.
Is there anything I can do to make this more reliable?

BR Kasper
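
Editor's sketch for cross-checking the Cloud view against the local agent. It assumes the agent exposes its raised alarms as JSON at `/api/v1/alarms` on port 19999, and that the response contains an `alarms` object mapping alarm names to entries with a `status` field. The sample payload below is made up for illustration and is not real output.

```python
import json
from collections import Counter
from typing import Any, Dict
from urllib.request import urlopen


def fetch_alarms(base_url: str = "http://localhost:19999") -> Dict[str, Any]:
    """Fetch raised alarms from the local agent (assumed endpoint)."""
    with urlopen(f"{base_url}/api/v1/alarms", timeout=5) as resp:
        return json.load(resp)


def count_by_status(payload: Dict[str, Any]) -> Counter:
    """Count alarms per status in an /api/v1/alarms style payload."""
    alarms = payload.get("alarms", {})
    return Counter(alarm.get("status", "UNKNOWN") for alarm in alarms.values())


# Illustrative payload only; alarm names and fields are assumptions.
SAMPLE_PAYLOAD = {
    "hostname": "hpc-node",
    "alarms": {
        "disk_space_usage": {"status": "WARNING"},
        "postgres_last_collected_secs": {"status": "CRITICAL"},
    },
}

# To check a live agent instead, pass fetch_alarms() here.
counts = count_by_status(SAMPLE_PAYLOAD)
print(dict(counts))
```

If the local agent reports no critical alarms while Cloud shows many, that points at the sync issue discussed in this thread rather than at the database itself.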

Kasper_Lund · October 30, 2020, 10:12am

Hi Again,
I am facing this issue at a customer (I am a consultant), so I will not be able to test before Tuesday next week. I will test when I have the chance, and I will let you know. Have a nice weekend.
/Kasper

OdysLam · October 29, 2020, 4:39pm

Hey Kasper-Lund,

It’s indeed a bug. Can you please try to clear the browser cache and then perform a hard reload? This is a band-aid fix while we work on the underlying issue.

Thanks for your patience and please do tell me if it all works in the end!

OdysLam · October 29, 2020, 9:36am

@ilyam8, just saw your answer. We said the same thing

OdysLam · October 29, 2020, 9:35am

Hey @kasper-lund,

The cloud and agent should be 100% in sync, so this is most probably a bug. I am communicating internally and we will respond shortly.

Thank you for your patience, we will get to the bottom of this in a timely manner

Kasper_Lund · October 29, 2020, 7:27am

Hi again,

Thank you for welcoming me, and thank you for replying to my post.

I took a look at the netdata.conf file, and first of all, everything is commented out, so I assume I am using some sort of default settings. Where can I see the settings I am using?

The exclusion line containing gluster:
# exclude space metrics on filesystems = *gvfs gluster *s3fs *ipfs *davfs2 *httpfs *sshfs *gdfs *moosefs fusectl
is commented out like everything else, so it’s a little confusing to guess what configuration I am actually running on.

I also noticed that the /mnt/gluster section was set to “no” (but still commented) in space usage and inodes usage. I uncommented this and changed “no” to “auto”, and restarted the service.
Now I see the gluster mount point when accessing the local Netdata (ip:19999), but I still don’t see it when accessing the cloud interface. I thought the cloud portal was 100% in sync with the local instance.

Hope you can clear out some things for me.

BR Kasper
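
Editor's sketch on the "which settings am I using" question. As far as I know, a running agent serves its effective configuration at `http://<ip>:19999/netdata.conf`, and a commented-out option in that file means the built-in default shown on that line is in effect. Under that assumption, the small parser below separates explicitly set options from defaults. The section name `[plugin:proc:diskspace:/mnt/gluster]` and the sample text are illustrative and based on the settings described above, not copied from a real host.

```python
from typing import Dict, Tuple

Options = Dict[str, Tuple[str, bool]]  # option -> (value, is_default)


def parse_netdata_conf(text: str) -> Dict[str, Options]:
    """Map each [section] to its options, flagging commented-out ones.

    is_default is True when the option line starts with '#', which is
    assumed to mean the agent is using the default shown on that line.
    """
    sections: Dict[str, Options] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        commented = line.startswith("#")
        if commented:
            line = line.lstrip("#").strip()
        if current is None or "=" not in line:
            continue  # free-form comment, or text before the first section
        key, value = line.split("=", 1)
        sections[current][key.strip()] = (value.strip(), commented)
    return sections


# Illustrative excerpt, modeled on the lines quoted in this post.
SAMPLE = """
[plugin:proc:diskspace]
	# exclude space metrics on filesystems = *gvfs gluster *s3fs *ipfs *davfs2 *httpfs *sshfs *gdfs *moosefs fusectl

[plugin:proc:diskspace:/mnt/gluster]
	space usage = auto
	# inodes usage = no
"""

for section, options in parse_netdata_conf(SAMPLE).items():
    for key, (value, is_default) in options.items():
        state = "default" if is_default else "set explicitly"
        print(f"[{section}] {key} = {value} ({state})")
```

The parser is deliberately naive: a free-form comment inside a section that happens to contain `=` would be read as an option, so treat the result as a quick overview rather than an exact reading of the file.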

OdysLam · October 28, 2020, 2:33pm

Hi @kasper-lund,

Our documentation at diskspace.plugin | Learn Netdata appears to mention gluster. Did you try it, and did it not work?

BTW, welcome to our community!
