Netdata Community

Monitoring glusterfs

Hi all,

I’m using Netdata to monitor an HPC setup that uses Gluster as the storage backend. It seems that Netdata is not able to monitor Gluster mount points. Hopefully I have misunderstood this?

BR Kasper

Yes absolutely,

I will try to come up with something one of these days.

Cheers!

That can work too, but we need to move the existing bug reports from the internal “product” repo, because we can’t make that public.

I think I would prefer a public Netdata Cloud repo, since we are moving to a paradigm where bugs/PRs about the agent are tracked on GitHub while the rest of the discussion happens on the forum.
What do you think @christopher-akritidis?

Category in our community forum. Ideally we’d have a two way integration with the internal GitHub “product” repository for issues labeled as bugs, but I haven’t looked into whether something like this exists.

Thanks @kasper-lund and @leonidas-vrachnis for the cue, it’s a very interesting topic indeed.

I feel that the best course would be a GitHub repository. It makes sense, as it would be the same platform we use for the Agent. What do you think @manos-saratsis? We will need to set up a common understanding/process, so that bugs identified in the Product repo (the internal private repo for Cloud) are “cloned” into the public repo and then linked here. This cloning could be done manually, or automatically using a special “public” label and the GitHub API.
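As a rough illustration of the automatic option, here is a minimal Python sketch that lists open issues carrying the “public” label in a private repo and opens a copy of each in a public one through the GitHub REST API. The repo names, the mirrored-issue footer, and the choice to tag copies with a “bug” label are assumptions made for the example, not an agreed process. It does not handle pagination or avoid creating duplicates on repeated runs.

```python
import json
import urllib.request

API = "https://api.github.com"


def build_public_issue(private_issue):
    """Build the payload for the public copy of a private issue."""
    body = private_issue.get("body") or ""
    return {
        "title": private_issue["title"],
        # Assumption: a short footer marks the issue as a mirror.
        "body": body + "\n\n(Mirrored from the internal tracker.)",
        "labels": ["bug"],
    }


def github_request(method, url, token, payload=None):
    """Send one authenticated request to the GitHub REST API."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def mirror_public_issues(private_repo, public_repo, token):
    """Copy open issues labeled 'public' from private_repo to public_repo."""
    listing_url = f"{API}/repos/{private_repo}/issues?labels=public&state=open"
    created = []
    for issue in github_request("GET", listing_url, token):
        if "pull_request" in issue:
            # The issues endpoint also returns pull requests; skip them.
            continue
        created.append(
            github_request(
                "POST",
                f"{API}/repos/{public_repo}/issues",
                token,
                build_public_issue(issue),
            )
        )
    return created
```

A call would look like `mirror_public_issues("example-org/product", "example-org/cloud-issues", token)`, where both repo names are hypothetical. Since the body is copied verbatim, anything internal in the original issue would become public, so a manual review step may still be wanted.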

Another way would be to use a specific Discourse Category, and each issue will be a topic, but I don’t love it.

What does everybody think?

@Kasper-Lund said in Monitoring glusterfs:

I have now rebooted the host, as we had a service window for patching this morning. Everything looks to be ok in the cloud view now.

Happy to hear that the mitigation works

Is there any place I can follow current issues like this? I tried the “unsolved” page, but this does not have any topics related to cloud sync issues.

Indeed the cloud team doesn’t maintain a public issue tracking page. We need to figure out the best way to do this. @OdysLam any suggestions?

Hi Leonidas,

Thanks for replying. I have now rebooted the host, as we had a service window for patching this morning. Everything looks to be ok in the cloud view now.
Is there any place I can follow current issues like this? I tried the “unsolved” page, but this does not have any topics related to cloud sync issues.

BR Kasper

@Kasper-Lund We are aware that under some conditions active alarms can get stuck in Cloud. It is not a problem with your local browser cache, but rather with the way we sync agent alarms to our cloud backend.

I would suggest restarting your agent. When an agent disconnects, we clear all the active alarms associated with it, so restarting it is effectively a “hard reset”. If you try this and it doesn’t work, please ping us so we can investigate further.

I am deeply sorry about this issue. We are on it!

Thanks @manos-saratsis for replying so swiftly :slight_smile:

@Kasper-Lund Really sorry for this. We are aware and working to make the alarms more reliable in Cloud. Thank you for your patience.

Hi OdysLam,

I am using Chrome and have tried deleting the cache and doing a “CTRL + F5”, but I still get a lot of alarms on the cloud version.
Also, I have not been accessing the site (or used my browser at all) since last week. So the alarms should have no way of being old cached data.

/Kasper

Hey @kasper-lund,

Indeed, we have a very hard-to-pinpoint bug where the cloud and the agent get out of sync in terms of alarms. It is intermittent, so we can’t reliably reproduce it, but we are working towards a fix. Please bear with us and thank you for your feedback!

Could you try to empty your cache and do a hard reload?

Hi, a short update.
One week later the cloud version has got the gluster mount point included, without me doing anything.
But, now the cloud version has 42 critical alarms, stating that data collection from our postgres DB has not happened for 31 hours.
Looking at the local version, I have no postgres alarms, and everything is OK.
My impression is that the cloud version is very unreliable, and I must admit that seeing 42 critical alarms on our main production database made my heart take an extra beat.
Is there anything I can do to make this more reliable?

BR Kasper

Hi Again,
I am facing this issue at a customer (I am a consultant), so I will not be able to test before Tuesday next week. I will test when I have the chance, and I will let you know. Have a nice weekend.
/Kasper

Hey Kasper-Lund,

It’s indeed a bug. Can you please try to clear the browser cache and then perform a hard reload? This is a band-aid fix while we work on the underlying issue.

Thanks for your patience and please do tell me if it all works in the end!

@ilyam8, just saw your answer. We said the same thing :joy:

Hey @kasper-lund,

The cloud and agent should be 100% in sync, so this is most probably a bug. I am communicating internally and we will respond shortly.

Thank you for your patience, we will get to the bottom of this in a timely manner :muscle:

Hi again,

Thank you for welcoming me, and thank you for replying to my post.

I took a look at the netdata.conf file. First of all, everything is commented out, so I assume I am using some sort of default settings. Where can I see the settings I am actually using?

The exclusion line containing gluster: # exclude space metrics on filesystems = *gvfs *gluster* *s3fs *ipfs *davfs2 *httpfs *sshfs *gdfs *moosefs fusectl
is commented out like everything else, so it’s a little confusing to guess what configuration I am actually running on.

I also noticed that the /mnt/gluster section was set to “no” (but still commented) in space usage and inodes usage. I uncommented this and changed “no” to “auto”, and restarted the service.
Now I see the gluster mount point when accessing the local Netdata (ip:19999), but I still don’t see it when accessing the cloud interface. I thought the cloud portal was 100% in sync with the local instance.

Hope you can clear out some things for me.

BR Kasper
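On the question of which settings are in effect: in a generated netdata.conf, a commented-out option shows the default that applies, and an uncommented one is an explicit override. The agent also serves its running configuration at http://localhost:19999/netdata.conf. The Python sketch below parses such a file and reports, per option, whether the value is a default or explicitly set. The section names and sample values are assumptions based on the description in the post above, not a copy of the actual file.

```python
def parse_netdata_conf(text):
    """Return {section: {option: (value, is_explicit)}} for netdata.conf text.

    Commented-out options are kept and flagged as defaults (is_explicit=False).
    """
    settings = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        explicit = not line.startswith("#")
        line = line.lstrip("#").strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            settings.setdefault(section, {})
            continue
        if section is None or "=" not in line:
            continue  # free-text comment, or an option outside any section
        key, _, value = line.partition("=")
        settings[section][key.strip()] = (value.strip(), explicit)
    return settings


# Hypothetical excerpt, modeled on the settings described in this thread.
SAMPLE = """\
[plugin:proc:diskspace]
    # exclude space metrics on filesystems = *gvfs *gluster* *s3fs

[plugin:proc:diskspace:/mnt/gluster]
    space usage = auto
    # inodes usage = no
"""

conf = parse_netdata_conf(SAMPLE)
for option, (value, explicit) in conf["plugin:proc:diskspace:/mnt/gluster"].items():
    origin = "set explicitly" if explicit else "default"
    print(f"{option} = {value} ({origin})")
```

In this sample, space usage for /mnt/gluster is an explicit override, while inodes usage is still the commented-out default, so only the first change would take effect after a restart.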

Hi @kasper-lund,

Our documentation at https://learn.netdata.cloud/docs/agent/collectors/diskspace.plugin appears to mention gluster. Did you try it, and did it not work?

BTW, welcome to our community!