Critical mdstat_disks alert on 4-bay Synology NAS with 3 drives (SHR1)

pk1966 · October 15, 2022, 9:23am

I have a Synology DS920+ 4-bay NAS, populated with 3 HDDs and configured using SHR1 (Synology’s Hybrid RAID with 1 disk redundancy).

Netdata is running in a Docker container and is raising a critical alert, as shown in the attached screenshot. It seems to be treating the empty bay as ‘Down’.

I’ve read the explanation of how this is worked out and the basic calculation is correct - my mdstat info shows

md0 : active raid1 sata3p1[0] sata2p1[2] sata1p1[1]
      2490176 blocks [4/3] [UUU_]

But everything is actually fine - is there a way of configuring Netdata to not treat this as an error?

Also, I think the colours seem to be the wrong way round with red being ‘inuse’ and green being ‘down’

Thanks
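For readers following along, here is a minimal Python sketch of the calculation described above, applied to the status line quoted in this post. It assumes the down count is simply the expected device count minus the in-use count from the `[4/3]` field; the function name `parse_md_status` and the returned keys are made up for illustration and are not part of Netdata.

```python
import re

# Matches the "[total/in-use] [UUU_]" part of an mdstat status line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_md_status(line: str) -> dict:
    """Extract device counts from a line such as '... blocks [4/3] [UUU_]'."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError(f"no md status found in: {line!r}")
    total = int(match.group(1))   # devices the array expects
    inuse = int(match.group(2))   # devices currently active
    flags = match.group(3)        # 'U' = up, '_' = missing or failed
    # Assumption: "down" is the difference between expected and active devices.
    return {"total": total, "inuse": inuse, "down": total - inuse, "flags": flags}


status = parse_md_status("      2490176 blocks [4/3] [UUU_]")
print(status)
print("alert with crit '$this > 0':", status["down"] > 0)
print("alert with crit '$this > 1':", status["down"] > 1)
```

With one empty bay the down count is 1, so a threshold of `> 0` fires while a threshold of `> 1` would not.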

pk1966 · October 15, 2022, 9:24am

I should also add that I have run extended SMART tests on all 3 drives and they are healthy. Storage Manager also shows the RAID as healthy.

Christopher_Akritid1 · October 16, 2022, 5:47pm

It’s really about getting mdstat to stop showing a missing/failed disk, not about changing something in Netdata.

A shot in the dark here, because I have no previous experience on this and just looked up a couple of things online: It looks like you can remove a disk from an array (see section 3). I can’t figure out how you could even know how to remove something that doesn’t exist/isn’t listed, but it sounds like these disks have predictable identifiers, so in your case it would either be called sata0p1 or sata4p1?

ilyam8 · October 16, 2022, 6:08pm

Hi, @pk1966. The fix can be

excluding your raid device from the default alarm.
creating a custom alarm for your raid device (trigger if down > 1).

To do that you need to edit the “health.d/md.conf” file: copy the “mdstat_disks” template and add a charts filter to both copies:

 template: mdstat_disks
       on: md.disks
    class: Errors
     type: System
component: RAID
   charts: !*md0* *
    units: failed devices
    every: 10s
     calc: $down
     crit: $this > 0
     info: number of devices in the down state for the $family array. \
           Any number > 0 indicates that the array is degraded.
       to: sysadmin

 template: mdstat_disks_md0
       on: md.disks
    class: Errors
     type: System
component: RAID
   charts: *md0* !*
    units: failed devices
    every: 10s
     calc: $down
     crit: $this > 1
     info: number of devices in the down state for the $family array. \
           Any number > 0 indicates that the array is degraded.
       to: sysadmin
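To see why the two `charts:` filters split the arrays between the two templates, here is a rough Python sketch of how such a pattern list can be evaluated. It assumes patterns are space-separated, tried left to right, that the first match decides, and that a leading `!` makes the match negative; `fnmatchcase` only approximates the wildcard handling. The chart names are hypothetical and the helper `simple_pattern_match` is not a Netdata function.

```python
from fnmatch import fnmatchcase


def simple_pattern_match(pattern_list: str, name: str) -> bool:
    """Approximate a first-match-wins pattern list with '!' negation."""
    for pattern in pattern_list.split():
        negative = pattern.startswith("!")
        glob = pattern[1:] if negative else pattern
        if fnmatchcase(name, glob):
            # The first pattern that matches decides the outcome.
            return not negative
    # Nothing matched: treat the name as excluded.
    return False


# Hypothetical chart names, for illustration only.
charts = ["md.disks_md0", "md.disks_md1", "md.disks_md2"]

default_alarm = [c for c in charts if simple_pattern_match("!*md0* *", c)]
custom_alarm = [c for c in charts if simple_pattern_match("*md0* !*", c)]

print("mdstat_disks applies to:", default_alarm)
print("mdstat_disks_md0 applies to:", custom_alarm)
```

Under these assumptions the default template skips md0 and covers every other array, while the custom template covers only md0.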

pk1966 · October 16, 2022, 6:34pm

Thanks for the suggestions - because it’s using SHR-1 rather than pure RAID, I don’t want to mess around with mdadm unless there is a specific Synology guide (which I haven’t been able to find).

I’ll have a play with adjusting the alert trigger level as suggested.

Thanks

Christopher_Akritid1 · October 17, 2022, 8:04pm

I asked on Reddit on your behalf @pk1966 and some of the replies are very scary. A couple of people are saying that you might lose data if you leave it like this. Again, I personally don’t know enough about this to judge, but maybe check the replies out here

ilyam8 · October 17, 2022, 8:19pm

My understanding is that it is not a big problem:

it is raid1 (mirroring).
the number of active devices is expected to be 4 (--raid-devices) - 4-bay NAS.
the actual number of devices (in the bay) is 3.
no worries about data loss because of the raid mode (mirroring), as long as at least 1 device is alive.

The default alarm is correct because one device is indeed missing - should be 4.

pk1966 · October 17, 2022, 8:24pm

Thanks for this. I’m going to raise a ticket with Synology support to get their confirmation.

It’s my understanding that it is a variant of RAID5.

Thanks for your help - I’ll come back and let you know what they say

pk1966 · October 18, 2022, 10:02am

I got confirmation from Synology that it is OK. md0 and md1 are system partitions which are mirrored across all available disks - Synology support said “this is not a cause for concern” and included an example of a single-disk NAS which shows [4/1]

So I just need to adjust the alert trigger.

Thanks for your help

cyklee · June 21, 2023, 3:43am

Hi all, sorry for raising an old topic, but this is about critical alerts from a mystery RAID device. I have just encountered something similar with a QNAP NAS. In my case, these devices are swap space and/or something else managed by the QTS OS.

milindpatel63 · April 3, 2024, 10:16am

instead of

   charts: !*md0* *

we now have to use

chart labels: device=!*md0* *

in the latest netdata version.
