Critical mdstat_disks alert on 4-bay Synology NAS with 3 drives (SHR1)

I have a Synology DS920+ 4-bay NAS, populated with 3 HDDs and configured using SHR1 (Synology’s Hybrid RAID with 1 disk redundancy).

Netdata is running in a Docker container and is raising a critical alert, as shown in the attached screenshot. It seems to be treating the empty bay as ‘down’.

I’ve read the explanation of how this alert is calculated, and the basic calculation is correct. My mdstat info shows:

md0 : active raid1 sata3p1[0] sata2p1[2] sata1p1[1]
      2490176 blocks [4/3] [UUU_]

But everything is actually fine. Is there a way of configuring Netdata not to treat this as an error?

Also, the colours seem to be the wrong way round, with red being ‘inuse’ and green being ‘down’.

Thanks

I should also add that I have run extended SMART tests on all 3 drives and they are healthy. Storage Manager also shows the RAID as healthy.
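
For anyone wanting to double-check from the command line, something along these lines should work over SSH (a rough sketch; /dev/sata1 is a guess at the DSM device naming, and on other setups the drives appear as /dev/sda and so on, so adjust to what your system actually shows):

smartctl -a /dev/sata1        # dump the SMART attributes and self-test log
smartctl -t long /dev/sata1   # start an extended (long) self-test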

It’s really about getting mdstat to stop showing a missing/failed disk, rather than about doing something in Netdata.

A shot in the dark here, because I have no previous experience with this and just looked up a couple of things online: it looks like you can remove a disk from an array (see section 3). I can’t figure out how you would even refer to something that doesn’t exist/isn’t listed, but it sounds like these disks have predictable identifiers, so in your case it would presumably be called either sata0p1 or sata4p1?
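
Purely for reference, and very much a hedged sketch rather than a recommendation (I would not run this on a Synology system partition like md0 without guidance from support): since there is nothing to --remove for a bay that was never populated, the usual mdadm way to make an array expect fewer members is to shrink the device count, along these lines:

mdadm --detail /dev/md0                  # confirm the current layout first
mdadm --grow /dev/md0 --raid-devices=3   # tell the raid1 array to expect only 3 members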

Hi, @pk1966. The fix can be:

  • excluding your raid device from the default alarm.
  • creating a custom alarm for your raid device (trigger if down > 1).

To do that, you need to edit the “health.d/md.conf” file: copy/paste the “mdstat_disks” template and add a charts filter to both copies:

 template: mdstat_disks
       on: md.disks
    class: Errors
     type: System
component: RAID
   charts: !*md0* *
    units: failed devices
    every: 10s
     calc: $down
     crit: $this > 0
     info: number of devices in the down state for the $family array. \
           Any number > 0 indicates that the array is degraded.
       to: sysadmin

 template: mdstat_disks_md0
       on: md.disks
    class: Errors
     type: System
component: RAID
   charts: *md0* !*
    units: failed devices
    every: 10s
     calc: $down
     crit: $this > 1
     info: number of devices in the down state for the $family array. \
           Any number > 0 indicates that the array is degraded.
       to: sysadmin
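
To apply this, the usual steps are roughly as follows (paths assume a standard install; with Netdata running in Docker the config directory is normally mounted into the container, so adjust the paths to your setup):

cd /etc/netdata
sudo ./edit-config health.d/md.conf    # copies the stock file into the user config dir and opens it
sudo netdatacli reload-health          # reload health configuration without restarting the agent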

Thanks for the suggestions - because it’s using SHR-1 rather than pure RAID, I don’t want to mess around with mdadm unless there is a specific Synology guide (which I haven’t been able to find).

I’ll have a play with adjusting the alert trigger level as suggested.

Thanks

I asked on Reddit on your behalf @pk1966 and some of the replies are very scary. A couple of people are saying that you might lose data if you leave it like this. Again, I personally don’t know enough about this to judge, but maybe check the replies out here

My understanding is that it is not a big problem:

  • it is raid1 (mirroring).
  • the number of active devices is expected to be 4 (--raid-devices), since it is a 4-bay NAS.
  • the actual number of devices (in the bay) is 3.
  • no worries about data loss because of the raid mode (mirroring), as long as at least 1 device is alive.

The default alarm is correct because one device is indeed missing: there should be 4.
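
To put that another way, here is my reading of the two bracketed fields in the mdstat line from the original post (my annotation, not anything Netdata prints):

[4/3]   ->  4 members expected in the array, 3 actually active
[UUU_]  ->  "U" = a member that is up, "_" = a missing/down slot (the empty bay)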

Thanks for this. I’m going to raise a ticket with Synology support to get their confirmation.

It’s my understanding that it is a variant of RAID5.

Thanks for your help - I’ll come back and let you know what they say

I got confirmation from Synology that it is OK. md0 and md1 are system partitions which are mirrored across all available disks. Synology support said “this is not a cause for concern” and included an example of a single-disk NAS which shows [4/1].

So I just need to adjust the alert trigger.

Thanks for your help

Hi all, sorry for raising an old topic, but regarding critical alerts from a mystery raid device: I have just encountered something similar with a QNAP NAS. In my case, the arrays are swap space and/or something else managed by the QTS OS.

Instead of

   charts: !*md0* *

we now have to use

chart labels: device=!*md0* *

in the latest Netdata version.
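
So, keeping the rest of the template from earlier in the thread unchanged, the md0-excluding alarm would look roughly like this on current Netdata versions (a sketch based on the selector above, not verified against every release):

 template: mdstat_disks
       on: md.disks
    class: Errors
     type: System
component: RAID
chart labels: device=!*md0* *
    units: failed devices
    every: 10s
     calc: $down
     crit: $this > 0
     info: number of devices in the down state for the $family array. \
           Any number > 0 indicates that the array is degraded.
       to: sysadmin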