zfs_pool_state_warn
OS: Any
ZFS is a local file system and logical volume manager created by Sun Microsystems Inc. to direct and control the placement, storage, and retrieval of data in enterprise-class computing systems. ZFS is scalable, suitable for high storage capacities, and includes extensive protection against data corruption.
The Netdata Agent monitors the state of the ZFS pool. Receiving this alert means that the ZFS pool is degraded.
See more on ZFS pool health status
The ZFS pool health status as described in the Oracle’s
website 1
ZFS provides an integrated method of examining pool and device health. The health of a pool is determined from the state of all its devices. This state information is displayed by using the zpool status command. In addition, potential pool and device failures are reported by fmd
, displayed on the system console, and logged in the /var/adm/messages
file.
Each device can fall into one of the following states:
-
ONLINE, the device or virtual device is in normal working order. Although some transient errors
might still occur, the device is otherwise in working order. -
DEGRADED, the virtual device has experienced a failure but can still function. This state is most
common when a mirror or RAID-Z device has lost one or more constituent devices. The fault
tolerance of the pool might be compromised, as a subsequent fault in another device might be
unrecoverable. -
FAULTED, the device or virtual device is completely inaccessible. This status typically indicates
total failure of the device, such that ZFS is incapable of sending data to it or receiving data
from it. If a top-level virtual device is in this state, then the pool is completely inaccessible. -
OFFLINE, the device has been explicitly taken offline by the administrator.
-
UNAVAIL, the device or virtual device cannot be opened. In some cases, pools with UNAVAIL devices
appear in DEGRADED mode. If a top-level virtual device is UNAVAIL, then nothing in the pool can be
accessed. -
REMOVED, the device was physically removed while the system was running. Device removal detection
is hardware-dependent and might not be supported on all platforms.
The health of a pool is determined from the health of all its top-level virtual devices. If all virtual devices are ONLINE, then the pool is also ONLINE. If any one of the virtual devices is DEGRADED or UNAVAIL, then the pool is also DEGRADED. If a top-level virtual device is FAULTED or OFFLINE, then the pool is also FAULTED. A pool in the FAULTED state is completely inaccessible. No data can be recovered until the necessary devices are attached or repaired. A pool in the DEGRADED state continues to run, but you might not achieve the same level of data redundancy or data throughput than if the pool were online. 1
References and source
Troubleshooting section
Replace a failed drive
Data is priceless. Before you perform any action, make sure that you have taken any necessary backup steps. Netdata is not liable for any loss or corruption of any data, database, or software.
Based on a nice guide from the AlchemyCS blogspot.
- Check your zpool status
root@netdata # zpool status my_pool
pool: my_pool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0B in 0h0m with 0 errors on Sun May 12 00:24:51 2019
config:
NAME STATE READ WRITE CKSUM
my_pool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
52009894889112747750 UNAVAIL 0 0 0 was /dev/sdm5
sdb5 ONLINE 0 0 0errors: No known data errors
-
Find the UUIDs (for GPT) of the faulty dev (the UNAVAIL) and the new disk you want to add.
Useblkid
for Linux orgeom
utility for FreeBSD -
Offline the UNAVAIL drive
root@netdata # zpool offline my_pool /dev/disk/by-uuid/{UUID_BAD_DRIVE}
- Replace it in place
root@netdata # zpool replace -f my_pool /dev/disk/by-uuid/{UUID_OLD_DRIVE} /dev/disk/by-uuid/{UUID_NEW_DRIVE}