SMART Reliability Monitoring
SMART (Self-Monitoring, Analysis and Reporting Technology)…
- …for HDD, SSD, eMMC devices
- Monitor drive reliability and performance counters
- Anticipate imminent hardware failures…
- …notify the user so preventive action can be taken
- Failing drive can be replaced and data integrity maintained
References…
Package smartmontools on RPM distributions…
smartctl
smartctl ... <device> used to address a storage device…
- device…derived from the path to the device node
- /dev/sd[a-z] for SATA
- /dev/nvme[0-9] NVMe (broadcast), /dev/nvme[0-9]n[1-9] (specific namespace)
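The naming scheme above can be sketched as a small shell function. This is only an illustration: `device_kind` is a hypothetical helper, not part of smartmontools, and it only knows the three path patterns listed in these notes.

```shell
#!/bin/sh
# device_kind is a hypothetical helper (not part of smartmontools).
# It classifies a device node path using the patterns from the notes above.
device_kind() {
    case "$1" in
        /dev/sd[a-z])          echo "sata" ;;            # e.g. /dev/sda
        /dev/nvme[0-9]n[1-9])  echo "nvme-namespace" ;;  # e.g. /dev/nvme0n1
        /dev/nvme[0-9])        echo "nvme-broadcast" ;;  # e.g. /dev/nvme0
        *)                     echo "unknown" ;;
    esac
}
```

For example, `device_kind /dev/nvme0n1` prints `nvme-namespace`. Device nodes behind a RAID controller such as /dev/bus/0 fall through to `unknown`.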
Find storage devices…
# smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d sat+megaraid,2 # /dev/bus/0 [megaraid_disk_02] [SAT], ATA device
/dev/bus/0 -d sat+megaraid,3 # /dev/bus/0 [megaraid_disk_03] [SAT], ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
--scan…device name, device type and protocol
--scan-open
- …open each device before printing device info
- …used to create a draft smartd.conf file
- …options after -- are appended to each output line
smartctl --scan-open -- $options > /etc/smartmontools/smartd.conf
RAID Controllers
Accessing a device node provided by a RAID controller will show…
SMART support is: Unavailable - device lacks SMART capability.
Individual devices in a RAID need to be addressed individually…
- Option -d, --device=type of the device…
  - megaraid,N…N denotes which disk on the controller is monitored
  - 3ware,N…N denotes which disk on the controller is monitored
Example for a MegaRAID controller…
# device on a MegaRAID LSI controller...
smartctl -a -d megaraid,2 /dev/bus/0
# ...in case for a SATA device...
smartctl -a -d sat+megaraid,2 /dev/sda
Exit Bitmask
- Exit status defined by a bitmask…
- Return value 0…all bits turned off
- Non-zero status indicates an error
- Bits…have the following meanings
  - 0…command line did not parse
  - 1…device open failed
  - 2…command to the device failed
  - 3…returned disk failing
  - 4…found prefail attributes <= threshold
  - 5…returned disk ok…but attributes have been <= threshold at some time in the past
  - 6…error log contains records of errors
  - 7…self-test log contains records of errors
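The bit meanings listed above can be turned into a readable report with a few lines of shell. This is a minimal sketch: `decode_smartctl_status` is a hypothetical helper, and the message texts are copied from these notes rather than from smartctl itself.

```shell
#!/bin/sh
# decode_smartctl_status is a hypothetical helper (not part of smartmontools).
# $1: numeric exit status of a smartctl call; prints one line per set bit.
decode_smartctl_status() {
    status=$1
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "command to the device failed" \
        "returned disk failing" \
        "found prefail attributes <= threshold" \
        "returned disk ok but prefail attributes <= threshold" \
        "error log contains records of errors" \
        "self-test log contains records of errors"
    do
        # shift the status right by the bit position and test the lowest bit
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}
```

A typical call would be `smartctl -q silent -a "$device"; decode_smartctl_status $?`. A status of 0 prints nothing.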
# looks only at bit 3 of the exit status $?
smartctl -q silent -a $device ; echo $(($? & 8))
# prints all status bits
smartctl -q silent -a $device
val=$? ; mask=1   # capture the exit status before it is overwritten
for i in 0 1 2 3 4 5 6 7; do
  echo "Bit $i: $(( (val & mask) && 1 ))"
  mask=$((mask << 1))
done
Attributes
All SMART attributes -a, --all print all available information…or…
- -i, --info…device model number, serial number, firmware version
- -c, --capabilities…generic SMART capabilities
Overall Health
Health status of the device…option -H, --health
- PASSED…drive is in good health
- FAILED…
  - …device has already failed…
  - or…drive failure is imminent…data should be backed up
# output with warning...
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
Warning: This result is based on an Attribute check…
- …unknown ATA command return status…
- …due to bridge devices…RAID controller or USB bridge firmware
- …state determined by pre-failure SMART attributes
- …considered less reliable
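A script that consumes the health status may want to tell a plain result apart from one that carries the attribute-check warning. The sketch below assumes the ATA output format shown above. `health_result` is a hypothetical helper that reads `smartctl -H` output on stdin.

```shell
#!/bin/sh
# health_result is a hypothetical helper (not part of smartmontools).
# Reads `smartctl -H` output on stdin and prints the overall result.
# Appends "(attribute-based)" when the less reliable warning is present.
health_result() {
    awk '
        /self-assessment test result:/ { result = $NF }
        /based on an Attribute check/  { weak = 1 }
        END {
            if (result == "") { print "UNKNOWN"; exit }
            print result (weak ? " (attribute-based)" : "")
        }'
}
```

Usage would look like `smartctl -H "$device" | health_result`. Output that lacks the result line is reported as `UNKNOWN`.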
Pre-failure
Option -A, --attributes…
# example from a SATA SSD...
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3460
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1148
...
RAW_VALUE
- …might have a real physical interpretation (such as temperature)
- …meaning of these attribute fields has been made entirely vendor-specific
- …manufacturer converts these to normalized VALUE (between 1-254)
Columns…
- VALUE…current value of the attribute
- WORST…worst (typically lowest) value SMART has ever seen
- THRESH…vendor's lowest possible value considered as healthy
- WHEN_FAILED…
  - -…no entry…attribute is not failing
  - FAILING_NOW…normalized VALUE less than or equal to THRESH…
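The FAILING_NOW rule (normalized VALUE less than or equal to THRESH) can be checked directly against the attribute table. This is a sketch: `failing_attributes` is a hypothetical helper that reads `smartctl -A` output on stdin. Skipping rows with a THRESH of 0 is an assumption made here, treating 0 as an "always passing" threshold.

```shell
#!/bin/sh
# failing_attributes is a hypothetical helper (not part of smartmontools).
# Reads `smartctl -A` output on stdin and prints the attributes whose
# normalized VALUE (column 4) is <= THRESH (column 6).
failing_attributes() {
    awk '
        # attribute rows start with a numeric ID and have at least 10 columns
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            # assumption: a threshold of 0 never fails, so skip those rows
            if ($6 + 0 > 0 && $4 + 0 <= $6 + 0)
                printf "%s %s value=%d thresh=%d\n", $1, $2, $4, $6
        }'
}
```

Usage would look like `smartctl -A "$device" | failing_attributes`. With the sample table above nothing is printed, since every THRESH is 000.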
smartd
smartd monitors SMART statuses and emits notifications
cat /etc/sysconfig/smartmontools # service environment
systemctl cat smartd.service # systemd service unit
journalctl -u smartd # log information
Configuration
smartd -D list all configuration directives…
Text string in capital letters…
- DEVICESCAN…
  - …ignore any remaining lines in the configuration
  - …specific devices may precede the DEVICESCAN entry
- DEFAULT…set as defaults for the next device entries
Simple example for a configuration file…default in /etc/smartmontools/smartd.conf
DEFAULT -m root@example.com
/dev/sda -s S/../.././02
/dev/sdc -d ignore
DEVICESCAN -s L/../.././02
-s <regex>…device test configuration
- …extended regular expression
- T/MM/DD/d/HH…
  - . matches any single character
  - T set to self-test, L long or S short…
  - MM month, DD day, d day of the week, HH hour
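To see when a schedule expression would fire, the regex can be tried against a T/MM/DD/d/HH string by hand. This is a sketch, not how smartd is implemented: `schedule_matches` is a hypothetical helper, the whole-string match via `grep -x` is an assumption, and the day of the week is taken from `date +%u` (1 for Monday to 7 for Sunday).

```shell
#!/bin/sh
# schedule_matches is a hypothetical helper (not part of smartd).
# $1: regex from the -s directive
# $2: test type (L or S)
# $3: optional MM/DD/d/HH stamp, defaults to the current time
# Returns 0 if the regex matches the whole T/MM/DD/d/HH string.
schedule_matches() {
    regex=$1
    type=$2
    stamp=${3:-$(date +%m/%d/%u/%H)}
    printf '%s\n' "$type/$stamp" | grep -Eqx -- "$regex"
}
```

For example, `schedule_matches 'S/../.././02' S 03/14/4/02` succeeds because every `.` matches one character and the hour is 02, while the same call with hour 03 fails.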
Notification Mail
Todo…
-m <address>…send a warning email…
- -M once…(default) send only one warning email for each type of disk problem
- -M exec <script>…run the executable instead of the default mail command
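A script passed to -M exec receives details about the problem through SMARTD_* environment variables. The sketch below writes a one-line notice to syslog instead of sending mail. `format_notice` and the `smartd-notify` tag are hypothetical names, and only SMARTD_FAILTYPE, SMARTD_MESSAGE and SMARTD_DEVICE are read.

```shell
#!/bin/sh
# Hypothetical script for `-M exec <script>` (not shipped with smartmontools).
# format_notice builds a one-line summary from the SMARTD_* variables
# that smartd exports to the script.
format_notice() {
    printf '%s: %s on %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_MESSAGE:-no message}" \
        "${SMARTD_DEVICE:-unknown device}"
}

# Only act when called by smartd (SMARTD_FAILTYPE is set) and logger exists,
# so that sourcing this file for a test has no side effects.
if [ -n "${SMARTD_FAILTYPE:-}" ] && command -v logger >/dev/null 2>&1; then
    format_notice | logger -t smartd-notify
fi
```

Such a script would be referenced in a smartd.conf entry that combines -m with -M exec.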
Device Temperature
Option -W <diff>[,<info>[,<crit>]]…
- …report if the current temperature has changed by at least diff degrees (Celsius) since last report
- …warn if the temperature is greater than or equal to one of info or crit
- If limit crit is reached…log message…send a warning mail if specified
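The three limits can be read as a simple decision, sketched below. `temp_report` is a hypothetical helper that mirrors the rules described above. Treating a value of 0 as "rule disabled" and checking crit before info before diff are assumptions of this sketch, not confirmed smartd behaviour.

```shell
#!/bin/sh
# temp_report is a hypothetical helper (not part of smartd).
# $1: last reported temperature, $2: current temperature (Celsius)
# $3: diff, $4: info limit, $5: crit limit (0 disables a rule, assumed)
temp_report() {
    last=$1; cur=$2; diff=$3; info=$4; crit=$5
    # absolute change since the last report
    delta=$((cur - last))
    [ "$delta" -lt 0 ] && delta=$((-delta))
    if [ "$crit" -gt 0 ] && [ "$cur" -ge "$crit" ]; then
        echo "critical"   # log message, warning mail if configured
    elif [ "$info" -gt 0 ] && [ "$cur" -ge "$info" ]; then
        echo "info"       # informational log message
    elif [ "$diff" -gt 0 ] && [ "$delta" -ge "$diff" ]; then
        echo "changed"    # temperature change report
    else
        echo "none"
    fi
}
```

With limits 2,40,45, a reading of 41 after 35 yields `info`, and a reading of 46 yields `critical`.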
-W 2 # track temperature changes of at least 2 degrees
-W 0,40 # log informational messages on temperatures of at least 40 degrees
-W 0,0,45 # warning messages/mails on temperatures of at least 45 degrees
-W 2,40,45 # combine all of the above reports
Debug
Simple configuration test…
- -c - …read configuration from stdin
- -q onecheck…
  - …smartd in debug mode…
  - …register devices…
  - …check device’s SMART status once…
  - …exit 0 if correctly executed
# configuration via stdin...execute once
echo '/dev/sda -d megaraid,2' | smartd -c - -q onecheck
Debug mode… smartd -d
- -d, --debug…displays verbose status information to standard out
- ctrl-c to reload the configuration file
- ctrl-\ to shutdown…
- Include information on…
- All SMART capable devices…supporting health checks…
- Devices which cannot be registered…