SMART Reliability Monitoring
SMART (Self-Monitoring, Analysis and Reporting Technology)…
- …for HDD, SSD, eMMC devices
- Monitor drive reliability and performance counters
- Anticipating imminent hardware failures…
- …notify the user so preventive action can be taken
- Failing drive can be replaced and data integrity maintained
References…
Package smartmontools
on RPM distributions…
smartctl
smartctl ... <device> used to address a storage device…
device derived from the path to the device node
- /dev/sd[a-z] for SATA
- /dev/nvme[0-9] for NVMe (broadcast)
- /dev/nvme[0-9]n[1-9] for NVMe (specific namespace)
Find storage devices…
# smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d sat+megaraid,2 # /dev/bus/0 [megaraid_disk_02] [SAT], ATA device
/dev/bus/0 -d sat+megaraid,3 # /dev/bus/0 [megaraid_disk_03] [SAT], ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
--scan
…device name, device type and protocol
--scan-open
- …open each device before printing device info
- …used to create a draft smartd.conf file
- …options after -- are appended to each output line
smartctl --scan-open -- $options > /etc/smartmontools/smartd.conf
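The scan output is line-oriented, so it can be reduced to device/type pairs for use in other scripts. A minimal sketch, assuming the line layout shown in the listing above; parse_scan is a hypothetical helper name and not part of smartmontools.

```shell
# Sketch: split "smartctl --scan-open" style lines into "<device> <type>" pairs.
# parse_scan is a hypothetical helper name; it reads the scan output from stdin.
parse_scan() {
  # Input lines look like: /dev/sda -d scsi # /dev/sda, SCSI device
  # Lines whose second field is not "-d" are skipped.
  awk '$2 == "-d" { print $1, $3 }'
}

# Sample lines copied from the listing above (no smartctl call needed)
printf '%s\n' \
  '/dev/sda -d scsi # /dev/sda, SCSI device' \
  '/dev/bus/0 -d sat+megaraid,2 # /dev/bus/0 [megaraid_disk_02] [SAT], ATA device' \
  | parse_scan
```

On a real system the input would come from smartctl --scan-open instead of printf.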
RAID Controllers
Accessing a device node provided by a RAID controller will show…
SMART support is: Unavailable - device lacks SMART capability.
Individual devices in a RAID need to be addressed individually…
- Option -d, --device= type of the device…
  - megaraid,N …N denotes which disk on the controller is monitored
  - 3ware,N …N denotes which disk on the controller is monitored
Example for a MegaRAID controller…
# device on a MegaRAID LSI controller...
smartctl -a -d megaraid,2 /dev/bus/0
# ...in case of a SATA device...
smartctl -a -d sat+megaraid,2 /dev/sda
Exit Bitmask
- Exit status defined by a bitmask…
  - Return value 0 …all bits turned off
  - Non-zero status indicates an error
- Bits…have the following meanings
  - 0 …command line did not parse
  - 1 …device open failed
  - 2 …command to the device failed
  - 3 …returned disk failing
  - 4 …found prefail attributes <= threshold
  - 5 …returned disk ok…but prefail attributes <= threshold
  - 6 …error log contains records of errors
  - 7 …self-test log contains records of errors
# looks only at bit 3 of the exit status $?
smartctl -q silent -a "$device" ; echo $(($? & 8))
# prints all status bits
smartctl -q silent -a "$device"
val=$?   # save the exit status before another command overwrites it
mask=1   # start with bit 0
for i in 0 1 2 3 4 5 6 7; do
  echo "Bit $i: $(((val & mask) && 1))"
  mask=$((mask << 1))
done
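The bit meanings listed above can be attached to the status directly. A minimal sketch, assuming the status was saved from $? right after the smartctl call; describe_smartctl_status is a hypothetical helper name, and the messages are the shortened descriptions from these notes.

```shell
# Sketch: translate a smartctl exit status into the messages listed above.
# describe_smartctl_status is a hypothetical helper name.
describe_smartctl_status() {
  status=$1
  bit=0
  for msg in \
    "command line did not parse" \
    "device open failed" \
    "command to the device failed" \
    "returned disk failing" \
    "found prefail attributes <= threshold" \
    "returned disk ok but prefail attributes <= threshold" \
    "error log contains records of errors" \
    "self-test log contains records of errors"
  do
    # Test the current bit, then move on to the next one
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $msg"
    fi
    bit=$((bit + 1))
  done
}

# Example: status 72 = 64 + 8, so bits 3 and 6 are set
describe_smartctl_status 72
```

A status of 0 prints nothing, since no bit is set.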
Attributes
All SMART attributes -a, --all print all available information…or…
- …-i, --info device model number, serial number, firmware version
- …-c, --capabilities generic SMART capabilities
Overall Health
Health status of the device…option -H, --health
- PASSED drive is in good health
- FAILED…
  - …device has already failed…
  - or…drive failure is imminent…data should be backed up
# output with warning...
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
Warning: This result is based on an Attribute check.
- …unknown ATA command return status…
- …due to bridge devices, such as RAID controller or USB bridge firmware
- …state determined by pre-failure SMART attributes
- …considered less reliable
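For scripting, the health report can be reduced to a single verdict that also records whether it rests on an attribute check only. A minimal sketch, assuming the ATA-style wording shown in the output above (other device types may word the result differently); health_verdict is a hypothetical helper name.

```shell
# Sketch: reduce "smartctl -H" output to PASSED, FAILED or UNKNOWN.
# health_verdict is a hypothetical helper; it reads the report from stdin.
health_verdict() {
  report=$(cat)
  case $report in
    *"test result: PASSED"*) verdict=PASSED ;;
    *"test result: FAILED"*) verdict=FAILED ;;
    *) verdict=UNKNOWN ;;
  esac
  # Flag results that rest on an attribute check only (considered less reliable)
  case $report in
    *"based on an Attribute check"*) echo "$verdict (attribute check only)" ;;
    *) echo "$verdict" ;;
  esac
}

# Sample lines copied from the output above
printf '%s\n' \
  'SMART overall-health self-assessment test result: PASSED' \
  'Warning: This result is based on an Attribute check.' \
  | health_verdict
```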
Pre-failure
Option -A, --attributes…
# example from a SATA SSD...
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3460
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1148
...
RAW_VALUE
- …might have a real physical interpretation (such as temperature)
- …meaning of these attribute fields has been made entirely vendor-specific
- …manufacturer converts these to normalized VALUE (between 1 and 254)
VALUE …current value of the attribute
WORST …worst (typically lowest) value SMART has ever seen
THRESH …vendor's lowest possible value considered as healthy
WHEN_FAILED…
- "-" (no entry) …attribute is not failing
- FAILING_NOW …normalized VALUE less than or equal to THRESH
- …
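The FAILING_NOW rule (normalized VALUE less than or equal to THRESH) can be checked against the attribute table with awk. A minimal sketch, assuming the ten-column layout shown above; failing_attributes is a hypothetical helper name, and the second sample row is made up for illustration.

```shell
# Sketch: list attributes whose normalized VALUE is <= THRESH.
# failing_attributes is a hypothetical helper; it reads "smartctl -A" output from stdin.
failing_attributes() {
  # Attribute rows start with a numeric ID; columns 4 and 6 are VALUE and THRESH.
  # Adding 0 forces a numeric comparison, so "000" is treated as 0.
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $1, $2 }'
}

# First row is copied from the example above; the second row is made up for illustration
printf '%s\n' \
  '  9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3460' \
  '  5 Reallocated_Sector_Ct 0x0033 005 005 010 Pre-fail Always FAILING_NOW 2000' \
  | failing_attributes
```

Only the made-up row is printed, because 5 is below its threshold of 10 while 100 is above a threshold of 0.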
smartd
smartd monitors SMART statuses and emits notifications
cat /etc/sysconfig/smartmontools # service environment
systemctl cat smartd.service # systemd service unit
journalctl -u smartd # log information
Configuration
smartd -D list all configuration directives…
Text string in capital letters…
- DEVICESCAN …text string in capital letters…
  - …ignore any remaining lines in the configuration
  - …specific devices may precede the DEVICESCAN entry
- DEFAULT …set as defaults for the next device entries
Simple example for a configuration file…default in /etc/smartmontools/smartd.conf
DEFAULT -m root@example.com
/dev/sda -s S/../.././02
/dev/sdc -d ignore
DEVICESCAN -s L/../.././02
-s <regex> device test configuration
- …extended regular expression T/MM/DD/d/HH…
- …. matches any single character
- T set to self-test L long or S short…
- MM month, DD day, d day of the week, HH hour
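A pattern can be tried against sample timestamps before it goes into smartd.conf. A minimal sketch, assuming the whole T/MM/DD/d/HH string has to match the expression and that the weekday digit runs from 1 (Monday) to 7 (Sunday); schedule_matches is a hypothetical helper name, not a smartd command.

```shell
# Sketch: test whether a -s pattern matches a T/MM/DD/d/HH timestamp.
# schedule_matches is a hypothetical helper, not a smartd command.
schedule_matches() {
  pattern=$1   # e.g. L/../.././02
  stamp=$2     # e.g. L/03/15/6/02
  # -E: extended regular expression, -x: the whole string has to match
  printf '%s\n' "$stamp" | grep -Eqx -e "$pattern"
}

# Patterns taken from the configuration example above
schedule_matches 'L/../.././02' 'L/03/15/6/02' && echo "long test due at 02:00"
schedule_matches 'L/../.././02' 'S/03/15/6/02' || echo "pattern does not cover short tests"

# Timestamp for a long test right now; %u prints the weekday as 1 (Monday) to 7 (Sunday)
now="L/$(date +%m/%d/%u/%H)"
echo "current timestamp: $now"
```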
Notification Mail
Todo…
-m <address> …send a warning email…
-M once (default) send only one warning email for each type of disk problem
-M exec <script> …run the executable instead of the default mail command
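A script used with -M exec receives details of the problem through SMARTD_* environment variables set by smartd. A minimal sketch of a handler that appends one line per event to a log file; notify_log and the SMART_NOTIFY_LOG variable are hypothetical names chosen for this example, and the default log path is an assumption.

```shell
# Sketch of a handler body for "-M exec <script>".
# smartd exports details of the problem as SMARTD_* environment variables.
# notify_log and SMART_NOTIFY_LOG are hypothetical names chosen for this example.
notify_log() {
  log=${SMART_NOTIFY_LOG:-/tmp/smartd-notify.log}
  # Fall back to placeholders when run outside of smartd
  printf '%s device=%s type=%s message=%s\n' \
    "$(date +%Y-%m-%dT%H:%M:%S)" \
    "${SMARTD_DEVICE:-unknown}" \
    "${SMARTD_FAILTYPE:-unknown}" \
    "${SMARTD_MESSAGE:-none}" >> "$log"
}
```

Saved as an executable file that ends with a call to notify_log, this could be referenced by -M exec next to a -m directive.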
Device Temperature
Option -W <diff>[,<info>[,<crit>]]…
- …report if the current temperature has changed by at least diff degrees (Celsius) since the last report
- …warn if the temperature is greater than or equal to one of info or crit
- If limit crit is reached… log message… send a warning mail if specified
-W 2 # track temperature changes of at least 2 degrees
-W 0,40 # log informal messages on temperatures of at least 40 degrees
-W 0,0,45 # warning messages/mails on temperatures of at least 45 degrees
-W 2,40,45 # combine all of the above reports
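How the three limits interact can be modelled in a few lines. This is a simplified sketch of the rules described above, not smartd's actual logic; temp_reports is a hypothetical helper name, and a limit of 0 is assumed to disable the corresponding report.

```shell
# Sketch: simplified model of the three -W reports, not smartd's actual code.
# temp_reports is a hypothetical helper; a limit of 0 is assumed to disable a report.
# Arguments: <diff> <info> <crit> <last reported temperature> <current temperature>
temp_reports() {
  diff=$1 info=$2 crit=$3 last=$4 current=$5
  # Absolute temperature change since the last report
  delta=$((current - last))
  if [ "$delta" -lt 0 ]; then
    delta=$((-delta))
  fi
  if [ "$diff" -gt 0 ] && [ "$delta" -ge "$diff" ]; then
    echo "changed by $delta degrees"
  fi
  # The critical limit takes precedence over the informal one
  if [ "$crit" -gt 0 ] && [ "$current" -ge "$crit" ]; then
    echo "critical: $current >= $crit"
  elif [ "$info" -gt 0 ] && [ "$current" -ge "$info" ]; then
    echo "info: $current >= $info"
  fi
}

# Equivalent of "-W 2,40,45" with a rise from 38 to 41 degrees
temp_reports 2 40 45 38 41
```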
Debug
Simple configuration test…
- -c - read configuration from stdin
- -q onecheck…
  - …smartd in debug mode…
  - …register devices…
  - …check device's SMART status once…
  - …exit 0 if correctly executed
# configuration via stdin...execute once
echo '/dev/sda -d megaraid,2' | smartd -c - -q onecheck
Debug mode… smartd -d
- -d, --debug displays verbose status information to standard out
  - ctrl-c to reload the configuration file
  - ctrl-\ to shutdown…
- Include information on…
  - All SMART capable devices…supporting health checks…
  - Devices which cannot be registered…