SMART Reliability Monitoring

Linux
Storage
Published: August 12, 2019

Modified: May 14, 2024

SMART (Self-Monitoring, Analysis and Reporting Technology)…

References…

Package smartmontools on RPM distributions…

smartctl

smartctl ... <device> addresses a storage device…

  • device derived from the path to the device node
    • /dev/sd[a-z] for SATA
    • /dev/nvme[0-9] NVMe (broadcast), /dev/nvme[0-9]n[1-9] (specific namespace)

Find storage devices…

# smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d sat+megaraid,2 # /dev/bus/0 [megaraid_disk_02] [SAT], ATA device
/dev/bus/0 -d sat+megaraid,3 # /dev/bus/0 [megaraid_disk_03] [SAT], ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
  • --scan…device name, device type and protocol
  • --scan-open
    • …open each device before printing device info
    • …used to create a draft smartd.conf file
    • …all options after -- are appended to each output line
smartctl --scan-open -- $options > /etc/smartmontools/smartd.conf
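
The scan output above is easy to post-process, for example to loop over all devices in a script. A minimal sketch, assuming the line layout shown in the example output (device, -d, type, then a # comment); the function name scan_to_pairs is made up for illustration:

```shell
# Print "<device> <type>" for every line of `smartctl --scan-open` output
# read from stdin. The trailing "# ..." comment of each line is dropped,
# and lines that do not have "-d" as second field are skipped.
scan_to_pairs() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Possible usage (needs root):
#   smartctl --scan-open | scan_to_pairs
```

Each output line can then be split into the device path and the value for -d.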

RAID Controllers

Accessing a device node provided by a RAID controller will show…

SMART support is:     Unavailable - device lacks SMART capability.

Individual devices in a RAID need to be addressed individually…

  • Option -d, --device= type of the device…
    • megaraid,NN denotes which disk on the controller is monitored
    • 3ware,NN denotes which disk on the controller is monitored

Example for a MegaRAID controller…

# device on a MegaRAID LSI controller...
smartctl -a -d megaraid,2 /dev/bus/0
# ...in case of a SATA device...
smartctl -a -d sat+megaraid,2 /dev/sda

Exit Bitmask

  • Exit status defined by a bitmask
    • Return value 0…all bits turned off
    • Non-zero status indicates an error
  • Bits…have the following meanings
    • 0…command line did not parse
    • 1…device open failed
    • 2…command to the device failed
    • 3…returned disk failing
    • 4…found prefail attributes <= threshold
    • 5…returned disk ok…but attributes have been <= threshold at some time in the past
    • 6…error log contains records of errors
    • 7…self-test log contains records of errors
# looks only at bit 3 of the exit status $?
smartctl -q silent -a $device ; echo $(($? & 8))

# prints all status bits
smartctl -q silent -a $device
# save the exit status and start with the mask for bit 0
val=$?; mask=1
for i in 0 1 2 3 4 5 6 7; do
    echo "Bit $i: $(((val & mask) && 1))"
    mask=$((mask << 1))
done
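
For monitoring scripts it can help to translate the bitmask into text. A minimal sketch that maps each set bit to the meanings listed above; the function name decode_smartctl_status and the wording of the messages are illustrative, not part of smartmontools:

```shell
# Print one line per set bit of a smartctl exit status (0-255).
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok: all bits turned off"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "command to the device failed" \
        "returned disk failing" \
        "found prefail attributes <= threshold" \
        "disk ok, but attributes were <= threshold in the past" \
        "error log contains records of errors" \
        "self-test log contains records of errors"
    do
        # shift the wanted bit to position 0 and test it
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Possible usage:
#   smartctl -q silent -a "$device"; decode_smartctl_status $?
```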

Attributes

Option -a, --all prints all available SMART information…or…

  • -i, --info device model number, serial number, firmware version
  • -c, --capabilities generic SMART capabilities

Overall Health

Health status of the device…option -H, --health

  • PASSED drive is in good health
  • FAILED
    • …device has already failed…
    • or…drive failure is imminent…data should be backed up
# output with warning...
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED 
Warning: This result is based on an Attribute check.
  • Warning: This result is based on an Attribute check.
    • …unknown ATA command return status…
    • …caused by bridge devices such as RAID controller or USB bridge firmware
    • …state determined by pre-failure SMART attributes
    • …considered less reliable
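
A script can tell the three cases apart by looking at the text of the health report. A rough sketch, assuming the ATA-style output lines shown above (SCSI and NVMe devices print different lines); the function name and the label PASSED-ATTRIBUTE-CHECK are made up for this example:

```shell
# Classify `smartctl -H` output read from stdin.
# Prints FAILED, PASSED-ATTRIBUTE-CHECK, PASSED or UNKNOWN.
classify_health() {
    report=$(cat)
    case $report in
        *"test result: FAILED"*)
            echo FAILED ;;
        *"test result: PASSED"*"based on an Attribute check"*)
            # status derived from attributes only, less reliable
            echo PASSED-ATTRIBUTE-CHECK ;;
        *"test result: PASSED"*)
            echo PASSED ;;
        *)
            echo UNKNOWN ;;
    esac
}

# Possible usage:
#   smartctl -H "$device" | classify_health
```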

Pre-failure

Option -A, --attributes

# example from a SATA SSD...
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME        FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0032 100   100   000    Old_age Always      -       0
  9 Power_On_Hours        0x0032 100   100   000    Old_age Always      -       3460
 12 Power_Cycle_Count     0x0032 100   100   000    Old_age Always      -       1148
...
  • RAW_VALUE
    • …might have a real physical interpretation (such as temperature)
    • …meaning of these attribute fields has been made entirely vendor-specific
    • …manufacturer converts these to a normalized VALUE (between 1 and 254)
  • VALUE…current value of the attribute
  • WORST…worst (typically lowest) value SMART has ever seen
  • THRESH…vendor's lowest value still considered healthy
  • WHEN_FAILED
    • -…no entry…attribute is not failing
    • FAILING_NOW…normalized VALUE less than or equal to THRESH
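
The FAILING_NOW rule can be checked directly against the attribute table. A minimal sketch, assuming the ten-column ATA layout from the example above; it applies only the VALUE <= THRESH rule from these notes plus the WHEN_FAILED column, which is a simplification of what smartctl itself evaluates. The function name failing_attributes is hypothetical:

```shell
# Read `smartctl -A` output on stdin and print every attribute whose
# normalized VALUE is at or below THRESH, or whose WHEN_FAILED is not "-".
# Fields: 1=ID 2=NAME 4=VALUE 6=THRESH 9=WHEN_FAILED (RAW_VALUE may contain spaces).
failing_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value = $4 + 0      # force numeric comparison ("005" -> 5)
            thresh = $6 + 0
            if ($9 != "-" || value <= thresh)
                print $1, $2, "VALUE=" $4, "THRESH=" $6, "WHEN_FAILED=" $9
        }
    '
}

# Possible usage:
#   smartctl -A "$device" | failing_attributes
```

No output means that no attribute in the table is at or below its threshold.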

smartd

smartd monitors SMART statuses and emits notifications

cat /etc/sysconfig/smartmontools    # print the service environment file
/etc/sysconfig/smartmontools        # service environment
systemctl cat smartd.service        # systemd service unit
journalctl -u smartd                # log information

Configuration

smartd -D list all configuration directives…

Text string in capital letters…

  • DEVICESCAN text string in capital letters…
    • …ignore any remaining lines in the configuration
    • …specific devices may precede the DEVICESCAN entry
  • DEFAULT …set as defaults for the next device entries

Simple example for a configuration file…default in /etc/smartmontools/smartd.conf

DEFAULT -m root@example.com
/dev/sda -s S/../.././02
/dev/sdc -d ignore
DEVICESCAN -s L/../.././02
  • -s <regex> device test configuration
    • …extended regular expression T/MM/DD/d/HH
    • . matches any single character
    • T type of self-test, L long or S short…
    • MM month, DD day, d day of the week, HH hour
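
To see when a schedule would fire, the regular expression can be tried against a T/MM/DD/d/HH string by hand. A small sketch using grep -E; the helper name schedule_matches is made up, and anchoring the expression to the whole string is an assumption of this sketch rather than a statement about how smartd matches internally:

```shell
# Return success if the -s regular expression in $1 matches the
# T/MM/DD/d/HH string in $2 (anchored to the whole string here).
schedule_matches() {
    printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

# Build the string for a short test at the current time
# (date +%u prints the day of the week as 1..7, Monday is 1):
#   now="S/$(date +%m/%d/%u/%H)"
#   schedule_matches 'S/../.././02' "$now" && echo "short test due"
```

With the example configuration, S/../.././02 matches any day at hour 02, so S/05/14/2/02 matches while S/05/14/2/03 does not.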

Notification Mail

Todo…

  • -m <address>…send a warning email…
  • -M once (default) send only one warning email for each type of disk problem
  • -M exec <script>…run the executable instead of the default mail command
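
A script run via -M exec receives the details of the event in environment variables set by smartd, among them SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE. A minimal sketch of a formatter such a script could use; the function name, the output format and the fallback texts are assumptions for illustration:

```shell
# Format one log line from the environment that smartd passes to the
# -M exec script. Missing variables fall back to placeholder text.
format_smartd_event() {
    printf 'smartd %s on %s: %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_MESSAGE:-no message}"
}

# Inside the script the line could be sent to syslog:
#   format_smartd_event | logger -t smartd-notify
```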

Device Temperature

Option -W <diff>[,<info>[,<crit>]]

  • …report if the current temperature has changed by at least diff degrees (Celsius) since the last report
  • …warn if the temperature is greater than or equal to one of info or crit
  • If limit crit is reached… log message… send a warning mail if specified
-W 2        # track temperature changes of at least 2 degrees
-W 0,40     # log informational messages on temperatures of at least 40 degrees
-W 0,0,45   # warning messages/mails on temperatures of at least 45 degrees
-W 2,40,45  # combine all of the above reports
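
How the three limits interact can be modelled in a few lines. This is a simplified model for illustration only and not how smartd implements temperature tracking; it assumes that a limit of 0 disables the corresponding report, as in the examples above. The function name temperature_reports is hypothetical:

```shell
# List which -W reports a temperature reading would trigger.
# Arguments: last current diff info crit (degrees Celsius, 0 disables a limit).
temperature_reports() {
    last=$1 current=$2 diff=$3 info=$4 crit=$5
    delta=$((current - last))
    if [ "$delta" -lt 0 ]; then
        delta=$((0 - delta))
    fi
    reports=""
    if [ "$diff" -gt 0 ] && [ "$delta" -ge "$diff" ]; then
        reports="$reports change"
    fi
    # the critical limit takes precedence over the informational one
    if [ "$crit" -gt 0 ] && [ "$current" -ge "$crit" ]; then
        reports="$reports critical"
    elif [ "$info" -gt 0 ] && [ "$current" -ge "$info" ]; then
        reports="$reports info"
    fi
    echo "${reports# }"
}

# Possible usage, last report at 38 degrees, now 41, limits from -W 2,40,45:
#   temperature_reports 38 41 2 40 45
```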

Debug

Simple configuration test…

  • -c - read configuration from stdin
  • -q onecheck
    • …smartd in debug mode…
    • …register devices…
    • …check device’s SMART status once…
    • …exit 0 if correctly executed
# configuration via stdin...execute once
echo '/dev/sda -d megaraid,2' | smartd -c - -q onecheck
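
The one-time check can be wrapped so that a changed configuration file is tested before the service is restarted. A small sketch that relies on the exit status described above; the function name check_smartd_conf and the variable SMARTD_CMD are made up for this example, the latter only to allow a dry run without smartd installed:

```shell
# Run smartd once against a candidate configuration file and report the result.
# SMARTD_CMD is a hypothetical override for the binary (defaults to smartd).
check_smartd_conf() {
    if "${SMARTD_CMD:-smartd}" -c "$1" -q onecheck; then
        echo "configuration OK: $1"
    else
        echo "configuration check failed: $1" >&2
        return 1
    fi
}

# Possible usage (needs root):
#   check_smartd_conf /etc/smartmontools/smartd.conf && systemctl restart smartd
```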

Debug mode… smartd -d

  • -d, --debug displays verbose status information on standard output
    • ctrl-c to reload the configuration file
    • ctrl-\ to shut down…
  • Include information on…
    • All SMART capable devices…supporting health checks…
    • Devices which cannot be registered…