Linux Storage

Published: August 27, 2013
Modified: July 14, 2022

Terminology

Data…set of bits in a specific order

Storage device…physical storage…

  • …also called storage media or storage medium
  • Physical (secondary) storage…
    • HDD (Hard Disk Drive)
    • SSD (Solid State Drive)
    • CDs, DVDs, Flash

Drive…a single (possibly virtual) storage device

Physical devices can store data temporarily or permanently…

  • Temporary volatile memory requires power to maintain stored data
  • Non-volatile persistent storage retains stored data after power off

Random access devices use an abstraction layer called a block device

Volumes are a logical abstraction from storage devices

  • …which typically span multiple physical devices
  • Volumes can contain multiple partitions

Storage devices (and volumes) can be segmented into one or more partitions

A “raw” storage device is initialized with a file-system structure…

  • …controls how information is written to and retrieved from a device
  • Otherwise the storage could not be used for file related operations
  • Application creates a file I/O request…
    • …the file system creates a block I/O request…
    • …the block I/O driver accesses the disk

I/O

I/O (Input/Output)…

  • I/O is issued to a storage device
  • …an I/O is a single read/write request
  • IOPS (I/O Operations Per Second)

Reading or writing a file can result in multiple I/O requests

  • I/O requests have a size…
    • …workloads issue I/O operations with different request sizes
    • Request size impacts latency and IOPS (see the iostat example below)
  • Queue depth…number of I/O requests queued (in-flight) on average…
    • …used to optimize access to the storage device
    • …improves throughput at the cost of latency
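
For example, the average request size and queue depth can be observed per device with iostat from the sysstat package (sda is just a placeholder; newer versions name the columns areq-sz/aqu-sz, older ones avgrq-sz/avgqu-sz):

# extended statistics for one device, updated every second
iostat -x sda 1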

Access Patterns…

  • Sequential
    • …operates with a large number of sequential (often adjacent) data blocks
    • …may achieve the highest possible throughput on a storage device
  • Random
    • …I/O requests issued in a seemingly random pattern to the storage device
    • …throughput and IOPS will plummet (as compared to a sequential access pattern)

Latency

A storage hierarchy separates storage into layers

  • …based on latency (response time)
  • …fast and large storage cannot be achieved with a single level
  • …multiple levels of storage, progressively bigger and slower

Typical latency for different storage layers:

Name      Latency  Size
Register  <1ns     B
L1 cache  ~1ns     ~32KB
L2 cache  >1ns     <1MB
L3 cache  >10ns    >1MB
DRAM      >100ns   GB
SSD       1-3ms    TB
HDD       5-18ms   TB

Latency…

  • …time it takes for the I/O request to be completed
  • Dictates the responsiveness of individual I/O operations
  • IOPS metric is meaningless without a statement about latency

Throughput

Performance Indicators:

  • Throughput (Tp) – Volume of data processed within a specific time interval.
  • Transactions (Tr) – I/O requests processed by the device in a specific time interval.
  • Average Latency (Al) – Average time for processing a single I/O request.

Throughput and transaction rate are proportional, related by the block size (Bs):

Tp [MB/s] = Tr [IO/s] × Bs [MB]
Tr [IO/s] = Tp [MB/s] ÷ Bs [MB]

Number of Worker Threads (Wt), Parallel I/Os (P)

Al [ms] = 10³ × Wt × P ÷ Tr [IO/s]
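
A quick worked example with assumed numbers (25000 IO/s at a 4 KB ≈ 0.004 MB block size, one worker thread with a single I/O in flight):

# Tp = Tr × Bs = 25000 IO/s × 0.004 MB ≈ 100 MB/s
echo '25000 * 0.004' | bc
# Al = 10³ × Wt × P ÷ Tr = 1000 × 1 × 1 ÷ 25000 = 0.04 ms
echo 'scale=3; 1000 * 1 * 1 / 25000' | bc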

Data transfer is done in multiples of the block size, which is (usually) the unit of allocation on the device.

The simplest test is to write to the file-system with dd:

>>> dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output
1024+0 records in
1024+0 records out
402653184 bytes (403 MB) copied, 4.28992 s, 93.9 MB/s
>>> hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads:   15852 MB in  2.00 seconds = 7934.52 MB/sec
Timing buffered disk reads: 302 MB in  3.02 seconds = 100.10 MB/sec

Similarly, hdparm can run a quick I/O test (as shown above).

Hierarchy

As distance to the processor increases…

  • …storage size and access time increase
  • Higher levels are faster…but more expensive

CPU Cache

Internal processor registers…integrated (small & fast) memory

  • Registers are read/written by machine instructions
  • Categories: state-, address-, and data-registers

Typically three levels of cache memory…

  • L1 Cache - fastest…with the least storage capacity
  • L2 Cache - not as fast…more storage capacity
  • L3 Cache - even less fast…even more storage capacity (eDRAM in certain cases)
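
Cache sizes of a given system can be inspected from the shell, for example (lscpu -C requires a newer util-linux):

lscpu | grep -i cache          # L1/L2/L3 cache sizes
lscpu -C                       # detailed per-level cache table
getconf -a | grep -i cache     # cache sizes, line sizes and associativity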

Main Memory

Primary storage

  • Operating at high speed compared to secondary storage
  • Usually too small to store all needed programs and data permanently
  • Typically a volatile storage device…loses its contents when powered off

Storage Drives

Secondary storage…usually called disk…

  • …slower than main memory (RAM)
  • …can hold data permanently
  • Mass storage devices…
    • HDD (Hard Disk Drive)
    • SSD (Solid State Drive)
  • Flash memory like USB & SD flash drives and solid-state drives (SSD)
  • Optical media like CDs, DVDs, Blu-ray, etc.

Very slow secondary storage…

  • …sometimes called tertiary storage
  • Tape drives for offline long-term data preservation at PB scale

Block Devices

Linux manages storage as “block devices” (aka block storage)

  • Block devices commonly represent hardware such as disks or SSDs…
    • …representing the storage as a long lineup of bytes
    • Users open a device…seek the place to access…read/write data
    • Read/write communication is in entire blocks (of different sizes)
    • Hardware characteristics are abstracted away by kernel- or driver-level cache
  • Find block devices…
ls -l /dev/[vsh]d* | grep disk             # list device files
dmesg | grep '[vsh][dr][a-z][1-9]' | tr -s ' ' | cut -d' ' -f3-

sysfs is used by programs such as udev to access device and device driver information

/sys/block               # contains entries for each block device
/sys/devices             # global device hierarchy of all devices on the system
/sys/dev/block/          # devices with major:minor device numbers
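
Device attributes can be read directly from these sysfs entries, for example (sda is just a placeholder):

cat /sys/block/sda/size                      # size in 512-byte sectors
cat /sys/block/sda/queue/rotational          # 1 = rotational (HDD), 0 = non-rotational (SSD)
cat /sys/block/sda/queue/logical_block_size  # logical block size in bytes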

/dev/

Block devices represented by special file system objects called device nodes

  • …visible under the /dev directory
  • Naming convention…
    • …type of device followed by a letter…
    • …signifying the order in which they were detected by the system
    • …defined in /lib/udev/rules.d/60-persistent-storage.rules

List of commonly used device names…

  • hd IDE drives (using the old PATA driver)
    • hda first device on the IDE controller (master)
    • hdb second device (slave)
  • sd SATA/PATA (originally used for SCSI)
    • …usually, all the devices using a serial bus
    • sda first one, sdb second one, etc.
  • nvme NVM Express, PCIe devices
    • nvme[0-9] indicates the device controller
    • nvme[0-9]n[1-9] indicates a device on a specific controller
  • mmc SD cards, MMC cards and eMMC storage devices
  • vda virtio block device (virtio-blk) interface
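
In addition udev maintains persistent symlinks below /dev/disk/ which are stable across reboots and device reordering:

ls -l /dev/disk/by-id/       # names derived from serial numbers / WWNs
ls -l /dev/disk/by-path/     # names derived from the physical bus path
ls -l /dev/disk/by-uuid/     # file-system UUIDs
ls -l /dev/disk/by-label/    # file-system labels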

lshw, hwinfo

Following commands present information on the storage devices

lshw -class disk -short           # disk devices...
lshw -class storage -short        # storage controllers, scsi, sata, sas, etc
hwinfo --block --short            # devices (and partitions)

lsblk

List block devices…

  • …reads the sysfs and udev db
  • Prints all block devices (except RAM disks) in a tree-like format by default
  • -o…specify output columns
    • --help…list of all available columns
# device vendor information
lsblk -o NAME,VENDOR,MODEL,REV,TYPE,SIZE $dev

Types

  • HDD (Hard Disk Drive)
    • Latency >5ms…
      • …rotational delay…get the right sector
      • …seek time…move the arm to the right track
      • …transfer time…get the bits from the disk
    • Mechanical parts favor…
      • …large sequential access
      • 200x slower for random access
    • Supported interfaces…
      • ATA
      • SCSI
      • SATA
  • SSD (Solid State Drive)…
    • …no mechanical parts like HDD
    • …no vibration and sound
    • …no magnetism and defragmentation
    • …non-volatile flash memory
    • …charge stored in solid material
    • Supported interfaces…
      • SATA (2.5”)
      • M.2
      • NVMe PCIe
      • U.2 PCIe
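
Whether a given device is rotational (HDD) or not (SSD/NVMe) can be checked with lsblk, for example:

# ROTA column: 1 = rotational (HDD), 0 = non-rotational (SSD/NVMe)
lsblk -d -o NAME,ROTA,TYPE,SIZE,MODEL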

nvme

NVMe Command Line Interface (NVMe-CLI)

nvme list                      # list all devices
nvme id-ctrl -H /dev/nvme0     # details on a specific controller
nvme error-log /dev/nvme0      # print error log page

smart-log

NVMe support was added to smartmontools with version 6.5…

# check health and error logs...
smartctl -H -l error /dev/nvme0
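
The same health information is available via the NVMe-CLI:

# read the SMART / health information log page of a controller
nvme smart-log /dev/nvme0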

Interpret the nvme smart-log output…

  • critical_warning bits…can be tied to asynchronous event notification
    • 0…available spare is below threshold
    • 1…temperature has exceeded threshold
    • 2…reliability is degraded due to excessive media or internal errors
    • 3…media is placed in read-only mode
    • 4…volatile memory backup system has failed
    • 5-7…reserved
  • available_spare_threshold
    • When the available spare space is lower than the threshold…
    • …alert that the remaining life of flash memory is insufficient
  • percentage_used…estimated used endurance of the device
  • num_err_log_entries…Error Information log entries over the life of the controller
  • media_errors non zero
    • ECC or CRC verification failure or LBA label mismatch error…
    • …cannot be corrected by the error correction engine
    • Non-zero means the device is not stable

Partitions

Segments the available storage space into one or more regions…

  • …partition schema is limited to a single disk
  • …just a continuous set of storage blocks
  • …partitions on a storage device are identified by a partition table
/proc/partitions
file -s <device>                               # read partition info from device
dd if=/dev/zero bs=512 count=1 of=<device>     # wipe the boot sector of a device

Create a new “disk label” aka partition table

# GUID Partition Table default on all EFI systems
parted $device mklabel gpt
# deprecated MBR (Master Boot Record) aka MS-DOS partition table
parted $device mklabel msdos

Create partitions:

parted -l                                      # list devices/partitions
parted $device print free                      # show free storage on device
# create a single partition using the entire device
parted -a optimal $device mkpart primary 0% 100%
parted $device rm $number                      # delete partition

POSIX

Large distributed applications are highly concurrent and often multi-tenant:

  • POSIX was meant to provide a consistent view on data for (all) clients
  • While storage systems are scaled, the time to sync data to guarantee consistency increases
  • Applications can be dramatically slowed by bad I/O design (e.g. shared log-files, small I/O operations)
  • Performance limits associated with lock contention force applications toward many-file I/O patterns

Highly scalable distributed storage can no longer support the assumption that all applications see the same view of the data. Short-term, POSIX I/O may be optimized:

  • I/O needs to be engineered like computer algorithms are profiled to improve performance
  • Sophisticated implementations may reduce the consistency domain to a single client
  • New layers in the I/O stack like burst buffers may improve efficiency of I/O in legacy applications

Long-term, I/O needs to move away from POSIX (with consistent concurrent reads and writes):

  • Object-based storage requires data movement to be part of the application design
  • “Lazy” data organization with directory trees suddenly disappears
  • Performance optimization of I/O patterns is required for each individual storage infrastructure

Eventually a distinction between read-only, write-only and read-write data is required.

  • Most of the data should be read-only and immutable (after being written once) (write-once, read-many (WORM))
  • Write-only data (e.g. checkpoints) can be signed off when the writer has finished
  • Data constantly in change (read-write) should live in a database

File-Systems

/proc/filesystems                         # list of supported file-systems
/proc/self/mountinfo                      # mount information
lsblk -f                                  # list block devices with file-system type

Format a partition with a specific file-system:

mkfs.$type $partition                     # init fs on partition
mkfs.ext4 /dev/sdb1

A file-system can have a label:

/dev/disk/by-label                        # list of device partitions by label
mkfs.$type -L $label ...                  # add a file-system label
# set the file-system label on ext{2,3,4} file-system type partition 
e2label ${part:-/dev/sda1} ${label:-root}
tune2fs -L ${label:-root} ${part:-/dev/sda1}
# change the label of an exFAT formatted partition
exfatlabel ${part:-/dev/sdc1} ${label:-usb}

Multi-user support with ACLs:

mnt=/mnt                      # mount point within the root file-system
part=/dev/sdc1                # for example, change this to your needs!
mkfs.ext4 $part               # create a file-system with ACL support
tune2fs -o acl $part          # enable ACLs
mount $part $mnt              # mount the partition
chown $user: $mnt
chmod 777 $mnt
setfacl -m d:u::rwx,d:g::rwx,d:o::rwx $mnt
umount $mnt

Mounts

List file systems mounted on the system:

findmnt                                   # show all file systems in a tree
findmnt -l                                # list all file systems
findmnt -D                                # output like df
findmnt -s                                # from /etc/fstab
findmnt -S /dev/<device>                  # by source device
findmnt -T <path>                         # by mount point
findmnt -t <type>,...                     # by type, e.g. nfs

Mount a partition from a storage device:

sudo mount $partition $mntpoint    # mount filesystem located on a device partition

Mount a hot-plug device like a USB drive as a normal user:

sudo apt install -y pmount
pmount ${device:-/dev/sdb1} ${label:-storage}
pumount $device

The device partition is mounted below /media/$label

POSIX

POSIX I/O was designed for local storage (disks) with serial processors and workloads.

The POSIX I/O API defines how applications read/write data:

  • Function calls for applications/libraries like open(), close(), read() and write()
  • The POSIX semantics define what is guaranteed to happen with each API call
  • E.g. write() is strongly consistent and guaranteed to happen before any subsequent read()
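
These calls can be observed from the shell with strace, for example (assuming strace is installed):

# trace the POSIX I/O system calls issued by a simple file read
# modern libc implements open() via the openat() system call
strace -e trace=openat,read,write,close cat /etc/hostname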

POSIX I/O is stateful:

  • File descriptors are central to this process
  • The persistent state of data is maintained by tracking all file descriptors
  • Typically the cost of open() scales linearly with the number of clients making a request

POSIX I/O prescribes a specific set of metadata that all files must possess:

  • Metadata includes ownership, permissions, etc.
  • Each file is treated independently, recursive changes are very costly
  • The POSIX metadata schema is difficult to support at scale

Typically the page cache is used to soften the latency penalty forced by POSIX consistency. Distributed storage cannot efficiently use a page cache since it is not shared among clients. Parallel file-systems may implement techniques like:

  • No use of a page cache, increasing the I/O latency for small writes
  • Violate (or “relax”) POSIX consistency when clients modify non-overlapping parts of a file
  • Implement a distributed lock mechanism to manage concurrency

Page Cache

Page cache accelerates accesses to files on non volatile storage for two reasons:

  1. Overcome the slow performance of permanent storage (like hard disk)
  2. Load data only once into RAM and share it between programs

The page cache uses free areas of memory as cache storage:

  • All regular file I/O happens through the page cache
  • Data not in sync with the storage is marked as dirty pages

Dirty pages are periodically synchronized as soon as resources are available

  • After programs write data to the page cache it is marked as dirty
  • The program does not block waiting for the write to be finished
  • Until the sync is completed a power failure may lead to data loss
  • Writes of critical data require explicit blocking until the data is written
  • Programs reading data typically block until the data is available
  • The kernel uses read ahead to preload data in anticipation of sequential reads (see below)
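
The read-ahead setting of a block device can be inspected, for example (sda is just a placeholder):

blockdev --getra /dev/sda                  # read-ahead in 512-byte sectors
cat /sys/block/sda/queue/read_ahead_kb     # read-ahead in KB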

The kernel frees the memory used for page cache if it is required for other applications:

free -hw                  # shows page cache in dedicated column

Force the Linux kernel to synchronize dirty pages with the storage:

sync                      # force write of dirty pages
# track the progress in writing dirty pages to storage:
watch -d grep -e Dirty: -e Writeback: /proc/meminfo

/proc/diskstats

I/O statistics of block devices. Each line contains the following 14 fields:

1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
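
A minimal example reading selected fields for a single device (sda is just a placeholder, sectors are 512 bytes):

# print sectors read (field 6) and written (field 10) for one device
awk '$3 == "sda" {print "sectors read:", $6, "sectors written:", $10}' /proc/diskstats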

iostat & iotop

iostat I/O statistics for partitions…option -k prints values in kilobytes

>>> iostat -xk 1  | awk '/sda/ {print $6,$7}'                  
14.36 162.23
0.00 9144.00
0.00 3028.00
...

iotop…list of processes/threads consuming IO bandwidth

  • In interactive mode use the arrow keys to select the column used for sorting
  • The o key limits the view to active processes, the a key accumulates the I/O counters
  • Limit output with option -Po to active processes only
  • Option -a accumulates I/O, -b enables non-interactive batch mode:
>>> iotop -bPao -u $USER
Total DISK READ:       0.00 B/s | Total DISK WRITE:       0.00 B/s
PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
25722 be/4 vpenso        0.00 B      8.14 M  0.00 %  0.00 % root.exe []
25728 be/4 vpenso        0.00 B      6.75 M  0.00 %  0.00 % root.exe []
25750 be/4 vpenso        0.00 B      8.00 K  0.00 %  0.00 % root.exe []
25739 be/4 vpenso        0.00 B      8.57 M  0.00 %  0.00 % root.exe []
...

Benchmark

hdparm

  • …(non-destructively) read for three seconds
  • …reading through the buffer cache…without any prior caching
  • …without file-system overhead
  • Timings of device reads options…
    • …repeated 2-3 times on an otherwise inactive system
    • -t…indication of how fast the drive can sustain sequential reads
    • -T…indication of the throughput of the processor, cache, and memory
    • --direct…kernel O_DIRECT flag…bypasses the page cache…
dev=/dev/sda   # adjust to a device node
for i in $(seq 3) 
do  
        hdparm -tT --direct $dev
        hdparm -tT $dev
done

Results with --direct…

Device                      Type  Size (GB)  Cached Reads (MB/s)  Disk Reads (MB/s)
SAMSUNG MZQL21T9HCJR-00B7C  NVMe  1920       2720.44              2727.02
INTEL SSDSC2KB480G8         SATA  480        481.49               589.74
SAMSUNG SSD 850             SATA  256        480.49               491.34

fio

Flexible IO Tester

Bird's-eye view of the job file configuration…

  • I/O pattern…sequential, random, mixed…
  • Block size…
  • I/O size…overall data read/write
  • I/O engine…how the job issues I/O
  • I/O depth…for async I/O engine
  • Targets…number of files and workloads
  • Threads/Processes…how many threads or processes to spread this workload over

Very simple benchmark example…

# create a job file
cat > /tmp/simple.fio <<EOF
[job]
filename=/tmp/test.file
filesize=1g
readwrite=randread
bs=4k
EOF
# execute the benchmark
fio /tmp/simple.fio

Job file format…

  • --cmdhelp lists all options
  • Examples in /usr/share/doc/fio/examples
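
Job file options can also be given on the command line prefixed with --, for example the equivalent of the job above:

fio --name=job --filename=/tmp/test.file --filesize=1g --readwrite=randread --bs=4k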