Linux Storage
Terminology
Data…set of bits in a specific order
Storage device…physical storage…
- …also called storage media or storage medium
- Physical (secondary) storage…
- HDD (Hard Disk Drive)
- SSD (Solid State Drive)
- CDs, DVDs, Flash
Drive…a single (possibly virtual) storage device
Physical devices can store data temporarily or permanently…
- Temporary volatile memory requires power to maintain stored data
- Non-volatile persistent storage retains stored data after power off
Random access devices use an abstraction layer called a block device
Volumes are a logical abstraction from storage devices
- …which typically span multiple physical devices
- Volumes can contain multiple partitions
Storage devices (and volumes) can be segmented into one or more partitions
A “raw” storage device is initialized with a file-system structure…
- …controls how information is written to and retrieved from a device
- Otherwise the storage could not be used for file related operations
- Application creates the file I/O request…
- …the file system creates a block I/O request…
- …block I/O driver accesses the disk
I/O
I/O (Input/Output)…
- I/O is issued to a storage device
- …an I/O is a single read/write request
- IOPS (I/O Operations Per Second)
Reading or writing a file can result in multiple I/O requests
- I/O requests have a size…
- …workloads issue I/O operations with different request sizes
- Request size impacts latency and IOPS
- Queue depth…number of I/O requests queued (in-flight) on average…
- …used to optimize access to the storage device
- …improves throughput at the cost of latency
Access Patterns…
- Sequential
- …operates with a large number of sequential (often adjacent) data blocks
- …may achieve the highest possible throughput on a storage device
- Random
- …I/O requests issued in a seemingly random pattern to the storage device
- …throughput and IOPS will plummet (as compared to a sequential access pattern)
Latency
A storage hierarchy separates storage into layers
- …based on latency (response time)
- …fast and large storage cannot be achieved with a single level
- …multiple levels of storage, progressively bigger and slower
Typical latency for different storage layers:
Name | Latency | Size |
---|---|---|
Register | <1ns | B |
L1 cache | ~1ns | ~32KB |
L2 cache | >1ns | <1MB |
L3 cache | >10ns | >1MB |
DRAM | >100ns | GB |
SSD | 0.1-1ms | TB |
HDD | 5-18ms | TB |
Latency…
- …time it takes for the I/O request to be completed
- Dictates the responsiveness of individual I/O operations
- IOPS metric is meaningless without a statement about latency
Throughput
Performance Indicators:
- Throughput (Tp) – Volume of data processed within a specific time interval.
- Transactions (Tr) – I/O requests processed by the device in a specific time interval.
- Average Latency (Al) – Average time for processing a single I/O request.
Throughput and transaction rate are proportional, related by the block size (Bs):
Tp [MB/s] = Tr [IO/s] × Bs [MB]
Tr [IO/s] = Tp [MB/s] ÷ Bs [MB]
Number of Worker Threads (Wt), Parallel I/Os (P)
Al [ms] = 10³ × Wt × P ÷ Tr [IO/s]
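For example (illustrative values), Tr = 25000 IO/s at Bs = 0.004 MB (4 KB), with one worker (Wt = 1) issuing P = 32 parallel I/Os:

Tp = 25000 IO/s × 0.004 MB = 100 MB/s
Al = 10³ × 1 × 32 ÷ 25000 IO/s = 1.28 ms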
Data transfer is done in multiples of the block size, which is (usually) the unit of allocation on the device.
The simplest test is to write to the file-system with `dd`:
>>> dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output
1024+0 records in
1024+0 records out
402653184 bytes (403 MB) copied, 4.28992 s, 93.9 MB/s
Similarly `hdparm` can run a quick I/O test:

>>> hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 15852 MB in 2.00 seconds = 7934.52 MB/sec
Timing buffered disk reads: 302 MB in 3.02 seconds = 100.10 MB/sec
Hierarchy
As distance to the processor increases…
- …storage size and access time increase
- Higher levels are faster…but more expensive
CPU Cache
Internal processor registers…integrated (small & fast) memory
- Registers are read/written by machine instructions
- Categories: state-, address-, and data-registers
Typically three levels of cache memory…
- L1 Cache - fastest…with the least storage capacity
- L2 Cache - not as fast…more storage capacity
- L3 Cache - slower still…even more storage capacity (SDRAM in certain cases)
Main Memory
Primary storage
- Operating at high speed compared to secondary storage
- Usually too small to store all needed programs and data permanently
- Typically volatile storage…loses its contents when powered off
Storage Drives
Secondary storage…usually called disk…
- …slower than main memory (RAM)
- …can hold data permanently
- Mass storage devices…
- HDD (Hard Disk Drive)
- SSD (Solid State Drive)
- Flash memory like USB & SD flash drives and solid-state drives (SSD)
- Optical media like CDs, DVDs, Blu-ray, etc.
Very slow secondary storage…
- …sometimes called tertiary storage
- Tape drives for offline long-term data preservation at the scale of PBs
Block Devices
Linux manages storage as “block devices” (aka block storage)
- Block devices commonly represent hardware such as disks or SSDs…
- …representing the storage as a long lineup of bytes
- Users open a device…seek the place to access…read/write data
- Read/write communication is in entire blocks (of different sizes)
- Hardware characteristics are abstracted away by kernel- or driver-level cache
- Find block devices…
ls -l /dev/[vsh]d* | grep disk # list device files
dmesg | grep '[vsh][dr][a-z][1-9]' | tr -s ' ' | cut -d' ' -f3-
sysfs
Used by programs such as `udev` to access device and device driver information:
/sys/block # contains entries for each block device
/sys/devices # global device hierarchy of all devices on the system
/sys/dev/block/ # devices with major:minor device numbers
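Individual attributes can be read directly from `sysfs`, e.g. the capacity of a device (device name below is just an example):

cat /sys/block/sda/size # device size in 512-byte sectors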
/dev/
Block devices are represented by special file system objects called device nodes…
- …visible under the `/dev` directory
- Naming convention…
  - …type of device followed by a letter…
  - …signifying the order in which they were detected by the system
  - …defined in `/lib/udev/rules.d/60-persistent-storage.rules`
List of commonly used device names…
- `hd` IDE drives (using the old PATA driver)
  - `hda` first device on the IDE controller (master)
  - `hdb` second device (slave)
- `sd` SATA/PATA (originally used for SCSI)
  - …usually, all the devices using a serial bus
  - `sda` first one, `sdb` second one, etc.
- `nvme` NVM Express, PCI devices
  - `nvme[0-9]` indicates the device controller
  - `nvme[0-9]n[1-9]` indicates a namespace on a specific controller
- `mmc` SD cards, MMC cards and eMMC storage devices
- `vda` virtio block device (virtio-blk) interface
lshw & hwinfo
The following commands present information on the storage devices:
lshw -class disk -short # disk devices...
lshw -class storage -short # storage controllers, scsi, sata, sas, etc
hwinfo --block --short # devices (and partitions)
lsblk
List block devices…
- …reads the `sysfs` and `udev` db
- Prints all block devices (except RAM disks) in a tree-like format by default
- `-o` …specify output columns
- `--help` …list of all available columns
# device vendor information
lsblk -o NAME,VENDOR,MODEL,REV,TYPE,SIZE $dev
Types
- HDD (Hard Disk Drive)
- Latency >5ms…
- …rotational delay…get the right sector
- …seek time…move the arm to the right track
- …transfer time…get the bits from the disk
- Mechanical parts favor…
- …large sequential access
- 200x slower for random access
- Supported interfaces…
- ATA
- SCSI
- SATA
- SSD (Solid State Drive)…
- …no mechanical parts like HDD
- …no vibration and sound
- …not affected by magnetism, no need for defragmentation
- …non-volatile flash memory
- …charge stored in solid material
- Supported interfaces…
- SATA (2.5”)
- M.2
- NVMe PCIe
- U.2 PCIe
nvme
NVMe Command Line Interface (NVMe-CLI)
- Source code on GitHub https://github.com/linux-nvme/nvme-cli
- …monitor the health & endurance
- …update firmware
- …securely erase storage, and read various logs
nvme list # list all devices
nvme id-ctrl -H /dev/nvme0 # details on a specific controller
nvme error-log /dev/nvme0 # print error log page
smart-log
NVMe support was added to `smartmontools` in version >= 6.5…
# check health and error logs...
smartctl -H -l error /dev/nvme0
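The same health counters can be printed with `nvme-cli` directly (controller node as above):

nvme smart-log /dev/nvme0 # print SMART / health information log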
Interpret the nvme smart-log output…
- `critical_warning` bits…can be tied to asynchronous event notification
  - `0` …available spare is below threshold
  - `1` …temperature has exceeded threshold
  - `2` …reliability is degraded due to excessive media or internal errors
  - `3` …media is placed in read-only mode
  - `4` …volatile memory backup system has failed
  - `5-7` …reserved
- `available_spare_threshold`…
  - When the available spare space is lower than the threshold…
  - …alert that the remaining life of flash memory is insufficient
- `percentage_used` …estimated used endurance of the device
- `num_err_log_entries` …error information log entries over the life of the controller
- `media_errors` non-zero…
  - ECC or CRC verification failure or LBA label mismatch error…
  - …cannot be corrected by the error correction engine
  - Non-zero means that the device is not stable
Partitions
Segments the available storage space into one or more regions…
- …partition schema is limited to a single disk
- …just a continuous set of storage blocks
- …partitions on a storage device are identified by a partition table
/proc/partitions
file -s <device> # read partition info from device
dd if=/dev/zero bs=512 count=1 of=<device> # wipe the boot sector of a device
Create a new “disk label” aka partition table
# GUID Partition Table, default on all EFI systems
parted $device mklabel gpt
# deprecated MBR (Master Boot Record) or MS-DOS partition table
parted $device mklabel msdos
Create partitions:
parted -l # list devices/partitions
parted $device print free # show free storage on device
# create a single partition using the entire device
parted -a optimal $device mkpart primary 0% 100%
parted $device rm $number # delete partition
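Putting it together, a minimal sketch to prepare a fresh device (device node is an example, adjust to your system):

dev=/dev/sdb # example device, verify with lsblk first!
parted $dev mklabel gpt # create a new GPT partition table
parted -a optimal $dev mkpart primary 0% 100% # one partition spanning the device
mkfs.ext4 ${dev}1 # initialize a file-system (cf. section below)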
POSIX
Large distributed applications are highly concurrent and often multi-tenant:
- POSIX was meant to provide a consistent view on data for (all) clients
- While storage systems are scaled, the time to sync data to guarantee consistency increases
- Applications can be dramatically slowed by bad I/O design (e.g. shared log-files, small I/O operations)
- Performance limits associated with lock contention force applications to many-file I/O patterns
Highly scalable distributed storage can no longer support the assumption that all applications look at the same view of the data. Short-term, POSIX I/O may be optimized:
- I/O needs to be engineered like computer algorithms are profiled to improve performance
- A sophisticated implementation may reduce the consistency domain to a single client
- New layers in the I/O stack like burst buffers may improve efficiency of I/O in legacy applications
Long-term I/O needs to move away from POSIX (with consistent concurrent reads and writes):
- Object-based storage requires data movement to be part of the application design
- “Lazy” data organization with directory trees suddenly disappears
- Performance optimization of I/O patterns is required for each individual storage infrastructure
Eventually a distinction between read-only, write-only and read-write data is required.
- Most of the data should be read-only and immutable (after being written once) (write-once, read-many (WORM))
- Write-only data (e.g. checkpoints) can be signed off when the writer has finished
- Data constantly in change (read-write) should live in a database
File-Systems
/proc/filesystems # list of supported file-systems
/proc/self/mountinfo # mount information
lsblk -f # list block devices with file-system type
Format a partition with a specific file-system:
mkfs.$type $partition # init fs on partition
mkfs.ext4 /dev/sdb1
File system type can have a label:
/dev/disk/by-label # list of device partitions by label
mkfs.$type -L $label ... # add a file-system label
# set the file-system label on ext{2,3,4} file-system type partition
e2label ${part:-/dev/sda1} ${label:-root}
tune2fs -L ${label:-root} ${part:-/dev/sda1}
# change the label of an exFAT formatted partition
exfatlabel ${part:-/dev/sdc1} ${label:-usb}
Multi user support with ACLs:
mnt=/mnt # mount point within the root file-system
part=/dev/sdc1 # for example, change this to your needs!
mkfs.ext4 $part # create a file-system with ACL support
tune2fs -o acl $part # enable ACLs
mount $part $mnt # mount the partition
chown $user: $mnt
chmod 777 $mnt
setfacl -m d:u::rwx,d:g::rwx,d:o::rwx $mnt
umount $mnt
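The resulting default ACL entries can be verified with `getfacl`; a quick check reusing the variables above:

mount $part $mnt # remount to inspect
getfacl $mnt # print owner, group and ACL entries
umount $mnt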
Mounts
List file systems mounted on the system:
findmnt # show tree all file systems
findmnt -l # list all file systems
findmnt -D # output like df
findmnt -s # from /etc/fstab
findmnt -S /dev/<device> # by source device
findmnt -T <path> # by mount point
findmnt -t <type>,... # by type, e.g. nfs
Mount a partition from a storage device:
sudo mount $partition $mntpoint # mount filesystem located on a device partition
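For a persistent mount across reboots add an entry to `/etc/fstab`; a sketch with placeholder UUID and mount point (use `lsblk -f` to find the UUID):

# <file system>                           <mount point> <type> <options> <dump> <pass>
UUID=0a3407de-014b-458b-b5c1-848e92a327a3 /data         ext4   defaults  0      2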
Mount a hot-plug device like a USB drive as a normal user:
sudo apt install -y pmount
pmount ${device:-/dev/sdb1} ${label:-storage}
pumount $device
The device partition is mounted below /media/$label
POSIX
POSIX I/O was designed for local storage (disks) with serial processors and workloads.
The POSIX I/O API defines how applications read/write data:
- Function calls for applications/libraries like `open()`, `close()`, `read()` and `write()`
- The POSIX semantics define what is guaranteed to happen with each API call
- E.g. `write()` is strongly consistent and guaranteed to happen before any subsequent `read()`
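These calls can be observed on any program; a quick sketch using `strace` (paths are only examples):

# trace the POSIX I/O calls issued by a simple file copy
strace -e trace=openat,read,write,close cat /etc/hostname > /tmp/hostname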
POSIX I/O is stateful:
- File descriptors are central to this process
- The persistent state of data is maintained by tracking all file descriptors
- Typically the cost of `open()` scales linearly with the number of clients making a request
POSIX I/O prescribes a specific set of metadata that all files must possess:
- Metadata includes ownership, permissions, etc.
- Each file is treated independently, recursive changes are very costly
- The POSIX metadata schema at scale is difficult to support
Typically the page cache is used to soften the latency penalty forced by POSIX consistency. Distributed storage cannot efficiently use the page cache since it is not shared among clients. Parallel file-systems may implement techniques like:
- No use of a page cache, increasing the I/O latency for small writes
- Violate (or “relax”) POSIX consistency when clients modify non-overlapping parts of a file
- Implement a distributed lock mechanism to manage concurrency
Page Cache
The page cache accelerates access to files on non-volatile storage for two reasons:
- Overcome the slow performance of permanent storage (like hard disk)
- Load data only once into RAM and share it between programs
The page cache uses free areas of memory as cache storage:
- All regular file I/O happens through the page cache
- Data not in sync with the storage marked as dirty pages
Dirty pages are periodically synchronized as soon as resources are available
- After programs write data to the page cache it is marked dirty
- The program does not block waiting for the write to be finished
- Until the sync is completed a power failure may lead to data loss
- Writes of critical data require explicit blocking until data is written
- Programs reading data typically block until the data is available
- The kernel uses read ahead to preload data in anticipation of sequential reads
The kernel frees the memory used for page cache if it is required for other applications:
free -hw # shows page cache in dedicated column
Force the Linux kernel to synchronize dirty pages with the storage:
sync # force write of dirty pages
# track the progress in writing dirty pages to storage:
watch -d grep -e Dirty: -e Writeback: /proc/meminfo
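Before an I/O benchmark it can help to start with a cold cache; a minimal sketch to evict clean pages:

sync # write dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches # drop page cache, dentries and inodes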
/proc/diskstats
I/O statistics of block devices. Each line contains the following 14 fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
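A quick sketch to pick individual fields, here completed reads and writes for one device (device name is an example):

awk '$3 == "sda" {print "reads:", $4, "writes:", $8}' /proc/diskstats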
iostat & iotop
`iostat` …I/O statistics for partitions…option `-k` prints values in kilobytes:
>>> iostat -xk 1 | awk '/sda/ {print $6,$7}'
14.36 162.23
0.00 9144.00
0.00 3028.00
...
`iotop` …list of processes/threads consuming I/O bandwidth
- In interactive mode use the arrow keys to select the column used for sorting
- `o` limits the view to active processes, and `a` accumulates the I/O counters
- Limit output with option `-Po` for active processes only
- Option `-a` accumulates I/O, `-b` enables non-interactive batch mode:
>>> iotop -bPao -u $USER
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
25722 be/4 vpenso 0.00 B 8.14 M 0.00 % 0.00 % root.exe […]
25728 be/4 vpenso 0.00 B 6.75 M 0.00 % 0.00 % root.exe […]
25750 be/4 vpenso 0.00 B 8.00 K 0.00 % 0.00 % root.exe […]
25739 be/4 vpenso 0.00 B 8.57 M 0.00 % 0.00 % root.exe […]
...
Benchmark
hdparm
- …(non-destructively) read for three seconds
- …reading through the buffer cache…without any prior caching
- …without file-system overhead
- Options for timing device reads…
  - …repeated 2-3 times on an otherwise inactive system
  - `-t` …indication of how fast the drive can sustain sequential reads
  - `-T` …indication of the throughput of the processor, cache, and memory
  - `--direct` …kernel `O_DIRECT` flag…bypasses the page cache…
dev=/dev/sda # adjust to a device node
for i in $(seq 3)
do
hdparm -tT --direct $dev
hdparm -tT $dev
done
Results with `--direct`…
Device | Type | Size (GB) | Cached Reads (MB/s) | Disk Reads (MB/s) |
---|---|---|---|---|
SAMSUNG MZQL21T9HCJR-00B7C | NVMe | 1920 | 2720.44 | 2727.02 |
INTEL SSDSC2KB480G8 | SATA | 480 | 481.49 | 589.74 |
SAMSUNG SSD 850 | SATA | 256 | 480.49 | 491.34 |
fio
Flexible IO Tester
- …developed by the maintainer of the Linux kernel’s block IO subsystem
- References…
- Package `fio*.{rpm,deb}`
- Simulate a desired I/O workload using…
  - …a job file describing a setup including…
  - …global configuration…one or more job sections
  - …`fio` parses the job file for execution
Bird's-eye view of the job file configuration…
- I/O pattern…sequential, random, mixed…
- Block size…
- I/O size…overall data read/write
- I/O engine…how the job issues I/O
- I/O depth…for `async` I/O engines
- Targets…number of files and workloads
- Threads/Processes…how many to spread the workload over
Very simple benchmark example…
# create a job file
cat > /tmp/simple.fio <<EOF
[job]
filename=/tmp/test.file
filesize=1g
readwrite=randread
bs=4k
EOF
# execute the benchmark
fio /tmp/simple.fio
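The same job can be expressed entirely on the command line; a sketch mirroring the job file above:

fio --name=job --filename=/tmp/test.file --filesize=1g --readwrite=randread --bs=4k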
Job file format…
- `--cmdhelp` lists all options
- Examples in `/usr/share/doc/fio/examples`