Linux Storage
Terminology
Data…set of bits in a specific order
Storage device…physical storage…
- …also called storage media or storage medium
- Physical (secondary) storage…
- HDD (Hard Disk Drive)
- SSD (Solid State Drive)
- CDs, DVDs, Flash
Drive…a single (possibly virtual) storage device
Physical devices can store data temporarily or permanently…
- Temporary volatile memory requires power to maintain stored data
- Non-volatile persistent storage retains stored data after power off
Random access devices use an abstraction layer called a block device
Volumes are a logical abstraction from storage devices
- …which typically span multiple physical devices
- Volumes can contain multiple partitions
Storage devices (and volumes) can be segmented into one or more partitions
A “raw” storage device is initialized with a file-system structure…
- …controls how information is written to and retrieved from a device
- Otherwise the storage could not be used for file related operations
- Application creates the file I/O request…
- …the file system creates a block I/O request…
- …block I/O driver accesses the disk
I/O
I/O (Input/Output)…
- I/O is issued to a storage device
- …an I/O is a single read/write request
- IOPS (I/O Operations Per Second)
Reading or writing a file can result in multiple I/O requests
- I/O requests have a size…
- …workloads issue I/O operations with different request sizes
- Request size impacts latency and IOPS
- Queue depth…number of I/O requests queued (in-flight) on average…
- …used to optimize access to the storage device
- …improves throughput at the cost of latency
Access Patterns…
- Sequential
- …operates with a large number of sequential (often adjacent) data blocks
- …may achieve the highest possible throughput on a storage device
- Random
- …I/O requests issued in a seemingly random pattern to the storage device
- …throughput and IOPS will plummet (as compared to a sequential access pattern)
Latency
A storage hierarchy separates storage into layers
- …based on latency (response time)
- …fast and large storage cannot be achieved with a single level
- …multiple levels of storage, progressively bigger and slower
Typical latency for different storage layers:
Name | Latency | Size |
---|---|---|
Register | <1ns | B |
L1 cache | ~1ns | ~32KB |
L2 cache | >1ns | <1MB |
L3 cache | >10ns | >1MB |
DRAM | >100ns | GB |
SSD | 0.1-1ms | TB |
HDD | 5-18ms | TB |
Latency…
- …time it takes for the I/O request to be completed
- Dictates the responsiveness of individual I/O operations
- IOPS metric is meaningless without a statement about latency
Throughput
Performance Indicators:
- Throughput (Tp) – Volume of data processed within a specific time interval.
- Transactions (Tr) – I/O requests processed by the device in a specific time interval.
- Average Latency (Al) – Average time for processing a single I/O request.
Throughput and transaction rate are proportional, related by the block size (Bs):
Tp [MB/s] = Tr [IO/s] × Bs [MB]
Tr [IO/s] = Tp [MB/s] ÷ Bs [MB]
Number of Worker Threads (Wt), Parallel I/Os (P)
Al [ms] = 10³ × Wt × P ÷ Tr [IO/s]
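For example (illustrative values), Tr = 25000 IO/s at Bs = 0.004 MB (4 KB), with one worker (Wt = 1) issuing P = 32 parallel I/Os:

Tp = 25000 IO/s × 0.004 MB = 100 MB/s
Al = 10³ × 1 × 32 ÷ 25000 IO/s = 1.28 ms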
Data transfer is done in multiples of the block size, which is (usually) the unit of allocation on the device.
The simplest test is to write to the file-system with `dd`:
>>> dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output
1024+0 records in
1024+0 records out
402653184 bytes (403 MB) copied, 4.28992 s, 93.9 MB/s
Similarly `hdparm` can run a quick I/O test:

>>> hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 15852 MB in 2.00 seconds = 7934.52 MB/sec
Timing buffered disk reads: 302 MB in 3.02 seconds = 100.10 MB/sec
Hierarchy
As distance to the processor increases…
- …storage size and access time increase
- Higher levels are faster…but more expensive
CPU Cache
Internal processor registers…integrated (small & fast) memory
- Registers are read/written by machine instructions
- Categories: state-, address-, and data-registers
Typically three levels of cache memory…
- L1 Cache - fastest…with the least storage capacity
- L2 Cache - not as fast…more storage capacity
- L3 Cache - slower still…even more storage capacity (SDRAM in certain cases)
Main Memory
Primary storage
- Operating at high speed compared to secondary storage
- Usually too small to store all needed programs and data permanently
- Typically volatile storage…loses its contents when powered off
Storage Drives
Secondary storage…usually called disk…
- …slower than main memory (RAM)
- …can hold data permanently
- Mass storage devices…
- HDD (Hard Disk Drive)
- SSD (Solid State Drive)
- Flash memory like USB & SD flash drives and solid-state drives (SSD)
- Optical media like CDs, DVDs, Blu-ray, etc.
Very slow secondary storage…
- …sometimes called tertiary storage
- Tape drives for offline long-term data preservation at the scale of PBs
Block Devices
Linux manages storage as “block devices” (aka block storage)
- Block devices commonly represent hardware such as disks or SSDs…
- …representing the storage as a long lineup of bytes
- Users open a device…seek the place to access…read/write data
- Read/write communication is in entire blocks (of different sizes)
- Hardware characteristics are abstracted away by kernel- or driver-level cache
- Find block devices…
ls -l /dev/[vsh]d* | grep disk # list device files
dmesg | grep '[vsh][dr][a-z][1-9]' | tr -s ' ' | cut -d' ' -f3-
sysfs
Used by programs such as `udev` to access device and device driver information:
/sys/block # contains entries for each block device
/sys/devices # global device hierarchy of all devices on the system
/sys/dev/block/ # devices with major:minor device numbers
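Individual attributes can be read directly from `sysfs`, e.g. the capacity of a device (device name below is just an example):

cat /sys/block/sda/size # device size in 512-byte sectors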
/dev/
Block devices are represented by special file system objects called device nodes…
- …visible under the `/dev` directory
- Naming convention…
  - …type of device followed by a letter…
  - …signifying the order in which they were detected by the system
  - …defined in `/lib/udev/rules.d/60-persistent-storage.rules`
List of commonly used device names…
- `hd` IDE drives (using the old PATA driver)
  - `hda` first device on the IDE controller (master)
  - `hdb` second device (slave)
- `sd` SATA/PATA (originally used for SCSI)
  - …usually, all the devices using a serial bus
  - `sda` first one, `sdb` second one, etc.
- `nvme` NVM Express, PCI devices
  - `nvme[0-9]` indicates the device controller
  - `nvme[0-9]n[1-9]` indicates a namespace on a specific controller
- `mmc` SD cards, MMC cards and eMMC storage devices
- `vda` virtio block device (virtio-blk) interface
lshw & hwinfo
The following commands present information on the storage devices:
lshw -class disk -short # disk devices...
lshw -class storage -short # storage controllers, scsi, sata, sas, etc
hwinfo --block --short # devices (and partitions)
lsblk
List block devices…
- …reads the `sysfs` and `udev` db
- Prints all block devices (except RAM disks) in a tree-like format by default
- `-o` …specify output columns
- `--help` …list of all available columns
# device vendor information
lsblk -o NAME,VENDOR,MODEL,REV,TYPE,SIZE $dev
Types
- HDD (Hard Disk Drive)
- Latency >5ms…
- …rotational delay…get the right sector
- …seek time…move the arm to the right track
- …transfer time…get the bits from the disk
- Mechanical parts favor…
- …large sequential access
- 200x slower for random access
- Supported interfaces…
- ATA
- SCSI
- SATA
- SSD (Solid State Drive)…
- …no mechanical parts like HDD
- …no vibration and sound
- …not affected by magnetism, no need for defragmentation
- …non-volatile flash memory
- …charge stored in solid material
- Supported interfaces…
- SATA (2.5”)
- M.2
- NVMe PCIe
- U.2 PCIe
nvme
NVMe Command Line Interface (NVMe-CLI)
- Source code on GitHub https://github.com/linux-nvme/nvme-cli
- …monitor the health & endurance
- …update firmware
- …securely erase storage, and read various logs
nvme list # list all devices
nvme id-ctrl -H /dev/nvme0 # details on a specific controller
nvme error-log /dev/nvme0 # print error log page
smart-log
NVMe support was added to `smartmontools` in version >= 6.5…
# check health and error logs...
smartctl -H -l error /dev/nvme0
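The same health counters can be printed with `nvme-cli` directly (controller node as above):

nvme smart-log /dev/nvme0 # print SMART / health information log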
Interpret the nvme smart-log output…
- `critical_warning` bits…can be tied to asynchronous event notification
  - `0` …available spare is below threshold
  - `1` …temperature has exceeded threshold
  - `2` …reliability is degraded due to excessive media or internal errors
  - `3` …media is placed in read-only mode
  - `4` …volatile memory backup system has failed
  - `5-7` …reserved
- `available_spare_threshold`…
  - When the available spare space is lower than the threshold…
  - …alert that the remaining life of flash memory is insufficient
- `percentage_used` …estimated used endurance of the device
- `num_err_log_entries` …error information log entries over the life of the controller
- `media_errors` non-zero…
  - ECC or CRC verification failure or LBA label mismatch error…
  - …cannot be corrected by the error correction engine
  - Non-zero means that the device is not stable
Partitions
Segments the available storage space into one or more regions…
- …partition schema is limited to a single disk
- …just a continuous set of storage blocks
- …partitions on a storage device are identified by a partition table
/proc/partitions
file -s <device> # read partition info from device
dd if=/dev/zero bs=512 count=1 of=<device> # wipe the boot sector of a device
Create a new “disk label” aka partition table
# GUID Partition Table, default on all EFI systems
parted $device mklabel gpt
# deprecated MBR (Master Boot Record) or MS-DOS partition table
parted $device mklabel msdos
Create partitions:
parted -l # list devices/partitions
parted $device print free # show free storage on device
# create a single partition using the entire device
parted -a optimal $device mkpart primary 0% 100%
parted $device rm $number # delete partition
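Putting it together, a minimal sketch to prepare a fresh device (device node is an example, adjust to your system):

dev=/dev/sdb # example device, verify with lsblk first!
parted $dev mklabel gpt # create a new GPT partition table
parted -a optimal $dev mkpart primary 0% 100% # one partition spanning the device
mkfs.ext4 ${dev}1 # initialize a file-system (cf. section below)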
POSIX
Large distributed applications are highly concurrent and often multi-tenant:
- POSIX was meant to provide a consistent view on data for (all) clients
- While storage systems are scaled, the time to sync data to guarantee consistency increases
- Applications can be dramatically slowed by bad I/O design (e.g. shared log-files, small I/O operations)
- Performance limits associated with lock contention force applications to many-file I/O patterns
Highly scalable distributed storage can no longer support the assumption that all applications look at the same view of the data. Short-term, POSIX I/O may be optimized:
- I/O needs to be engineered like computer algorithms are profiled to improve performance
- A sophisticated implementation may reduce the consistency domain to a single client
- New layers in the I/O stack like burst buffers may improve efficiency of I/O in legacy applications
Long-term I/O needs to move away from POSIX (with consistent concurrent reads and writes):
- Object-based storage requires data movement to be part of the application design
- “Lazy” data organization with directory trees suddenly disappears
- Performance optimization of I/O patterns is required for each individual storage infrastructure
Eventually a distinction between read-only, write-only and read-write data is required.
- Most of the data should be read-only and immutable (after being written once) (write-once, read-many (WORM))
- Write-only data (e.g. checkpoints) can be signed off when the writer has finished
- Data constantly in change (read-write) should live in a database
File-Systems
/proc/filesystems # list of supported file-systems
/proc/self/mountinfo # mount information
lsblk -f # list block devices with file-system type
Format a partition with a specific file-system:
mkfs.$type $partition # init fs on partition
mkfs.ext4 /dev/sdb1
File system type can have a label:
/dev/disk/by-label # list of device partitions by label
mkfs.$type -L $label ... # add a file-system label
# set the file-system label on ext{2,3,4} file-system type partition
e2label ${part:-/dev/sda1} ${label:-root}
tune2fs -L ${label:-root} ${part:-/dev/sda1}
# change the label of an exFAT formatted partition
exfatlabel ${part:-/dev/sdc1} ${label:-usb}
Multi user support with ACLs:
mnt=/mnt # mount point within the root file-system
part=/dev/sdc1 # for example, change this to your needs!
mkfs.ext4 $part # create a file-system with ACL support
tune2fs -o acl $part # enable ACLs
mount $part $mnt # mount the partition
chown $user: $mnt
chmod 777 $mnt
setfacl -m d:u::rwx,d:g::rwx,d:o::rwx $mnt
umount $mnt
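The resulting default ACL entries can be verified with `getfacl`; a quick check reusing the variables above:

mount $part $mnt # remount to inspect
getfacl $mnt # print owner, group and ACL entries
umount $mnt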
Mounts
List file systems mounted on the system:
findmnt # show tree all file systems
findmnt -l # list all file systems
findmnt -D # output like df
findmnt -s # from /etc/fstab
findmnt -S /dev/<device> # by source device
findmnt -T <path> # by mount point
findmnt -t <type>,... # by type, e.g. nfs
Mount a partition from a storage device:
sudo mount $partition $mntpoint # mount filesystem located on a device partition
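For a persistent mount across reboots add an entry to `/etc/fstab`; a sketch with placeholder UUID and mount point (use `lsblk -f` to find the UUID):

# <file system>                           <mount point> <type> <options> <dump> <pass>
UUID=0a3407de-014b-458b-b5c1-848e92a327a3 /data         ext4   defaults  0      2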
Mount a hot-plug device like a USB drive as a normal user:
sudo apt install -y pmount
pmount ${device:-/dev/sdb1} ${label:-storage}
pumount $device
The device partition is mounted below /media/$label
POSIX
POSIX I/O was designed for local storage (disks) with serial processors and workloads.
The POSIX I/O API defines how applications read/write data:
- Function calls for applications/libraries like `open()`, `close()`, `read()` and `write()`
- The POSIX semantics define what is guaranteed to happen with each API call
- E.g. `write()` is strongly consistent and guaranteed to happen before any subsequent `read()`
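These calls can be observed on any program; a quick sketch using `strace` (paths are only examples):

# trace the POSIX I/O calls issued by a simple file copy
strace -e trace=openat,read,write,close cat /etc/hostname > /tmp/hostname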
POSIX I/O is stateful:
- File descriptors are central to this process
- The persistent state of data is maintained by tracking all file descriptors
- Typically the cost of `open()` scales linearly with the number of clients making a request
POSIX I/O prescribes a specific set of metadata that all files must possess:
- Metadata includes ownership, permissions, etc.
- Each file is treated independently, recursive changes are very costly
- The POSIX metadata schema at scale is difficult to support
Typically the page cache is used to soften the latency penalty forced by POSIX consistency. Distributed storage cannot efficiently use the page cache since it is not shared among clients. Parallel file-systems may implement techniques like:
- No use of a page cache, increasing the I/O latency for small writes
- Violate (or “relax”) POSIX consistency when clients modify non-overlapping parts of a file
- Implement a distributed lock mechanism to manage concurrency
Page Cache
The page cache accelerates access to files on non-volatile storage for two reasons:
- Overcome the slow performance of permanent storage (like hard disk)
- Load data only once into RAM and share it between programs
The page cache uses free areas of memory as cache storage:
- All regular file I/O happens through the page cache
- Data not in sync with the storage marked as dirty pages
Dirty pages are periodically synchronized as soon as resources are available
- After programs write data to the page cache it is marked dirty
- The program does not block waiting for the write to be finished
- Until the sync is completed a power failure may lead to data loss
- Writes of critical data require explicit blocking until data is written
- Programs reading data typically block until the data is available
- The kernel uses read ahead to preload data in anticipation of sequential reads
The kernel frees the memory used for page cache if it is required for other applications:
free -hw # shows page cache in dedicated column
Force the Linux kernel to synchronize dirty pages with the storage:
sync # force write of dirty pages
# track the progress in writing dirty pages to storage:
watch -d grep -e Dirty: -e Writeback: /proc/meminfo
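Before an I/O benchmark it can help to start with a cold cache; a minimal sketch to evict clean pages:

sync # write dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches # drop page cache, dentries and inodes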
/proc/diskstats
I/O statistics of block devices. Each line contains the following 14 fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
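A quick sketch to pick individual fields, here completed reads and writes for one device (device name is an example):

awk '$3 == "sda" {print "reads:", $4, "writes:", $8}' /proc/diskstats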
iostat & iotop
`iostat` …I/O statistics for partitions…option `-k` prints values in kilobytes:
>>> iostat -xk 1 | awk '/sda/ {print $6,$7}'
14.36 162.23
0.00 9144.00
0.00 3028.00
...
`iotop` …list of processes/threads consuming I/O bandwidth
- In interactive mode use the arrow keys to select the column used for sorting
- `o` limits the view to active processes, and `a` accumulates the I/O counters
- Limit output with option `-Po` for active processes only
- Option `-a` accumulates I/O, `-b` enables non-interactive batch mode:
>>> iotop -bPao -u $USER
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
25722 be/4 vpenso 0.00 B 8.14 M 0.00 % 0.00 % root.exe […]
25728 be/4 vpenso 0.00 B 6.75 M 0.00 % 0.00 % root.exe […]
25750 be/4 vpenso 0.00 B 8.00 K 0.00 % 0.00 % root.exe […]
25739 be/4 vpenso 0.00 B 8.57 M 0.00 % 0.00 % root.exe […]
...
Benchmark
hdparm
- …(non-destructively) read for three seconds
- …reading through the buffer cache…without any prior caching
- …without file-system overhead
- Options for timing device reads…
  - …repeated 2-3 times on an otherwise inactive system
  - `-t` …indication of how fast the drive can sustain sequential reads
  - `-T` …indication of the throughput of the processor, cache, and memory
  - `--direct` …kernel `O_DIRECT` flag…bypasses the page cache…
dev=/dev/sda # adjust to a device node
for i in $(seq 3)
do
hdparm -tT --direct $dev
hdparm -tT $dev
done
Results with `--direct`…
Device | Type | Size (GB) | Cached Reads (MB/s) | Disk Reads (MB/s) |
---|---|---|---|---|
SAMSUNG MZQL21T9HCJR-00B7C | NVMe | 1920 | 2720.44 | 2727.02 |
INTEL SSDSC2KB480G8 | SATA | 480 | 481.49 | 589.74 |
SAMSUNG SSD 850 | SATA | 256 | 480.49 | 491.34 |
fio
Flexible IO Tester
- …developed by the maintainer of the Linux kernel’s block IO subsystem
- References…
- Package `fio*.{rpm,deb}`
- Simulate a desired I/O workload using…
  - …a job file describing a setup including…
  - …global configuration…one or more job sections
  - …`fio` parses the job file for execution
Bird's-eye view of the job file configuration…
- I/O pattern…sequential, random, mixed…
- Block size…
- I/O size…overall data read/write
- I/O engine…how the job issues I/O
- I/O depth…for `async` I/O engines
- Targets…number of files and workloads
- Threads/Processes…how many to spread the workload over
Very simple benchmark example…
# create a job file
cat > /tmp/simple.fio <<EOF
[job]
filename=/tmp/test.file
filesize=1g
readwrite=randread
bs=4k
EOF
# execute the benchmark
fio /tmp/simple.fio
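The same job can be expressed entirely on the command line; a sketch mirroring the job file above:

fio --name=job --filename=/tmp/test.file --filesize=1g --readwrite=randread --bs=4k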
Job file format…
- `--cmdhelp` lists all options
- Examples in `/usr/share/doc/fio/examples`