Lustre HPC Storage System
Client Installation & Configuration
Learn about Lustre from the following resources…
Events in the Lustre community …LUG (Lustre User Group)
Packages
Available versions…
- Long-term support (LTS) …stable release recommended for production environments
- Feature release …ongoing development
- Native Linux Kernel …support built into the Linux kernel (considered unstable)
Packages available…
- …from downloads.whamcloud.com …details at build.whamcloud.com
- …list of releases …roadmap …kernel support matrix
Long-Term Support
Version | Date | Platform |
---|---|---|
2.15.0 | 2022/06 | EL 8.5 |
2.15.1 | 2022/08 | EL 8.6 |
2.15.2 | 2023/01 | EL 8.7, EL 9.0 |
2.15.3 | 2023/06 | EL 8.8, EL 9.2 |
2.15.4 | 2023/12 | EL 8.9, EL 9.3 |
Previous LTS…
Version | Date | Platform |
---|---|---|
2.12.5 | 2020/06 | EL 7.8, EL 8.2 |
2.12.6 | 2020/12 | EL 7.9, EL 8.3 |
2.12.7 | 2021/07 | EL 7.9, EL 8.4 |
2.12.8 | 2021/12 | EL 7.9, EL 8.5 |
2.12.9 | 2022/06 | EL 7.9, EL 8.6 |
Feature Releases
Version | Date | Platform |
---|---|---|
2.13 | 2019/12 | EL 7.7, EL 8.0 |
2.14 | 2021/02 | EL 8.3 |
2.16 | 2023/Q3 | - |
2.17 | 2024/Q3 | - |
Each version has another build designated with an -ib suffix, which includes the lib{ib,rdma} libraries as well as the client kernel modules built against the LTS version of {ref}mlnx_ofed. Lustre client packages used in production are added to the local repository in a sub-directory called lustre/.
Quick Start
The following exemplifies how to install and configure the Lustre client manually:
# install a specific version of the client package
dnf install -y lustre-client
# LNet configuration
cat <<EOF > /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
EOF
# load the kernel modules
modprobe lustre
modinfo lustre
# create the mount point
mkdir -p /lustre/alice
# mount the file-system
mount -t lustre 10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice /lustre/alice
# remove the mount
umount /lustre/alice
# check the kernel message buffer
dmesg | grep -i -e lustre -e lnet
Source Code
Build Lustre from source…
- wiki.lustre.org …source code at whamcloud.com/public/lustre
- https://wiki.whamcloud.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel
Build a new Lustre client on CentOS:
# install build dependencies
sudo yum install "kernel-devel-uname-r == $(uname -r)"
sudo yum install -y \
asciidoc audit-libs-devel automake \
bc binutils-devel bison \
device-mapper-devel \
elfutils-devel elfutils-libelf-devel expect \
flex \
gcc gcc-c++ git glib2 glib2-devel \
hmaccalc \
kernel-devel keyutils-libs-devel krb5-devel \
ksh \
libattr-devel libblkid-devel libselinux-devel libtool libuuid-devel libyaml-devel lsscsi \
make \
ncurses-devel net-snmp-devel net-tools newt-devel numactl-devel \
openmpi-devel openssl-devel \
parted patchutils pciutils-devel perl-ExtUtils-Embed pesign python-devel \
redhat-rpm-config rpm-build \
systemd-devel \
tcl tcl-devel tk tk-devel \
wget \
xmlto \
yum-utils \
zlib-devel

# Download the Lustre source code
wget https://downloads.whamcloud.com/public/lustre/lustre-2.13.0/el7.7.1908/client/SRPMS/lustre-2.13.0-1.src.rpm
# build the client, and create RPM packages
rpmbuild --rebuild --without servers lustre-2.13.0-1.src.rpm
Compilation with Mellanox OFED distribution…
- …described in the Mellanox documentation in section Feature Overview and Configuration - Storage Protocols
- …for example in MLNX_OFED 5.4
./configure --with-o2ib=/usr/src/ofa_kernel/default/
make rpms
Kernel Modules
Determine the version of the lustre-client
package installed on a node:
>>> dnf list installed | grep -e kernel-core -e lustre-client
kernel-core.x86_64 4.18.0-348.12.2.el8_5 @anaconda
kmod-lustre-client.x86_64 2.12.8_6_g5457c37-1.el8 @gsi-packages
lustre-client.x86_64 2.12.8_6_g5457c37-1.el8 @gsi-packages
# if multiple versions are installed
>>> dnf --showduplicates list kmod-lustre-client | tail -n+4 | sort -k2
The Lustre client kernel module package kmod-lustre-client
specifies the target Linux kernel in the package description, for example:
# show package metadata
>>> dnf info kmod-lustre-client
...
Version : 2.12.8_6_g5457c37
...
Description : This package provides the lustre-client kernel modules built for
: the Linux kernel 4.18.0-348.2.1.el8_5.x86_64 for the x86_64
: family of processors.
In case no matching kernel module package is available… the lustre-client-dkms package builds the modules against the kernel source package. Note that it is not recommended to use kernel modules built by DKMS.
# use Clustershell to identify lustre modules on nodes...
>>> date ; clush -b -- 'modinfo lustre | grep ^version'
...
lxbk[0264-0265,0267-0276,0279-0280].gsi.de (14)
---------------
version: 2.12.7
---------------
lxbk[0261-0262].gsi.de (2)
---------------
version: 2.12.8_6_g5457c37
Versionlock
Why version-lock the Linux kernel and the Lustre client?
- Kernels and Lustre kernel modules need to be upgraded together
- Typically there is a delay until a lustre-client package is available for a particular kernel…
- …make sure the kernel is not upgraded until a matching Lustre client is available
Install the required DNF versionlock plugin…
dnf install -y python3-dnf-plugin-versionlock
The lock file would look similar to:
>>> cat >> /etc/dnf/plugins/versionlock.list <<EOF
kernel-0:4.18.0-513.*
kernel-core-0:4.18.0-513.*
kernel-modules-0:4.18.0-513.*
kernel-tools-0:4.18.0-513.*
kernel-tools-libs-0:4.18.0-513.*
kernel-headers-0:4.18.0-513.*
kernel-devel-0:4.18.0-513.*
lustre-client-0:2.15.4*
kmod-lustre-client-0:2.15.4*
EOF
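To verify that the locks are in place …a minimal sketch, assuming the plugin and the lock file above are installed:

# list the active version locks
dnf versionlock list
# dry run ...locked packages should be excluded from the upgrade transaction
dnf --assumeno upgrade 'kernel*' 'lustre-client*'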
Mount
I/O happens via a service called the Lustre client…
- …responsible for providing a POSIX-compliant file-system interface
- …creates a coherent presentation of the metadata and object data
- …file-system I/O is transacted over a network protocol

Lustre networking configuration…
- …clients must have a valid LNet configuration
- …low-level device layer called a Lustre Network Driver (LND)
- …abstraction between the upper-level LNet protocol and the kernel device driver
- …ko2iblnd.ko module for RDMA networks …uses OFED …referred to as the o2ib LND
- …continue to read about static LNet configuration
# ...example configuration for RDMA verbs
>>> cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
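Once the lnet module is loaded, the resulting configuration can be verified …a minimal sketch, the NID shown is illustrative:

# load LNet and bring the configured networks up
modprobe lnet
lctl network up
# print the local NIDs ...for example 10.20.1.100@o2ib0
lctl list_nids
# detailed view of the configured networks
lnetctl net show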
Read the mount.lustre manual page …use mount to start the Lustre client…
mount -t lustre [-o options] <mgsname>:/<fsname> <client_mountpoint>
<mgsname>:=<mgsnode>[:<mgsnode>]
- …colon-separated list of mgsnode names where the MGS service may run
- …each mgsnode has the form <mgsnode>@<lnd_protocol><lnd#> …LND protocol identifier and network number
- …called an LNet Network Identifier (NID)
- …uniquely defines an interface for a host on an LNet communications fabric

fsname …name of the file-system
# ...example mount...
mount -t lustre \
-o rw,nosuid,nodev,relatime,seclabel,flock,lazystatfs \
10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice /lustre/alice
Systemd Units
Systemd units to manage the Lustre mount point:
Unit | Description |
---|---|
lustre-*.mount | Mounts a file-system to /lustre |
unload-lustre.service | Forces unmount of Lustre and removes kernel modules when stopped |
lustre-params.service | Uses lctl to configure Lustre client options |
lustre-jobstats.service | Uses lctl to configure Slurm job statistics |
# list all units
>>> systemctl list-units *lustre*
UNIT LOAD ACTIVE SUB DESCRIPTION
lustre-alice.mount loaded active mounted Mount Lustre
lustre-jobstats.service loaded active exited Enable Lustre Jobstats for SLURM Compute Node
lustre-params.service loaded active exited Configure Lustre Parameters
unload-lustre.service loaded active exited Unload lustre modules on shutdown
Unmount Lustre storage and remove the kernel modules:
systemctl stop lustre-alice.mount unload-lustre.service
# the output should be zero if all modules have been removed...
lsmod | grep lustre | wc -l
# ...otherwise run...
lustre_rmmod
lustre-*.mount
The following is a Systemd mount unit for a Lustre file-system…
>>> systemctl cat lustre-alice.mount
# /etc/systemd/system/lustre-alice.mount
[Unit]
Description=Mount Lustre
Requires=network-online.target
Wants=systemd-networkd-wait-online.service
After=network-online.target
[Install]
WantedBy=remote-fs.target
[Mount]
What=10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice
Where=/lustre/alice
Type=lustre
Options=rw,flock,relatime,_netdev,nodev,nosuid
LazyUnmount=true
ForceUnmount=true
…note that Systemd naming conventions for mount units apply …the unit file name must match the escaped mount path (see the example below)
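A quick way to derive the required unit name from the mount path …a minimal sketch for the example path above:

# print the mount unit name Systemd expects for the given path
systemd-escape -p --suffix=mount /lustre/alice
# -> lustre-alice.mount
# enable the unit so the file-system is mounted at boot
systemctl enable --now lustre-alice.mount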
Common Mount Options
Lustre mount options are described in man mount.lustre
…
flock …coherent userspace file locking across multiple client nodes
- …imposes communication overhead in order to maintain locking
- …the default is noflock …applications get an ENOSYS error

General mount options are described in man mount …the following may be relevant in context…

_netdev …signals that the file-system requires network access
relatime …clever update of access times …reduces RPC load on Lustre
nodev …ignore character or block special devices
nosuid …ignore set-user-ID and set-group-ID bits
seclabel Option

SELinux, when enabled (including in permissive mode), may interfere with I/O on Lustre…
# ...check for the seclabel mount option
findmnt /lustre/alice
TARGET SOURCE FSTYPE OPTIONS
/lustre/alice 10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice lustre rw,nosuid,nodev,relatime,seclabel,flock,lazystatfs
The seclabel option is added by SELinux automatically …disable SELinux to prevent this
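Checking and disabling SELinux …a minimal sketch (permanently disabling it requires a reboot):

# show the current SELinux mode (Enforcing, Permissive or Disabled)
getenforce
# disable SELinux permanently ...takes effect after the next reboot
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config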
unload-lustre.service
Due to various reasons a clean unmount of a Lustre file-system may not work…
- …this could stop a node from properly rebooting (forcing a reset)
- Force umount -f to overcome this issue…
- …-a -t lustre …applies to all Lustre file-systems
- …-l (lazy) option ignores references to the file-system (does not matter since we reboot anyway)
[Unit]
Description=Unload lustre modules on shutdown
DefaultDependencies=no
Requires=remote-fs.target
Before=remote-fs.target shutdown.target
Conflicts=shutdown.target
[Install]
WantedBy=multi-user.target
[Service]
ExecStart=/bin/echo
RemainAfterExit=yes
ExecStop=/usr/bin/umount -f -l -a -t lustre
ExecStop=/usr/sbin/lustre_rmmod
Type=oneshot
lustre_rmmod is the recommended method for unloading the Lustre and LNet kernel modules…
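The unit has to be enabled and started once so that its ExecStop= commands run on shutdown …a minimal sketch, assuming the unit file above is installed as /etc/systemd/system/unload-lustre.service:

# pick up the new unit file
systemctl daemon-reload
# start the (otherwise inert) unit ...ExecStop runs when it is stopped at shutdown
systemctl enable --now unload-lustre.service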
lustre-params.service
lctl is used to directly configure Lustre …after the file-system is mounted

Use a oneshot Systemd service unit to set the Lustre configuration parameters:
[Install]
WantedBy=multi-user.target
[Unit]
Description=Configure Lustre Parameters
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount
[Service]
ExecStart=/usr/sbin/lctl set_param osc.*.max_rpcs_in_flight=64
ExecStart=/usr/sbin/lctl set_param osc.*.max_dirty_mb=32
ExecStart=/usr/sbin/lctl set_param llite.*.statahead_max=128
ExecStart=/usr/sbin/lctl set_param llite.*.statahead_agl=1
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_mb=128
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_whole_mb=64
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_per_file_mb=128
# ....
RemainAfterExit=yes
Type=oneshot
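The applied values can be checked at any time with lctl get_param …a minimal sketch for two of the tunables above:

# show the currently active values
lctl get_param osc.*.max_rpcs_in_flight
lctl get_param llite.*.max_read_ahead_mb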
lustre-jobstats.service
Lustre can collect I/O statistics correlated to Slurm jobs…
- …creates overhead …use a dedicated service unit to enable/disable on demand
- Required parameters for lctl…
  - jobid_var= …names the environment variable set by the scheduler …typically SLURM_JOB_ID
  - jobid_var=disable …disables job stats
[Install]
WantedBy=multi-user.target
[Unit]
Description=Enable Lustre Jobstats for SLURM Compute Node
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount
[Service]
ExecStart=/usr/sbin/lctl set_param jobid_var=SLURM_JOB_ID
ExecStop=/usr/sbin/lctl set_param jobid_var=disable
RemainAfterExit=yes
Type=oneshot
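Whether job statistics are currently enabled on a client can be checked via the same parameter …a minimal sketch:

# prints SLURM_JOB_ID while the service is active ...disable otherwise
lctl get_param jobid_var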
lustre-jobstats-proc.service
Track I/O statistics per process name and user ID…
- …relevant to nodes where users work interactively (i.e. submit nodes)
- Required parameter for lctl …jobid_var=procname_uid
[Install]
WantedBy=multi-user.target
[Unit]
Description=Enable Lustre Jobstats from /proc
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount
[Service]
ExecStart=/usr/sbin/lctl set_param jobid_var=procname_uid
ExecStop=/usr/sbin/lctl set_param jobid_var=disable
RemainAfterExit=yes
Type=oneshot
Configuration
lfs monitoring and configuration:
findmnt -t lustre --df # list Lustre file-systems with mount point
lfs help # list available options
lfs help <option> # show option specific information
lfs osts                                # list available OSTs
lfs osts | tail -n1 | cut -d: -f1 # number of OSTs
lfs df -h [<path>] # storage space per OST
lfs quota -h -u $USER [<path>] # storage quota for a user
lfs find -print -type f <path> # find files in a directory
Identify storage topology (cf. clush.md):
# get a list of all storage servers
>>> lctl get_param osc.*.ost_conn_uuid | ip2host | cut -d= -f2 | cut -d@ -f1 | cut -d. -f1 | sort | uniq | nodeset -f NS
lxfs[415-419]
# list OSTs per storage server
>>> nodeset-loop "echo -n '{} ' ; lctl get_param osc.*.ost_conn_uuid | ip2host | grep {} | cut -d'-' -f2 | tr '\n' ' '"
lxfs415 OST001c OST001d OST001e OST001f OST0020 OST0021 OST0022
lxfs416 OST0015 OST0016 OST0017 OST0018 OST0019 OST001a OST001b
lxfs417 OST000e OST000f OST0010 OST0011 OST0012 OST0013 OST0014
lxfs418 OST0007 OST0008 OST0009 OST000a OST000b OST000c OST000d
lxfs419 OST0000 OST0001 OST0002 OST0003 OST0004 OST0005 OST0006
Striping
Split a file into small sections (stripes) and distribute these for concurrent access to multiple OSTs.
- Advantages:
  - The file size can be bigger than the storage capacity of a single OST.
  - Enables utilizing the I/O bandwidth of multiple OSTs while accessing a single file.
- Disadvantages:
  - Placing stripes of a file across multiple OSTs incurs management overhead (hence small files should not be striped).
  - A higher number of OSTs holding stripes of a file increases the risk of losing access as soon as a single OST is unreachable.
lfs getstripe <file|dir> # show striping information
lfs setstripe -c <stripe_count> <file|dir> # configure the stripe count
lfs setstripe -i 0x<idx> <file|dir> # target a specific OST
- Files inherit the striping configuration of their parent directory.
- Stripe Count (default 1)
  - By default a single file is stored on a single OST.
  - A count of -1 stripes across all available OSTs (typically used for very big files).
- Stripe Size (default 1MB)
  - Maximum size of the individual stripes.
  - Lustre sends data in 1MB chunks → stripe sizes are recommended to range between 1MB and 4MB (see the example below).
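A minimal sketch of a typical striping setup …the directory paths and values are illustrative:

# directory for large files ...stripe across all OSTs with a 4MB stripe size
lfs setstripe -c -1 -S 4M /lustre/alice/large
# directory for small files ...keep the default single stripe
lfs setstripe -c 1 /lustre/alice/small
# verify the layout
lfs getstripe /lustre/alice/large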
Alignment
Application I/O performance is influenced by choosing the right file size and stripe count.
Correct I/O alignment mitigates the effects of:
- Resource contention on the OST block device.
- Request contention on the OSS hosting multiple OSTs.
General recommendations for stripe alignment:
- Minimize the number of OSTs a process/task must communicate with.
- Ensure that a process/task accesses a file at offsets corresponding to stripe boundaries.
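A simple illustration of stripe-aligned access, assuming a 1MB stripe size …each request starts at a multiple of the stripe size and therefore never spans two OSTs:

# write 1MB blocks starting exactly at a stripe boundary (seek is in units of bs)
dd if=/dev/zero of=/lustre/alice/aligned.dat bs=1M seek=4096 count=1024 conv=notrunc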
Quotas
Lustre enforces quotas for Linux groups and users:
- Maximum consumable storage per group (0k equals unlimited)
- Maximum number of files per user
Check the quota configuration using the lfs
command as root on a node with mounted Lustre:
lfs quota -q -h -g $group /lustre/alice
lfs quota -h -u $user /lustre/alice
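Limits are set by the administrator with lfs setquota …a minimal sketch, the limit values are illustrative:

# limit a group to 10TB of data (9TB soft limit)
lfs setquota -g $group -b 9T -B 10T /lustre/alice
# limit a user to 1 million files (900k soft limit)
lfs setquota -u $user -i 900000 -I 1000000 /lustre/alice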
I/O
Quantitative description of application IO from the perspective of the file-system:
- The size of data generated
- The number of files generated
- The distribution of file sizes
- The distributions of file IOs (requests sizes, frequency)
- The number of simultaneous I/O accesses (level of concurrency)
I/O requests and request sizes:
# enable (reset) client IO statistics
>>> lctl set_param llite.*.extents_stats=1
# ... execute application ...
>>> dd if=/dev/zero of=io1.sink count=1024 bs=1M
>>> dd if=/dev/zero of=io2.sink count=1024 bs=128k
>>> dd if=/dev/zero of=io3.sink count=1024 bs=32k
# read the stats for the client
>>> lctl get_param llite.*.extents_stats
read | write
extents calls % cum% | calls % cum%
32K - 64K : 0 0 0 | 1024 33 33
128K - 256K : 0 0 0 | 1024 33 66
1M - 2M : 0 0 0 | 1024 33 100
# read stats by process ID
>>> lctl get_param llite.*.extents_stats_per_process
read | write
extents calls % cum% | calls % cum%
PID: 27280
1M - 2M : 0 0 0 | 1024 100 100
PID: 27344
128K - 256K : 0 0 0 | 1024 100 100
PID: 27348
32K - 64K : 0 0 0 | 1024 100 100
RPC statistics:
>>> lctl set_param osc.*.rpc_stats=0 # reset the RPC counters
# monitor IO aggregation by Lustre
>>> lctl get_param osc.*.rpc_stats
read write
pages per rpc rpcs % cum % | rpcs % cum %
1024: 0 0 0 | 1276 99 100
Features
DNE (Distributed Namespace)
Distribute file/directory metadata across multiple MDTs…
- …circumvent bottleneck of a single MDT
- …scale metadata load across multiple MDT servers
- …load-balances file/directory metadata operations
- Benefits…
- …improves metadata performance
- …expands the maximum number of files per system
Creating directories to point to different DNE targets (Metadata Targets)…
# create a directory targeting MDT index 1
lfs mkdir -i 1 alice/
# similar for MDT index 2
lfs mkdir -i 2 bob/
…sub-directories and files inherit the MDT target (see the check below).
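Which MDT a directory was created on can be checked with lfs getdirstripe …a minimal sketch:

# show the MDT layout of a directory (master MDT index and stripe count)
lfs getdirstripe alice/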
DOM (Data on MDT)
Store data of smaller files directly on an MDT…
- …improve small file performance
- …eliminate RPC overhead to OSTs
- …utilizes MDT high-IOPS storage optimized for small IO
- …used in conjunction with the Distributed Namespace (DNE)
- …improve efficiency without sacrificing horizontal scale
- References…
- Data on MDT Solution Architecture, Lustre Wiki
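DOM is configured per file or directory with a composite layout whose first component is placed on the MDT …a minimal sketch, the directory path and the 1MB threshold are illustrative:

# keep the first 1MB of each new file on the MDT ...the rest goes to a single OST stripe
lfs setstripe -E 1M -L mdt -E -1 -c 1 /lustre/alice/small-files
# verify the composite layout
lfs getstripe /lustre/alice/small-files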
PCC (Persistent Client Cache)
- …clients deliver additional performance…
- …using a local storage device (SSD/NVMe) as cache
- …reduce visible overhead for applications
- …for read and write intensive applications (node-local I/O patterns)
- …latencies and lock conflicts can be significantly reduced
- …I/O stack is much simpler (no interfering I/O from other clients)
- …caching reduces the pressure on (OSTs)…
- …small or random I/Os are regularized to big sequential I/Os directed to OSTs
- …temporary files do not need to be flushed to OSTs
- Mechanism based on…
- …combined HSM and layout lock mechanisms
- …single global namespace in two physical tiers…
- …migration of individual files between local and shared storage
- …local file system (such as ext4) used to manage the data on local caches
Synchronization between PCC and Lustre not tightly coupled…
- …PCC is not transparent to the user
  - …the mechanism of lfs {attach,detach} needs to be used properly
  - …the rm command without lfs detach loses the data in PCC
- …disk space in PCC is independent of Lustre quotas
- …the file size of PCC cached files is not visible on Lustre
Command line interface (lctl for admins, lfs for users):
# ...add a PCC backend to the Lustre client
lctl pcc add $mount_point $local_path_to_pcc [-p $params]

$mount_point …specified Lustre file-system instance or Lustre mount point
$local_path_to_pcc …directory path on the local file-system for the PCC cache
$params …name-value pairs to configure the PCC back-end
# ...attach the given files onto PCC
lfs pcc attach -i $num $file ...
# ...detach the file from PCC permanently and remove the PCC copy after detach
lfs pcc detach $file
# ...keep the PCC copy in cache
lfs pcc detach -k $file
# ...display the PCC state for given files
lfs pcc state $file
Modes
Two modes…
PCC-RW read/write cache on local storage for a single client
- …uses HSM mechanism for data synchronization
- …caches entire files on the local file-system
- …the node is an HSM agent
- …copy tool instance …with a unique archive number
- …restores a file from the local cache to the OSTs
- …triggered by access from another client
- …if the PCC client goes offline …cached data becomes inaccessible (temporarily)
- Locks ensure that the cache is consistent with the global file-system state
- …includes a rule-based, configurable caching infrastructure
- …customizing I/O caching
- …provides performance isolation
- …QoS guarantees
PCC-RO read only cache on local storage of multiple clients
- …LDLM lock to protect file data
- …a group lock prevents modification by any client
- …multiple replicas on different clients
- …data is read from the local cache
- …metadata is read from the MDT (with the exception of the file size)
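An end-to-end sketch of a PCC-RW cache …the backend path and the archive ID (rwid) are illustrative and depend on the local setup:

# register a local flash file-system as PCC backend with archive ID 2
lctl pcc add /lustre/alice /mnt/pcc -p "rwid=2"
# attach a file into the cache and check its state
lfs pcc attach -i 2 /lustre/alice/data.h5
lfs pcc state /lustre/alice/data.h5
# detach ...the content is flushed back to the OSTs
lfs pcc detach /lustre/alice/data.h5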
References
- Lustre Manual - Chapter 27. Persistent Client Cache (PCC)
- Lustre Persistent Client Cache, Whamcloud
- LU-10092, Whamcloud Jira
- A client side cache that speeds up applications with certain I/O patterns, Li Xi DDN Storage
- LUG 2018 Presentation, OpenSFS Administration Youtube
- Slurm burst buffer plugin with Lustre PCC (Persistent Client Cache)
WBC (Writeback Cache)
Client-side metadata writeback cache (instead of server-side)…
- …delayed & grouped metadata flush
- …instead of immediate RPC to MDS
- …no RPC round-trips for modifications of files/directories
- …cache in volatile memory (RAM) instead of persistent storage
- …uses bulk RPCs to flush file metadata in batches
- …flushes are limited to the parts of the directory tree that were modified
- …can be integrated with Persistent Client Cache (PCC)
Metadata flush happens…
- …when accessed from remote clients
- …to relieve memory pressure on local host
- …periodically to reduce risk of data loss
References…