Lustre HPC Storage System

Client Installation & Configuration

HPC
Storage
Published

July 10, 2015

Modified

January 29, 2024

Learn about Lustre from the following resources…

Events in the Lustre community …LUG (Lustre User Group)

Packages

Available versions…

  • Long-term support (LTS) …stable release recommended for production environments
  • Feature release …ongoing development
  • Native Linux Kernel …support built into the Linux kernel (considered unstable)

Packages available…

Long-Term Support

Version Date Platform
2.15.0 2022/06 EL 8.5
2.15.1 2022/08 EL 8.6
2.15.2 2023/01 EL 8.7, EL 9.0
2.15.3 2023/06 EL 8.8, EL 9.2
2.15.4 2023/12 EL 8.9, EL 9.3

Previous LTS…

Version Date Platform
2.12.5 2020/06 EL 7.8, EL 8.2
2.12.6 2020/12 EL 7.9, EL 8.3
2.12.7 2021/07 EL 7.9, EL 8.4
2.12.8 2021/12 EL 7.9, EL 8.5
2.12.9 2022/06 EL 7.9, EL 8.6

Feature Releases

Version Date Platform
2.13 2019/12 EL 7.7, EL 8.0
2.14 2021/02 EL 8.3
2.16 2023/Q3 -
2.17 2024/Q3 -

Each version has an additional build designated with an -ib suffix, which includes the lib{ib,rdma} libraries as well as the client kernel modules built against the LTS version of {ref}mlnx_ofed. Lustre client packages used in production are added to the local repository in a sub-directory called lustre/.
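Assuming the local repository is configured on the node, the available client builds (including the -ib variants) can be listed with dnf, for example:

# list all available Lustre client packages across repositories
dnf --showduplicates list 'lustre-client*' 'kmod-lustre-client*'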

Quick Start

The following exemplifies how to install and configure the Lustre client manually:

# install the Lustre client package (a specific version can be given as lustre-client-<version>)
dnf install -y lustre-client

# LNet configuration
cat <<EOF > /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
EOF

# load the kernel modules
modprobe lustre
modinfo lustre

# create the mount point
mkdir -p /lustre/alice

# mount the file-system
mount -t lustre 10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice /lustre/alice

# remove the mount
umount /lustre/alice

# check the kernel message buffer
dmesg | grep -i -e lustre -e lnet
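Alternatively, the mount can be made persistent with an /etc/fstab entry (a sketch for the example file-system above; the Systemd mount unit described below is another option):

# example /etc/fstab entry for a persistent Lustre mount
10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice /lustre/alice lustre defaults,_netdev,flock 0 0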

Source Code

Build Lustre from source…

Build a new Lustre client on CentOS:

# install build dependencies
sudo yum install "kernel-devel-uname-r == $(uname -r)"
sudo yum install -y \
        asciidoc audit-libs-devel automake \
        bc binutils-devel bison \
        device-mapper-devel \
        elfutils-devel elfutils-libelf-devel expect \
        flex \
        gcc gcc-c++ git glib2 glib2-devel \
        hmaccalc \
        kernel-devel keyutils-libs-devel krb5-devel \
        ksh \
        libattr-devel libblkid-devel libselinux-devel libtool libuuid-devel libyaml-devel lsscsi \
        make \
        ncurses-devel net-snmp-devel net-tools newt-devel numactl-devel \
        openmpi-devel openssl-devel \
        parted patchutils pciutils-devel perl-ExtUtils-Embed pesign python-devel \
        redhat-rpm-config rpm-build \
        systemd-devel \
        tcl tcl-devel tk tk-devel \
        wget \
        xmlto \
        yum-utils \
        zlib-devel
# Download the Lustre source code
wget https://downloads.whamcloud.com/public/lustre/lustre-2.13.0/el7.7.1908/client/SRPMS/lustre-2.13.0-1.src.rpm 
# build the client, and create RPM packages
rpmbuild --rebuild --without servers lustre-2.13.0-1.src.rpm
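The resulting packages typically end up under ~/rpmbuild/RPMS/x86_64/ (depending on the rpmbuild configuration) and can be installed afterwards:

# install the freshly built client packages (paths may differ)
yum localinstall -y ~/rpmbuild/RPMS/x86_64/kmod-lustre-client-*.rpm \
                    ~/rpmbuild/RPMS/x86_64/lustre-client-*.rpm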

Compilation with Mellanox OFED distribution…

  • …described in the Mellanox documentation in section Feature Overview and Configuration - Storage Protocols
  • …for example with MLNX_OFED 5.4
# configure against the Mellanox OFED kernel sources and build RPM packages
./configure --with-o2ib=/usr/src/ofa_kernel/default/
make rpms

Kernel Modules

Determine the version of the lustre-client package installed on a node:

>>> dnf list installed | grep -e kernel-core -e lustre-client
kernel-core.x86_64         4.18.0-348.12.2.el8_5   @anaconda           
kmod-lustre-client.x86_64  2.12.8_6_g5457c37-1.el8 @gsi-packages       
lustre-client.x86_64       2.12.8_6_g5457c37-1.el8 @gsi-packages    
# if multiple versions are installed
>>> dnf --showduplicates list kmod-lustre-client | tail -n+4 | sort -k2

The Lustre client kernel module package kmod-lustre-client specifies the target Linux kernel in the package description, for example:

# show package metadata
>>> dnf info kmod-lustre-client
...
Version      : 2.12.8_6_g5457c37
...
Description  : This package provides the lustre-client kernel modules built for
             : the Linux kernel 4.18.0-348.2.1.el8_5.x86_64 for the x86_64
             : family of processors.
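To confirm that the installed module matches the running kernel, compare the module's vermagic with the running kernel release, for example:

# kernel version the module was built for ...
modinfo lustre | grep ^vermagic
# ... versus the running kernel
uname -r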

In case no matching kernel module package is available…

  • …the lustre-client-dkms package builds modules against the kernel source package
  • Note that it is not recommended to use kernel modules built by DKMS.
# use ClusterShell to identify Lustre modules on nodes...
>>> date ; clush -b -- 'modinfo lustre | grep ^version'
...
lxbk[0264-0265,0267-0276,0279-0280].gsi.de (14)
---------------
version:        2.12.7
---------------
lxbk[0261-0262].gsi.de (2)
---------------
version:        2.12.8_6_g5457c37

Versionlock

Why version-lock the Linux kernel and Lustre client?

  • Kernels and Lustre kernel modules need to be upgraded together
  • Typically there is a delay until a lustre-client package is available for a particular kernel…
  • …make sure the kernel is not upgraded until a Lustre client is available

Install the required DNF versionlock plugin…

dnf install -y python3-dnf-plugin-versionlock

The lock file would look similar to:

>>> cat >> /etc/dnf/plugins/versionlock.list <<EOF
kernel-0:4.18.0-513.*
kernel-core-0:4.18.0-513.*
kernel-modules-0:4.18.0-513.*
kernel-tools-0:4.18.0-513.*
kernel-tools-libs-0:4.18.0-513.*
kernel-headers-0:4.18.0-513.*
kernel-devel-0:4.18.0-513.*
lustre-client-0:2.15.4*
kmod-lustre-client-0:2.15.4*
EOF
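The versionlock plugin also provides sub-commands to inspect and manage the lock list, for example:

# show all active version locks
dnf versionlock list
# remove all locks (e.g. before a coordinated kernel & Lustre upgrade)
dnf versionlock clear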

Mount

I/O happens via a service called the Lustre client

  • …responsible for providing a POSIX file-system interface
  • …creates a coherent presentation of the metadata and object data
  • …file system IO is transacted over a network protocol

Lustre networking configuration…

  • …clients must have valid LNet configuration
  • …low-level device layer called a Lustre Network Driver (LND)
  • …abstraction between the upper level LNet protocol and the kernel device driver
  • ko2iblnd.ko module for RDMA networks …uses OFED …referred to as the o2ib LND
  • …continue to read about static LNet configuration
# ...example configuration for RDMA verbs
>>> cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
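After the lnet module has been loaded the resulting configuration can be verified, for example:

# show the local NIDs
lctl list_nids
# detailed view of the LNet configuration
lnetctl net show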

Read the mount.lustre manual page …use mount to start Lustre client

mount -t lustre [-o options] <mgsname>:/<fsname> <client_mountpoint>
  • <mgsname>:=<mgsnode>[:<mgsnode>]
    • …colon-separated list of mgsnode
    • …names where the MGS service may run
  • <mgsnode>@<lnd_protocol><lnd#> …LND protocol identifier and network number
    • …called an LNet Network Identifier (NID)
    • …uniquely defines an interface for a host on an LNet communications fabric
  • fsname …name of the file-system
# ...example mount...
mount -t lustre \
      -o rw,nosuid,nodev,relatime,seclabel,flock,lazystatfs \
      10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice /lustre/alice

Systemd Units

Systemd units to manage the Lustre mount point:

Unit Description
lustre-*.mount Mounts a file-system to /lustre
unload-lustre.service Forces unmount of Lustre and removes kernel modules when stopped
lustre-params.service Uses lctl to configure Lustre client options
lustre-jobstats.service Uses lctl to configure Slurm job statistics
# list all units
>>> systemctl list-units *lustre*
UNIT                    LOAD   ACTIVE SUB     DESCRIPTION
lustre-alice.mount      loaded active mounted Mount Lustre
lustre-jobstats.service loaded active exited  Enable Lustre Jobstats for SLURM Compute Node
lustre-params.service   loaded active exited  Configure Lustre Parameters
unload-lustre.service   loaded active exited  Unload lustre modules on shutdown

Unmount Lustre storage and remove the kernel modules:

systemctl stop lustre-alice.mount unload-lustre.service

# the following command should print zero if all modules have been removed...
lsmod | grep lustre | wc -l

# ...otherwise run...
lustre_rmmod

lustre-*.mount

Following is a Systemd mount unit for a Lustre file-system…

>>> systemctl cat lustre-alice.mount
# /etc/systemd/system/lustre-alice.mount
[Unit]
Description=Mount Lustre
Requires=network-online.target
Wants=systemd-networkd-wait-online.service
After=network-online.target

[Install]
WantedBy=remote-fs.target

[Mount]
What=10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice
Where=/lustre/alice
Type=lustre
Options=rw,flock,relatime,_netdev,nodev,nosuid
LazyUnmount=true
ForceUnmount=true

…note that Systemd naming conventions for mount units apply (the unit name must correspond to the mount path)
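The unit name can be derived from the mount point path, for example:

# derive the mount unit name for the example mount point
>>> systemd-escape --path --suffix=mount /lustre/alice
lustre-alice.mount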

Common Mount Options

Lustre mount options are described in man mount.lustre

  • flock …coherent userspace file locking across multiple client nodes
    • …imposes communications overhead in order to maintain locking
    • …default is noflock …applications get an ENOSYS error

General mount options are described in man mount …following may be relevant in context…

  • _netdev …signal that file-system requires network access
  • relatime …clever update of access times …reduces RPC load on Lustre
  • nodev …ignore character or block special devices
  • nosuid …ignore set-user-ID and set-group-ID

seclabel Option

Enabled SELinux (including permissive mode) may interfere with IO on Lustre…

# ...check for the seclabel mount option
>>> findmnt /lustre/alice
TARGET        SOURCE                                   FSTYPE OPTIONS
/lustre/alice 10.20.1.10@o2ib0:10.20.1.11@o2ib0:/alice lustre rw,nosuid,nodev,relatime,seclabel,flock,lazystatfs

seclabel is added by SELinux automatically …disable SELinux to prevent this

unload-lustre.service

For various reasons a clean unmount of a Lustre file-system may not work…

  • …this could stop a node from properly rebooting (forcing a reset)
  • Force unmount with umount -f to overcome this issue…
    • -a -t lustre …applies to all Lustre file-systems
    • -l (lazy) option ignores references to this filesystem (does not matter since we reboot anyway)
[Unit]
Description=Unload lustre modules on shutdown
DefaultDependencies=no
Requires=remote-fs.target
Before=remote-fs.target shutdown.target
Conflicts=shutdown.target

[Install]
WantedBy=multi-user.target

[Service]
ExecStart=/bin/echo
RemainAfterExit=yes
ExecStop=/usr/bin/umount -f -l -a -t lustre
ExecStop=/usr/sbin/lustre_rmmod
Type=oneshot

lustre_rmmod is the recommended method for unloading the Lustre and LNet kernel modules…

lustre-params.service

lctl is used to configure Lustre directly …after the file-system is mounted

Use a oneshot Systemd service unit to set Lustre configuration parameters:

[Install]
WantedBy=multi-user.target

[Unit]
Description=Configure Lustre Parameters
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount

[Service]
ExecStart=/usr/sbin/lctl set_param osc.*.max_rpcs_in_flight=64
ExecStart=/usr/sbin/lctl set_param osc.*.max_dirty_mb=32
ExecStart=/usr/sbin/lctl set_param llite.*.statahead_max=128
ExecStart=/usr/sbin/lctl set_param llite.*.statahead_agl=1
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_mb=128
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_whole_mb=64
ExecStart=/usr/sbin/lctl set_param llite.*.max_read_ahead_per_file_mb=128
# ....
RemainAfterExit=yes
Type=oneshot
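Whether the parameters are in effect can be verified after the service has started, for example:

# read back one of the configured parameters
lctl get_param osc.*.max_rpcs_in_flight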

lustre-jobstats.service

Lustre can collect I/O statistics correlated to Slurm jobs…

  • …creates overhead …use a dedicated service unit to enable/disable on demand
  • Required parameters for lctl
    • jobid_var= …names the environment variable set by the scheduler …typically SLURM_JOB_ID
    • jobid_var=disable …disable job stats
[Install]
WantedBy=multi-user.target

[Unit]
Description=Enable Lustre Jobstats for SLURM Compute Node
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount

[Service]
ExecStart=/usr/sbin/lctl set_param jobid_var=SLURM_JOB_ID
ExecStop=/usr/sbin/lctl set_param jobid_var=disable
RemainAfterExit=yes
Type=oneshot
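The currently active setting can be inspected on the client, for example:

# show the configured job ID source (SLURM_JOB_ID while the service is running)
lctl get_param jobid_var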

lustre-jobstats-proc.service

Track Slurm statistics per process name and user ID…

  • …relevant to nodes where users work interactively (i.e. submit nodes)
  • Required parameter for lctl …jobid_var=procname_uid
[Install]
WantedBy=multi-user.target

[Unit]
Description=Enable Lustre Jobstats from /proc
Documentation=man:lctl(8)
Requires=lustre.mount
After=lustre.mount

[Service]
ExecStart=/usr/sbin/lctl set_param jobid_var=procname_uid
ExecStop=/usr/sbin/lctl set_param jobid_var=disable
RemainAfterExit=yes
Type=oneshot

Configuration

lfs monitoring and configuration:

findmnt -t lustre --df                 # list Lustre file-systems with mount point
lfs help                               # list available options
lfs help <option>                      # show option specific information
lfs osts                               # list available OSTs
lfs osts | tail -n1 | cut -d: -f1      # index of the highest OST
lfs df -h [<path>]                     # storage space per OST
lfs quota -h -u $USER [<path>]         # storage quota for a user
lfs find -print -type f <path>         # find files in a directory

Identify storage topology (cf. clush.md):

# get a list of all storage servers
>>> lctl get_param osc.*.ost_conn_uuid | ip2host | cut -d= -f2 | cut -d@ -f1 | cut -d. -f1 | sort | uniq | nodeset -f NS
lxfs[415-419]
# list OSTs per storage server
>>> nodeset-loop "echo -n '{} ' ; lctl get_param osc.*.ost_conn_uuid | ip2host | grep {} | cut -d'-' -f2 | tr '\n' ' '"
lxfs415 OST001c OST001d OST001e OST001f OST0020 OST0021 OST0022
lxfs416 OST0015 OST0016 OST0017 OST0018 OST0019 OST001a OST001b
lxfs417 OST000e OST000f OST0010 OST0011 OST0012 OST0013 OST0014
lxfs418 OST0007 OST0008 OST0009 OST000a OST000b OST000c OST000d
lxfs419 OST0000 OST0001 OST0002 OST0003 OST0004 OST0005 OST0006

Striping

Split a file into small sections (stripes) and distribute these across multiple OSTs for concurrent access.

  • Advantages:
    • The file size can be bigger than the storage capacity of a single OST.
    • Enables utilizing the I/O bandwidth of multiple OSTs while accessing a single file.
  • Disadvantages:
    • Placing stripes of a file across multiple OSTs incurs management overhead (hence small files should not be striped).
    • A higher number of OSTs holding stripes of a file increases the risk of losing access as soon as a single OST is unreachable.
lfs getstripe <file|dir>                    # show striping information
lfs setstripe -c <stripe_count> <file|dir>  # configure the stripe count  
lfs setstripe -i 0x<idx> <file|dir>         # target a specific OST
  • Files inherit the striping configuration of their parent directory.
  • Stripe Count (default 1)
    • By default a single file is stored to a single OST.
    • A count of -1 stripes across all available OSTs (typically used for very big files).
  • Stripe Size (default 1MB)
    • Maximum size of the individual stripes.
    • Lustre sends data in 1MB chunks → stripe sizes are recommended to range between 1MB and 4MB (see the example below)
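A minimal sketch of configuring striping for a directory holding large files (path and values are examples):

# stripe new files in this directory across all OSTs with a 4MB stripe size
lfs setstripe -c -1 -S 4M /lustre/alice/large-files
# verify the layout of the directory
lfs getstripe /lustre/alice/large-files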

Alignment

Application I/O performance is influenced by choosing the right file size and stripe count.

Correct I/O alignment mitigates the effects of:

  • Resource contention on the OST block device.
  • Request contention on the OSS hosting multiple OSTs.

General recommendations for stripe alignment (see the example below):

  • Minimize the number of OSTs a process/task must communicate with.
  • Ensure that a process/task accesses a file at offsets corresponding to stripe boundaries.
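As a sketch, for a single shared file written by multiple tasks the stripe count can be matched to the number of writers and the stripe size to the per-task transfer size (the values below are illustrative assumptions):

# 8 writer tasks, each writing 16MB blocks at stripe-aligned offsets
lfs setstripe -c 8 -S 16M /lustre/alice/shared-output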

Quotas

Lustre enforces quotas for Linux groups and users:

  • Maximum consumable storage per group (0k equals unlimited)
  • Maximum number of files per user

Check the quota configuration using the lfs command as root on a node with mounted Lustre:

lfs quota -q -h -g $group /lustre/alice
lfs quota -h -u $user /lustre/alice
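Quota limits are set by an administrator with lfs setquota …the limits below are placeholders:

# set a 10TB block hard limit and a 1M inode hard limit for a group
lfs setquota -g $group -b 0 -B 10T -i 0 -I 1000000 /lustre/alice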

I/O

Subsystem Map

Quantitative description of application IO from the perspective of the file-system:

  1. The size of data generated
  2. The number of files generated
  3. The distribution of file sizes
  4. The distribution of file IOs (request sizes, frequency)
  5. The number of simultaneous IO accesses (level of concurrency)

IO requests and sizes:

# enable (reset) client IO statistics
>>> lctl set_param llite.*.extents_stats=1
# ... execute application ...
>>> dd if=/dev/zero of=io1.sink count=1024 bs=1M
>>> dd if=/dev/zero of=io2.sink count=1024 bs=128k
>>> dd if=/dev/zero of=io3.sink count=1024 bs=32k
# read the stats for the client
>>> lctl get_param llite.*.extents_stats
                               read       |                write
      extents            calls    % cum%  |          calls    % cum%
  32K -   64K :              0    0    0  |           1024   33   33
 128K -  256K :              0    0    0  |           1024   33   66
   1M -    2M :              0    0    0  |           1024   33  100
# read stats by process ID
>>> lctl get_param llite.*.extents_stats_per_process
                               read       |                write
      extents            calls    % cum%  |          calls    % cum%
PID: 27280
   1M -    2M :              0    0    0  |           1024  100  100
PID: 27344
 128K -  256K :              0    0    0  |           1024  100  100
PID: 27348
  32K -   64K :              0    0    0  |           1024  100  100

RPC statistics:

>>> lctl set_param osc.*.rpc_stats=0 # reset the RPC counters
# monitor IO aggregation by Lustre
>>> lctl get_param osc.*.rpc_stats
                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1024:                    0   0   0   |       1276  99 100

Features

DNE (Distributed Namespace)

Distribute file/directory metadata across multiple MDTs…

  • …circumvent bottleneck of a single MDT
  • …scale metadata load across multiple MDT servers
  • …load-balances file/directory metadata operations
  • Benefits…
    • …improves metadata performance
    • …expands the maximum number of files per system

Create directories pointing to different DNE targets (Metadata Targets, MDTs)…

# create a directory targeting MDT index 1
lfs mkdir -i 1 alice/
# similar for MDT index 2
lfs mkdir -i 2 bob/

…sub-directories and files inherit the MDT target.
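The MDT assignment of a directory can be inspected afterwards, for example:

# show the metadata layout (MDT count and index) of a directory
lfs getdirstripe alice/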

DOM (Data on MDT)

Store data of smaller files directly on an MDT (see the layout sketch below)…

  • …improve small file performance
  • …eliminate RPC overhead to OSTs
  • …utilizes MDT high-IOPS storage optimized for small IO
  • …used in conjunction with the Distributed Namespace (DNE)
  • …improve efficiency without sacrificing horizontal scale
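A hedged sketch of a composite (PFL) layout that keeps the first part of each file on the MDT (path and sizes are examples):

# first 1MB of each file on the MDT, the remainder striped across OSTs
lfs setstripe -E 1M -L mdt -E -1 -S 4M -c 4 /lustre/alice/small-files
# inspect the resulting layout
lfs getstripe /lustre/alice/small-files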

PCC (Persistent Client Cache)

  • …clients deliver additional performance…
    • …using a local storage device (SSD/NVMe) as cache
    • …reduce visible overhead for applications
      • …for read and write intensive applications (node-local I/O patterns)
      • …latencies and lock conflicts can be significantly reduced
    • …I/O stack is much simpler (no interfering I/O from other clients)
  • …caching reduces the pressure on OSTs…
    • …small or random I/Os are regularized to big sequential I/Os directed to OSTs
    • …temporary files do not need to be flushed to OSTs
  • Mechanism based on…
    • …combined HSM and layout lock mechanisms
    • …single global namespace in two physical tiers…
    • …migration of individual files between local and shared storage
    • …local file system (such as ext4) used to manage the data on local caches

Synchronization between PCC and Lustre is not tightly coupled…

  • …PCC is not transparent to the user
    • …mechanism of lfs {attach,detach} needs to be used properly
    • rm command without lfs detach loses data in PCC
  • …disk space in PCC is independent of Lustre quotas
  • …file size of PCC cached files is not visible on Lustre

Command line interface (lctl for admins, lfs for users):

# ...add a PCC backend to the Lustre client
lctl pcc add $mount_point $local_path_to_pcc [-p $params]
  • $mount_point …specified Lustre file-system instance or Lustre mount point
  • $local_path_to_pcc …directory path on local file-system for PCC cache
  • $params …name-value pairs to configure the PCC back-end
# ...attach the given files onto PCC
lfs pcc attach -i $num $file ...

# ...detach the file from PCC permanently and remove the PCC copy after detach
lfs pcc detach $file
# ...keep the PCC copy in cache
lfs pcc detach -k $file

# ...display the PCC state for given files
lfs pcc state $file
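A hedged end-to-end sketch combining the commands above (paths and the archive ID are assumptions for illustration):

# register a local NVMe file-system as PCC backend (the parameter string is an assumption)
lctl pcc add /lustre/alice /nvme/pcc -p "rwid=2"
# attach a file into the cache, check its state, and detach it again
lfs pcc attach -i 2 /lustre/alice/input.dat
lfs pcc state /lustre/alice/input.dat
lfs pcc detach /lustre/alice/input.dat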

Modes

Two modes…

PCC-RW read/write cache on local storage for a single client

  • …uses HSM mechanism for data synchronization
  • …cache entire files on their local file-systems
  • …node is an HSM agent
    • …copy tool instance …with unique archive number
    • …restores a file from the local cache to the OSTs
    • …triggered by access from another client
  • …if PCC client goes offline …cached data becomes inaccessible (temporarily)
  • Locks ensure that cache is consistent with the global file system state
  • …includes a rule-based, configurable caching infrastructure
    • …customizing I/O caching
    • …provides performance isolation
    • …QoS guarantees

PCC-RO read only cache on local storage of multiple clients

  • …LDLM lock to protect file data
  • …grouplock prevents modification by any client
  • …multiple replicas on different clients
  • …data read from local cache
  • …metadata read from MDT (with the exception of file size)


WBC (Writeback Cache)

Client-side metadata writeback cache (instead of server-side)…

  • …delayed & grouped metadata flush
    • …instead of immediate RPC to MDS
    • …no RPC round-trips for modifications of files/directories
  • …cache in volatile memory (RAM) instead of persistent storage
    • …uses bulk RPCs to flush file metadata in batches
    • …flush limited to the parts of a directory tree that were modified
  • …can be integrated with Persistent Client Cache (PCC)

Metadata flush happens…

  • …when accessed from remote clients
  • …to relieve memory pressure on local host
  • …periodically to reduce risk of data loss
