Ceph — Distributed Fault-tolerant Storage

Storage
Published

July 10, 2015

Modified

April 11, 2025

Overview

Ceph (pronounced /ˈsɛf/) …a bird's-eye view…

  • …open source (LGPL) …website ceph.io
  • …provides object, block, and file storage (POSIX)
  • …completely distributed operation
    • …no single point of failure (highly available)
    • …replicates data with fault tolerance
    • …data durability …replication, erasure coding, snapshots and clones
    • …self-healing and self-managing (minimal administration time/cost)
  • …stable, named releases every 12 months…
    • …backports for 2 releases …upgrade up to 2 releases at a time
    • …complete release history available on Wikipedia
    • …18 major releases …since 2012/06

Governance

Governance (as of 2023)…

  • …public telemetry data …2.5k clusters …>1EB storage
  • …Ceph governance model
    • …executive council …elected by steering committee
    • …steering committee …20+ members (predominantly Red Hat, SUSE, Intel)
    • …leadership team …group effort, shared leadership, open meetings
  • …non-profit Ceph Foundation …30+ organisations
    • …neutral home for project stakeholders
    • …organized as a directed fund under the Linux Foundation

Use-Cases

Storage strategy…

  • …store data that serves a particular use case, for example
    • …store block-devices for OpenStack
    • …object data for a RESTful S3-compliant or Swift-compliant gateway
    • …container-based storage for Kubernetes
    • …providing a POSIX file-system
  • …influences storage media selection
    • …requires trade-off between performance vs. costs
    • …faster vs. bigger vs. higher durability
    • …for example…
      • …SSD-backed OSDs for a high performance pool
      • …SAS drive/SSD journal-backed OSDs …high-performance block device volumes and images
      • …SATA-backed OSDs for low cost storage

Note: Ceph’s object copies or coding chunks handle durability …making RAID obsolete

A storage strategy requires assigning Ceph OSDs to a CRUSH hierarchy

  • …select appropriate storage media for a use-case …identify appropriate OSDs
  • …define CRUSH hierarchy …select OSDs storing placement groups
  • …define CRUSH rule set …calculate placement groups (shards)
  • …create pools …select replicated or erasure-coded storage …durability (example below)
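
For example, a replicated pool and an erasure-coded pool might be created like this (pool names, PG counts, and the EC profile are placeholders):

# replicated pool with 3 copies (default CRUSH rule)
ceph osd pool create rbd-fast 128 128 replicated
ceph osd pool set rbd-fast size 3

# erasure-coded pool using a custom profile (4 data + 2 coding chunks)
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create object-cold 128 128 erasure ec-4-2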

Hardware

Identifying a price-to-performance profile suitable for targeted workload…

  • IOPS optimized
    • …block storage …for example database instances as virtual machines on OpenStack
    • …lowest cost per IOPS …highest IOPS per GB …99th percentile latency consistency
    • …SSD storage (more expensive) …improves IOPS & total throughput
    • …15k RPM SAS drives …separate SSD journals (handle frequent writes)
    • …typically 3x replication for HDD …2x replication for SSD
  • Throughput optimized
    • …fast data storage …block or object storage
    • …HDD with acceptable total throughput characteristics
    • …lowest cost per MB …highest MB/s per TB/BTU/Watt …97th percentile latency consistency
    • …SSD journals (more expensive) …improve write performance
    • …typically 3x replication
  • Capacity optimized
    • …(inexpensive) big data storage …slower and less expensive SATA drives
    • …lowest cost/BTU/Watt per TB
    • …erasure coding common for maximizing usable capacity
    • …co-locate journals …cheaper than SSD journals

Multiple performance domains can coexist in a Ceph storage cluster

OSD hardware within a pool should be identical

  • …avoid dissimilar hardware in the same pool!
  • Same…
    • …controller, drive size, RPMs, seek times, I/O bandwidth
    • …network throughput, journal configuration
  • …simplifies provisioning …streamlines troubleshooting

Network requires sufficient throughput during backfill and recovery operations

  • …consider storage density when planning for rebalancing after hardware faults
  • …prefer that a storage cluster recovers as quickly as possible
  • …avoid IOPS starvation due to network latency …be mindful of network link over-subscription
  • …segregate intra-cluster traffic from client-to-cluster traffic
    • …public network …client traffic …communication to Ceph Monitors
    • …storage network …Ceph OSD …heartbeats, replication, backfilling, and recovery traffic
    • …(public and storage cluster networks on separate network cards)
  • …performance-related problems in Ceph usually begin with a networking issue
    • …jumbo frames for a better CPU per bandwidth ratio
    • …non-blocking network switch back-plane
    • …same MTU throughout all networking devices (both public and storage networks)
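
For example, a 9000-byte MTU can be verified end-to-end with non-fragmenting pings (interface and host names assumed):

ip link show eth0 | grep mtu                         # configured MTU on the local interface
ping -M do -s 8972 -c 3 osd-node01                   # 9000 bytes minus 28 bytes IP/ICMP overhead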

Data Durability

Ceph is strongly consistent…

  • …data spread across availability zones (AZs), racks, nodes, disks
  • …replication configurable to failure domains
  • …even in extreme disasters, data can be recovered manually
  • Erasure coding support…
    • …commonly used for object storage
    • …works on block storage & shared file-systems (see sketch below)
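
A minimal sketch of an erasure-coded pool usable by RBD (pool and image names assumed); partial writes must be enabled explicitly:

ceph osd pool create ec-pool 64 64 erasure
ceph osd pool set ec-pool allow_ec_overwrites true   # required for RBD & CephFS on EC pools
rbd create --size 1G --data-pool ec-pool rbd/disk01  # image metadata in a replicated pool, data in the EC pool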

Architecture

Ceph deployment consists of three types of daemons…

  • Ceph OSD Daemon (OSD)
    • …stores data …performs data replication, erasure coding
    • …rebalancing, recovery, monitoring and reporting functions
    • …store all data as objects in a flat namespace (no directory hierarchies)
    • …objects have a cluster-wide unique identifier, binary data, and metadata
  • Ceph Monitor (MON) — Master copy of cluster map
  • Ceph Manager
    • …information about placement groups
    • …process metadata and host metadata
    • …RESTful monitoring APIs
  • Ceph Client — Read/write data from the OSDs
    • …retrieve the latest cluster map from MONs
    • …compute the object's placement group and the primary OSD for data access
    • …no intermediary server, broker or bus between the client and the OSD
    • …clients define the semantics for the data format (i.e. block device)

CRUSH Algorithm

CRUSH (Controlled Replication Under Scalable Hashing)

  • …track the location of storage objects
  • …no central authority/lookup table (distributed hashing)
  • …efficiently compute information about object location
  • …enables massive scale …data location algorithmically computed on clients (example below)
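
For example, any node with a client keyring can compute an object's location without a lookup (pool and object names are placeholders):

ceph osd map mypool myobject                         # computed, not looked up; prints something like:
# osdmap e20 pool 'mypool' (2) object 'myobject' -> pg 2.5f3a (2.3a) -> up ([0,3,7], p0) acting ([0,3,7], p0)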

CRUSH maps

  • …describe the topology of cluster resources
  • …implemented as a simple hierarchy (acyclic graph) and a ruleset
  • …the number of placement groups, and the pool interface
  • …map exists both on client nodes as well as Ceph Monitor (MON)
    • …MONs host master copy of the CRUSH maps
    • …MONs not involved in data I/O path

CRUSH Ruleset

  • …Ceph uses the CRUSH map to implement failure domains
    • …and performance domains for the storage media
    • …supports multiple hierarchies to separate one hardware performance profile from another (example below)
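
A sketch of separating performance domains with CRUSH rules bound to device classes (rule and pool names assumed; recent releases use the crush_rule pool key):

ceph osd crush rule create-replicated fast default host ssd
ceph osd crush rule create-replicated slow default host hdd
ceph osd pool set rbd-fast crush_rule fast           # bind a pool to the SSD rule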

Pools

Pools — Logical partitions …store RADOS objects

  • …administrators can create pools for particular types of data
  • Pool type
    • …erasure coding …data durability method is pool-wide
    • …completely transparent to the client
  • …contain a defined number of placement groups (PGs)
    • …sharding a pool into placement groups
    • …with a configured replication level
  • Attributes for ownership/access
  • Balance number of PGs per pool with the number of PGs per OSD
    • 50-200 PGs per OSD to balance out memory and CPU requirements and per-OSD load
    • Total PGs = (OSDs × (50–200)) / number of replicas
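
For example, 40 OSDs with 3× replication and a target of 100 PGs per OSD: (40 × 100) / 3 ≈ 1333, rounded up to the next power of two → 2048 total PGs.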
ceph df                                              # show usage
ceph osd lspools                                     # list pool numbers
ceph osd pool ls detail                              # list pools with additional information
ceph osd dump | grep ^pool                           # list replication size
ceph osd pool create <name> <pg-num> <pgp-num>       # create a pool
ceph osd pool get <pool> <key>                       # read configuration attribute
ceph osd pool set <pool> <key> <value>               # set configuration attribute

Keys:

  • size — number of replica objects
  • min_size — minimum number of replicas available for I/O
  • pg_num, pgp_num — (effective) number of PGs to use when calculating data placement
  • crush_ruleset — CRUSH rule to use for mapping object placement in the cluster

States:

  • active+clean – optimum state
  • degraded – not enough replicas to meet requirement
  • down – no OSD available storing PG
  • inconsistent – PG not consistent across different OSDs
  • repair – correcting inconsistent PGs to meet replication requirements

Delete a pool

# pool name twice
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
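
Recent releases refuse pool deletion unless the monitors permit it; a sketch:

ceph config set mon mon_allow_pool_delete true       # lift the safety guard
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false      # ...and restore it afterwards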

Monitor Servers (MONs)

Datastore for the health of the entire cluster…

  • …hosts master copy of the CRUSH, OSD & MON maps
  • minimum of three MONs for a cluster quorum in production
    • …there must be an odd number >=3 to provide consensus…
    • …for the use of distributed decision-making with the Paxos algorithm
    • …a non-quorate cluster becomes unavailable
  • …the OSD map reflects the OSD daemons operating in the cluster
systemctl status ceph-mon@$(hostname)                 # daemon state       
ceph-mon -d -i $(hostname)                            # run daemon in foreground
/var/log/ceph/ceph-mon.$(hostname).log                # default log location
ceph mon stat                                         # state of all monitors
ceph -f json-pretty quorum_status                     # quorum information
ceph-conf --name mon.$(hostname) --show-config-value log_file
                                                      # get location of the log file
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
                                                      # access a monitor's admin socket

Manipulate the MON map:

monmap=/tmp/monmap.bin
ceph mon getmap -o $monmap                            # print the MON map if a quorum exists
ceph-mon --extract-monmap $monmap -i $(hostname)      # extract the MON map while ceph-mon is stopped
monmaptool $monmap --print                            # print MON map from file
monmaptool $monmap --clobber --rm lxmon03             # remove a monitor
ceph-mon --inject-monmap $monmap -i $(hostname)       # store new map

Object Storage Devices (OSDs)

Node hosting physical storage:

  • ceph-osd is the object storage daemon, responsible for storing objects on a local file system and providing access to them over the network
  • Requires a storage device accessible via a supported file-system (xfs, btrfs) with extended attributes
  • OSD daemons are numerically identified: osd.<id>, and have following states in/out of the cluster, and up/down
  • Primary OSDs are responsible for replication, coherency, re-balancing and recovery
  • Secondary OSDs are under control of a primary OSD, and can become primary
  • Small, random I/O written sequentially to the local OSD journal (improve performance with SSD)
    • Enables atomic updates to objects, and replay of the journal on restart
    • Write ACK when minimum replica journals written
    • Journal sync to file-system periodically to reclaim space
/var/log/ceph/*.log                                  # logfiles on storage servers    
ceph osd stat                                        # show state 
ceph osd df                                          # storage utilization by OSD
ceph node ls osd                                     # list OSDs
ceph osd dump                                        # show OSD map
# OSD ID to host IP, port mapping
ceph osd dump | grep '^osd.[0-9]*' | tr -s ' ' | cut -d' ' -f1,2,14
ceph osd tree                                        # OSD/CRUSH map as tree
ceph osd getmaxosd                                   # show number of available OSDs
ceph osd map <pool> <objname>                        # identify object location
/var/run/ceph/*                                      # admin socket
ceph daemon <socket> <command>                       # use the admin socket
ceph daemon <socket> status                          # show identity
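
For example, inspecting a running OSD through its admin socket (OSD id assumed):

ceph daemon osd.0 config show                        # runtime configuration
ceph daemon osd.0 perf dump                          # internal performance counters
ceph daemon osd.0 dump_ops_in_flight                 # operations currently executing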

Manipulate the CRUSH map:

ceph osd getcrushmap -o /tmp/crushmap.bin                    # extract the CRUSH map
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt          # decompile binary CRUSH map
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.bin          # compile CRUSH map
ceph osd setcrushmap -i /tmp/crushmap.bin                    # inject CRUSH map
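
A fragment of a decompiled CRUSH map might look like this (illustrative; names and ids depend on the cluster):

rule replicated_rule {
    id 0
    type replicated
    step take default                                # start at the root of the hierarchy
    step chooseleaf firstn 0 type host               # place each replica on a distinct host
    step emit
}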

Remove an OSD from production:

ceph osd out <num>                                   # remove OSD, begin rebalancing
systemctl stop ceph-osd@<num>                        # stop the OSD daemon on the server 
ceph osd crush remove osd.<num>                      # remove the OSD from the CRUSH map
ceph auth del osd.<num>                              # remove authorization credential
ceph osd rm osd.<num>                                # remove OSD from configuration
# remove the entry from ceph.conf if present

Placement Groups (PGs)

Logical collection of objects:

  • CRUSH assigns objects to placement groups
  • CRUSH assigns a placement group to a primary OSD
  • The primary OSD uses CRUSH to replicate the PGs to the secondary OSDs
  • Re-balance/recover works on all objects in a placement group
  • Placement Group IDs (pgid) <pool_num>.<pg_id> (hexadecimal)
ceph pg stat                                         # show state
ceph pg ls-by-osd $num                               # list PGs stored on an OSD
ceph pg map $pgid                                    # map a placement group
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded
                                                     # statistics for stuck PGs
ceph pg $pgid query                                  # statistics for a particular placement group
ceph pg scrub $pgid                                  # check primary and replicas
osdmap e20 pg 2.f22 (2.22) -> up [0,1,2] acting [0,1,2]
       │      │                  │              └── acting set of OSDs responsible for a particular PG
       │      │                  └───────────────── matches "acting set" unless rebalancing is in progress
       │      └──────────────────────────────────── placement group number
       └─────────────────────────────────────────── monotonically increasing OSD map version number

Recommended Number of PGs pg_num in relation to the number of OSDs

OSDs       PGs
<5         128
5~10       512
10~50      4096
>50        50-100 per OSD
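
Newer releases can size pg_num automatically with the PG autoscaler instead of this table; a sketch (pool name assumed):

ceph mgr module enable pg_autoscaler                 # Nautilus and later
ceph osd pool set rbd-fast pg_autoscale_mode on
ceph osd pool autoscale-status                       # review recommendations/actions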

Deployment

The recommended deployment method is cephadm

Packages

RPM packages…

# ...package provided by the Fedora community
dnf -y install cephadm

# ...packages provided by the Ceph community
curl --silent --remote-name --location https://download.ceph.com/rpm-18.2.0/el9/noarch/cephadm
chmod +x cephadm ; ./cephadm add-repo --release reef ; ./cephadm install

…alternatively add the package repository manually

sudo dnf install -y epel-release
# ...for Enterprise Linux 9
sudo rpm --import 'https://download.ceph.com/keys/release.asc'
sudo dnf install -y https://download.ceph.com/rpm-18.2.0/el9/noarch/ceph-release-1-1.el9.noarch.rpm
dnf makecache ; sudo dnf install -y cephadm

cephadm

Utility that is used to manage a Ceph cluster

  • …manages the full lifecycle of a Ceph cluster
  • …does not rely on external configuration tools (like Ansible)
  • …builds on the orchestrator API
  • …uses SSH connections to interact with hosts
  • …deploys services using standard container images
  • …bootstrap sequence is started from the command line on the first host of the cluster (example below)
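
A minimal bootstrap on the first host might look like this (IP addresses and host names are placeholders):

cephadm bootstrap --mon-ip 192.168.1.10              # first MON/MGR on this host
ceph orch host add node2 192.168.1.11                # ...add further hosts over SSH
ceph orch apply osd --all-available-devices          # consume all empty disks as OSDs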

Operations

ceph status                                          # summary of state
ceph -s
ceph health detail
ceph -w                                              # watch mode
ceph-deploy --overwrite-conf config push <node>      # update the configuration after changes

Authentication

cephx authentication protocol (…similar to Kerberos)

  • …MONs can authenticate users and distribute keys
    • …no single point of failure or bottleneck
    • …returns a session key (ticket) used to obtain Ceph services
  • …Ceph monitors and OSDs share a secret

Administrators must set up users…

# lists authentication state
ceph auth list

# create a new user 
ceph auth get-or-create client.<name> <caps>
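
For example, a client limited to RBD images in a single pool (client and pool names assumed):

ceph auth get-or-create client.qemu mon 'profile rbd' osd 'profile rbd pool=libvirt'
ceph auth get client.qemu                            # inspect the generated keyring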

CephFS

CephFS — POSIX-compliant distributed file-system…

  • …with Linux kernel client & FUSE client
  • …the RADOS store hosts metadata & data pools
  • MDSs (Metadata Server Daemons) — Coordinate clients’ data access (see below)
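
A file-system needs a metadata pool, a data pool, and at least one MDS; a sketch (names and PG counts assumed):

ceph osd pool create cephfs_metadata 32
ceph osd pool create cephfs_data 64
ceph fs new myfs cephfs_metadata cephfs_data         # metadata pool comes first
ceph fs volume create myfs                           # ...or in one step (incl. MDS) on cephadm clusters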

Operate on the CephFS file-systems in your Ceph cluster:

# list all file-system names
ceph fs ls

# destroy a file-system
ceph fs rm $name --yes-i-really-mean-it
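
Mount with the kernel client or FUSE (monitor address, user, and mount point assumed):

mount -t ceph 10.1.1.22:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse -n client.admin /mnt/cephfs                # ...alternatively via FUSE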

Clients

Clients are distributed in the ceph-common package

/etc/ceph/ceph.conf                                  # default location for configuration
/etc/ceph/ceph.client.admin.keyring                  # default location for admin key 

RADOS

  • Objects store data and have: a name, the payload (data), and attributes
  • Object namespace is flat
rados lspools                                        # list pools
rados df                                             # show pool statistics
rados mkpool <pool>                                  # create a new pool     
rados rmpool <pool>                                  # delete a pool
rados ls -p <pool>                                   # list contents of pool
rados -p <pool> put <objname> <file>                 # store object
rados -p <pool> rm <objname>                         # delete object

RBD

Block interface on top of RADOS:

  • Images (single block devices) striped over multiple RADOS objects/OSDs
  • The default pool is called rbd
modprobe rbd                                         # load the kernel module
modinfo rbd                                          # kernel module details
/sys/bus/rbd/                                        # live module data
rbd ls [-l] [<pool>]                                 # list block devices
rbd info [<pool>/]<name>                             # inspect image
rbd du [<name>]                                      # list size of images
rbd create --size <MB> [<pool>/]<name>               # create image
rbd map <name>                                       # map image to device
rbd showmapped                                       # list device mappings
rbd unmap <name>                                     # unmap image from device
rbd rm [<pool>/]<name>                               # remove image

Snapshots

  • Read-only copy of the state of an image at a particular point in time
  • Layering allows for cloning snapshots
  • Stop I/O before taking a snapshot (internal file-systems must be in a consistent state)
  • Rolling back overwrites the current version of the image
  • It is faster to clone from a snapshot than to roll back to a pre-existing state
rbd snap create <pool>/<image>@<name>                # snapshot image
rbd snap ls <pool>/<image>                           # list snapshots of image
rbd snap rollback <pool>/<image>@<name>              # rollback to snapshot
rbd snap rm <pool>/<image>@<name>                    # delete snapshot
rbd snap purge <pool>/<image>                        # delete all snapshots
rbd snap protect <pool>/<image>@<name>               # protect snapshot before cloning
rbd snap unprotect <pool>/<image>@<name>             # revert protection
rbd clone <pool>/<image>@<name> <pool>/<image>       # clone snapshot to new image
rbd children <pool>/<image>@<name>                   # list snapshot descendants
rbd flatten <pool>/<image>                           # decouple from parent snapshot

Libvirt

Configure a pool…

ceph osd pool create libvirt 128 128                 # create a pool for libvirt
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt'
                                                     # create a user for libvirt
virsh secret-define --file secret.xml                # define the secret
uuid=$(virsh secret-list | grep client.libvirt | cut -d' ' -f 2)
                                                     # uuid of the secret
key=$(ceph auth get-key client.libvirt)              # ceph client key
virsh secret-set-value --secret "$uuid" --base64 "$key" 
                                                     # set the value of the secret

Libvirt secret.xml associated with a Ceph RBD

<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>

Usage

qemu-img create -f raw rbd:<pool>/<image> <size>     # create a new image block device
qemu-img info rbd:<pool>/<image>                     # block device states
qemu-img resize rbd:<pool>/<image> <size>            # adjust image size
# create new RBD image from source qcow2 image, and vice versa
qemu-img convert -f qcow2 <path> -O raw rbd:<pool>/<image>
qemu-img convert -f raw rbd:<pool>/<image> -O qcow2 <path>

RBD image as <disk> entry for a virtual machine; replace:

  • source/name= with <pool>/<image>
  • host/name= with the IP address of a monitor server (multiples supported)
  • secret/uuid= with the corresponding libvirt secret UUID
<disk type="network" device="disk">
  <source protocol='rbd' name='libvirt/lxdev01.devops.test'>
    <host name='10.1.1.22' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='ec8a08ff-59b2-49fd-8162-532c9b0a2ed6'/>
  </auth>
  <target dev="vda" bus="virtio"/>
</disk>
ceph osd map <pool> <image>                                # map image to PGs
virsh qemu-monitor-command --hmp <instance> 'info block'   # configuration of the VM instance 
virsh dumpxml <instance> | xmlstarlet el -v | grep rbd     # get attached RBD image

Test

# work in a disposable path
pushd $(mktemp -d /tmp/$USER-vagrant-XXXXXX)
ssh-keygen -f id_rsa -N ''                           # ...generate a key pair with an empty passphrase

…create the following Vagrantfile in the same directory:

# vi: set ft=ruby :
Vagrant.configure("2") do |config|

  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
    libvirt.cpus = 1
    libvirt.qemu_use_session = false
  end

  config.vm.provision "file", 
    "id_rsa", "/home/vagrant/.ssh/id_rsa"
  public_key = File.read("id_rsa.pub")

  config.vm.provision "shell", false, <<-SHELL
     mkdir -p /home/vagrant/.ssh
     chmod 700 /home/vagrant/.ssh
     echo '#{public_key}' >> /home/vagrant/.ssh/authorized_keys
     chmod 600 /home/vagrant/.ssh/authorized_keys
     echo 'Host 192.168.*.*' >> /home/vagrant/.ssh/config
     echo 'StrictHostKeyChecking no' >> /home/vagrant/.ssh/config
     echo 'UserKnownHostsFile /dev/null' >> /home/vagrant/.ssh/config
     chmod 600 /home/vagrant/.ssh/config
  SHELL

  { 
    'adm1' => '172.28.128.10',
    'mon1' => '172.28.128.11',
    'mon2' => '172.28.128.12',
    'mon3' => '172.28.128.13',
    'osd1' => '172.28.128.14',
    'osd2' => '172.28.128.15',
    'osd3' => '172.28.128.16',
    'osd4' => '172.28.128.17',
    'osd5' => '172.28.128.18',
    'osd6' => '172.28.128.19',
    'osd7' => '172.28.128.20'
  }.each_pair do |name,ip|
    config.vm.define "#{name}" do |node|
      node.vm.hostname = name
      node.vm.box = "almalinux/9"
      node.vm.network :private_network, ip: ip
    end
  end

end
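
…bring the environment up with the libvirt provider, and log in to the admin node:

vagrant up --provider=libvirt
vagrant ssh adm1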