Ceph — Distributed Fault-tolerant Storage
Overview
Ceph (pronounced /ˈsɛf/) …bird's-eye view…
- …open source LGPL …web-site ceph.io
- …provides object, block, and file storage (POSIX)
- …completely distributed operation
- …no single point of failure (highly available)
- …replicates data with fault tolerance
- …data durability …replication, erasure coding, snapshots and clones
- …self-healing and self-managing (minimal administration time/cost)
 
- …stable, named releases every 12 months…
- …backports for 2 releases …upgrade up to 2 releases at a time
- …complete release history available on Wikipedia
- …18 major releases …since 2012/06
 
Governance
Governance (as of 2023)…
- …public telemetry data …2.5k clusters …>1EB storage
- …Ceph governance model…
- …executive council …elected by steering committee
- …steering committee …20+ members (predominantly Red Hat, SUSE, Intel)
- …leadership team …group effort, shared leadership, open meetings
 
- …non-profit Ceph Foundation …30+ organisations
- …neutral home for project stakeholders
- …organized as a directed fund under the Linux Foundation
 
Use-Cases
Storage strategy…
- …store data that serves a particular use case, for example
- …store block-devices for OpenStack
- …object data for a RESTful S3-compliant or Swift-compliant gateway
- …container-based storage for Kubernetes
- …providing a POSIX file-system
 
- …influences storage media selection…
- …requires trade-off between performance vs. costs
- …faster vs. bigger vs. higher durability
- …for example…
- …SSD-backed OSDs for a high performance pool
- …SAS drive/SSD journal-backed OSDs …high-performance block device volumes and images
- …SATA-backed OSDs for low cost storage
 
 
Note: Ceph’s object copies or coding chunks handle durability …make RAID obsolete
Storage strategy requires assigning Ceph OSDs to a CRUSH hierarchy…
- …select appropriate storage media for a use-case …identify appropriate OSDs
- …define CRUSH hierarchy …selects OSDs for storing placement groups
- …define CRUSH rule set …calculates placement of placement groups (shards)
- …create pools …select replicated or erasure-coded storage …durability
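As a sketch of the last step, a replicated and an erasure-coded pool could be created as follows (pool names and PG counts are placeholders):
# replicated pool (default CRUSH rule)
ceph osd pool create fast-pool 128 128 replicated
# erasure-coded pool using the default erasure-code profile
ceph osd pool create cold-pool 128 128 erasure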
Hardware
Identifying a price-to-performance profile suitable for targeted workload…
- IOPS optimized…
- …block storage …for example database instances as virtual machines on OpenStack
- …lowest cost per IOPS …highest IOPS per GB …99th percentile latency consistency
- …SSD storage (more expensive) …improves IOPS & total throughput
- …15k RPM SAS drives …separate SSD journals (handle frequent writes)
- …typically 3x replication for HDD …2x replication for SSD
 
- Throughput optimized…
- …fast data storage …block or object storage
- …HDD with acceptable total throughput characteristics
- …lowest cost per MB …highest MB/s per TB/BTU/Watt …97th percentile latency consistency
- …SSD journals (more expensive) …improve write performance
- …typically 3x replication
 
- Capacity optimized…
- …(inexpensive) big data storage …slower and less expensive SATA drives
- …lowest cost/BTU/Watt per TB
- …erasure coding common for maximizing usable capacity
- …co-locate journals …cheaper than SSD journals
 
Multiple performance domains can coexist in a Ceph storage cluster
OSD hardware within a pool needs to be identical
- …avoid dissimilar hardware in the same pool!
- Same…
- …controller, drive size, RPMs, seek times, I/O bandwidth
- …network throughput, journal configuration
 
- …simplifies provisioning …streamlines troubleshooting
Network requires sufficient throughput during backfill and recovery operations…
- …consider storage density when planning for rebalancing after hardware faults
- …prefer that a storage cluster recovers as quickly as possible
- …avoid IOPS starvation due to network latency …be mindful of network link over-subscription
- …segregate intra-cluster traffic from client-to-cluster traffic
- …public network …client traffic …communication to Ceph Monitors
- …storage network …Ceph OSD …heartbeats, replication, backfilling, and recovery traffic
- …(public and storage cluster networks on separate network cards)
 
- …most performance-related problems in Ceph begin with a networking issue
- …jumbo frames for a better CPU per bandwidth ratio
- …non-blocking network switch back-plane
- …same MTU throughout all networking devices (both public and storage networks)
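A minimal ceph.conf sketch separating the two networks (the subnets are examples):
[global]
# client traffic and communication to Ceph Monitors
public_network = 192.168.1.0/24
# OSD heartbeats, replication, backfilling and recovery traffic
cluster_network = 192.168.2.0/24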
 
Data Durability
Ceph is fully consistent…
- …data shared across availability zones (AZs), racks, nodes, disks
- …replication configurable to failure domains
- …even in extreme disasters, data can be recovered manually
- Erasure coding support…
- …commonly used for object storage
- …works on block storage & shared file-systems
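A sketch of an erasure-code profile and a pool using it (profile name, k/m values and pool name are examples):
# 4 data chunks + 2 coding chunks, distributed across hosts
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd erasure-code-profile get ec-4-2             # show the profile settings
# create an erasure-coded pool with this profile
ceph osd pool create ec-objects 64 64 erasure ec-4-2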
 
Architecture
Ceph deployment consists of three types of daemons…
- Ceph OSD Daemon (OSD)
- …stores data …performs data replication, erasure coding
- …rebalancing, recovery, monitoring and reporting functions
- …store all data as objects in a flat namespace (no directory hierarchies)
- …objects have a cluster-wide unique identifier, binary data, and metadata
 
- Ceph Monitor (MON) — Master copy of cluster map
- Ceph Manager
- …information about placement groups
- …process metadata and host metadata
- …RESTful monitoring APIs
 
- Ceph Client — Read/write data from the OSDs
- …retrieve the latest cluster map from MONs
- …computes object placement group and the primary OSD for data access
- …no intermediary server, broker or bus between the client and the OSD
- …clients define the semantics for the data format (e.g. block device)
 
CRUSH Algorithm
CRUSH (Controlled Replication Under Scalable Hashing)…
- …track the location of storage objects
- …no central authority/lookup table (distributed hashing)
- …efficiently compute information about object location
- …enables massive scale …data location algorithmically computed on clients
CRUSH maps…
- …describes the topology of cluster resources
- …implemented as a simple hierarchy (acyclic graph) and a ruleset
- …the number of placement groups, and the pool interface
- …map exists both on client nodes as well as Ceph Monitor (MON)
- …MONs host master copy of the CRUSH maps
- …MONs not involved in data I/O path
 
CRUSH Ruleset…
- …Ceph uses the CRUSH map to implement failure domains…
- …and performance domains for the storage media
- …supports multiple hierarchies to separate one hardware performance profile from another
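For instance, a replicated rule tied to a device class and failure domain could look like this (the rule name and pool are placeholders):
# rule selecting SSD-backed OSDs, replicating across hosts
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd crush rule ls                               # list defined rules
# assign the rule to a pool (the key is named crush_ruleset on older releases)
ceph osd pool set <pool> crush_rule ssd-rule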
 
Pools
Pools1 — Logical partitions …store RADOS objects
- …administrators can create pools for particular types of data
- Pool type
- …erasure coding …data durability method is pool-wide
- …completely transparent to the client
 
- …contain a defined number of placement groups (PGs)
- …sharding a pool into placement groups
- …with a configured replication level
 
- Attributes for ownership/access
- Balance number of PGs per pool with the number of PGs per OSD
- 50-200 PGs per OSD to balance out memory and CPU requirements and per-OSD load
- Total Placement Groups = (OSDs * (50-200)) / number of replicas
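- For example, 40 OSDs with 3 replicas and a target of ~100 PGs per OSD: (40 * 100) / 3 ≈ 1333, commonly rounded up to the next power of two, i.e. 2048 PGs in total (a convention, not a hard requirement)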
 
ceph df                                              # show usage
ceph osd lspools                                     # list pool numbers
ceph osd pool ls detail                              # list pools with additional information
ceph osd dump | grep ^pool                           # list replication size
ceph osd pool create <name> <pg-num> <pgp-num>       # create a pool
ceph osd pool get <pool> <key>                       # read configuration attribute
ceph osd pool set <pool> <key> <value>               # set configuration attribute
Keys:
- size – number of replica objects
- min_size – minimum number of replicas available for IO
- pg_num, pgp_num – (effective) number of PGs to use when calculating data placement
- crush_ruleset – CRUSH ruleset to use for mapping object placement in the cluster
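A short usage sketch against a hypothetical pool named rbd:
ceph osd pool get rbd size                           # e.g. "size: 3"
ceph osd pool set rbd size 3                         # keep three replicas per object
ceph osd pool set rbd min_size 2                     # allow I/O while two replicas are available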
States:
- active+clean – optimum state
- degraded – not enough replicas to meet requirement
- down – no OSD available storing PG
- inconsistent – PG not consistent across different OSDs
- repair – correcting inconsistent PGs to meet replication requirements
Delete a pool
# pool name twice
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
Monitor Servers (MONs)
Datastore for the health of the entire cluster…
- …hosts master copy of the CRUSH, OSD & MON maps
- …minimum of three MONs for a cluster quorum in production
- …there must be an odd number >=3 to provide consensus…
- …for the use of distributed decision-making with the Paxos algorithm
- …non-quorate pools become unavailable
 
- OSD map reflects the OSD daemons operating in the cluster
systemctl status ceph-mon@$(hostname)                 # daemon state       
ceph-mon -d -i $(hostname)                            # run daemon in foreground
/var/log/ceph/ceph-mon.$(hostname).log                # default log location
ceph mon stat                                         # state of all monitors
ceph -f json-pretty quorum_status                     # quorum information
ceph-conf --name mon.$(hostname) --show-config-value log_file
                                                      # get location of the log file
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
                                                      # access a monitor's admin socket
Manipulate the MON map:
monmap=/tmp/monmap.bin
ceph mon getmap -o $monmap                            # print the MON map if a quorum exists
ceph-mon --extract-monmap $monmap -i $(hostname)      # extract the MON map from a stopped ceph-mon
monmaptool $monmap --print                            # print MON map from file
monmaptool $monmap --clobber --rm lxmon03             # remove a monitor
ceph-mon --inject-monmap $monmap -i $(hostname)       # store new map
Object Storage Devices (OSDs)
Node hosting physical storage:
- ceph-osd is the object storage daemon, responsible for storing objects on a local file system and providing access to them over the network
- Requires a storage device accessible via a supported file-system (xfs, btrfs) with extended attributes
- OSD daemons are numerically identified: osd.<id>, and have the following states: in/out of the cluster, and up/down
- Primary OSDs are responsible for replication, coherency, re-balancing and recovery
- Secondary OSDs are under control of a primary OSD, and can become primary
- Small, random I/O is written sequentially to the local OSD journal (performance improves with an SSD journal)
- Enables atomic updates to objects, and replay of the journal on restart
- Write ACK when the minimum number of replica journals has been written
- Journal syncs to the file-system periodically to reclaim space
 
/var/log/ceph/*.log                                  # logfiles on storage servers    
ceph osd stat                                        # show state 
ceph osd df                                          # storage utilization by OSD
ceph node ls osd                                     # list OSDs
ceph osd dump                                        # show OSD map
# OSD ID to host IP, port mapping
ceph osd dump | grep '^osd.[0-9]*' | tr -s ' ' | cut -d' ' -f1,2,14
ceph osd tree                                        # OSD/CRUSH map as tree
ceph osd getmaxosd                                   # show number of available OSDs
ceph osd map <pool> <objname>                        # identify object location
/var/run/ceph/*                                      # admin socket
ceph daemon <socket> <command>                       # use the admin socket
ceph daemon <socket> status                          # show identity
Manipulate the CRUSH map:
ceph osd getcrushmap -o /tmp/crushmap.bin                    # extract the CRUSH map
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt          # decompile binary CRUSH map
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.bin          # compile CRUSH map
ceph osd setcrushmap -i /tmp/crushmap.bin                    # inject CRUSH map
Remove an OSD from production:
ceph osd out <num>                                   # remove OSD, begin rebalancing
systemctl stop ceph-osd@<num>                        # stop the OSD daemon on the server 
ceph osd crush remove osd.<num>                      # remove the OSD from the CRUSH map
ceph auth del osd.<num>                              # remove authorization credential
ceph osd rm osd.<num>                                # remove OSD from configuration
# remove from ceph.conf if required
Placement Groups (PGs)
Logical collection of objects:
- CRUSH assigns objects to placement groups
- CRUSH assigns a placement group to a primary OSD
- The primary OSD uses CRUSH to replicate the PGs to the secondary OSDs
- Re-balance/recover works on all objects in a placement group
- Placement Group IDs (pgid): <pool_num>.<pg_id> (hexadecimal)
ceph pg stat                                         # show state
ceph pg ls-by-osd $num                               # list PGs stored on an OSD
ceph pg map $pgid                                    # map a placement group
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded
                                                     # statistics for stuck PGs
ceph pg $pgid query                                  # statistics for a particular placement group
ceph pg scrub $pgid                                  # check primary and replicas
osdmap e20 pg 2.f22 (2.22) -> up [0,1,2] acting [0,1,2]
       │   │                  │          └── acting set of OSDs responsible for a particular PG
       │   │                  └───────────── matches "acting set" unless rebalancing in progress 
       │   └──────────────────────────────── placement group number                   
       └──────────────────────────────────── monotonically increasing OSD map version number
Recommended Number of PGs pg_num in relation to the number of OSDs
OSDs       PGs
<5         128
5~10       512
10~50      4096
>50        50-100 per OSD
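Applying the guideline to an existing pool, pg_num (and the effective pgp_num) can be raised, for example:
ceph osd pool set <pool> pg_num 512                  # increase number of placement groups
ceph osd pool set <pool> pgp_num 512                 # ...then the number used for placement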
Ceph File-System
CephFS2 — POSIX file-system on top of Ceph RADOS
- Metadata is stored in a RADOS pool separate from file data…
- …served via a resizable cluster of Metadata Servers (MDS)
Volumes
Abstraction for CephFS file systems3…
- Volumes
- …represent the entire CephFS file system
- …foundational layer for file-based storage
 
- Subvolumes
- …data organization & quota management
- …per application (tenant) directory trees
 
- Subvolume Groups
- …policies across groups of subvolumes
- …enforce administrative rules
 
Volume management is crucial for optimizing CephFS performance and security
- Why use subvolumes?
- …used to support Container Storage Interface (CSI) volumes
- …abstraction for independent CephFS directory trees
 
- …recommended to consolidate file system workloads onto a single CephFS
# list volumes
ceph fs volume ls
# create a CephFS volume on the monitor node
ceph fs volume create $vol_name
# information of a CephFS volume
ceph fs volume info $vol_name --human_readable
# delete a volume
ceph fs volume rm $vol_name --yes-i-really-mean-it
CephFS subvolume groups — Abstraction at a directory level
ceph fs subvolumegroup ls $vol_name
# absolute path of the CephFS subvolume group
ceph fs subvolumegroup getpath $vol_name $group_name
ceph fs subvolumegroup snapshot ls $vol_name $group_name
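Subvolumes within a group can be handled similarly (volume, group and subvolume names are placeholders):
# create a subvolume inside a group
ceph fs subvolume create $vol_name $subvol_name --group_name $group_name
# absolute path of the subvolume (e.g. for a CSI mount)
ceph fs subvolume getpath $vol_name $subvol_name --group_name $group_name
# list and remove subvolumes
ceph fs subvolume ls $vol_name --group_name $group_name
ceph fs subvolume rm $vol_name $subvol_name --group_name $group_name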
Deployment
Recommended deployment method is cephadm…
- …preferred bare-metal installation and management method
- cephadm-ansible …simplify workflows that are not covered by cephadm
- Alternatives …manual installation
- …ceph-ansible still maintained …not integrated with the orchestrator APIs
- …Rook for deployment in Kubernetes
 
- Deploying a new Ceph cluster…
- …requirements …Python 3, Systemd, Podman (or Docker), NTP, LVM2
- …list of supported versions at latest releases
 
Packages
RPM packages…
- …download.ceph.com hosts RPM packages for Enterprise Linux
- …Fedora cephadm package
# ...package provided by the Fedora community
dnf -y install cephadm
# ...packages provided by the Ceph community
curl --silent --remote-name --location https://download.ceph.com/rpm-18.2.0/el9/noarch/cephadm
chmod +x cephadm ; ./cephadm add-repo --release reef ; ./cephadm install
…alternatively add the package repository manually…
sudo dnf install -y epel-release
# ...for Enterprise Linux 9
sudo rpm --import 'https://download.ceph.com/keys/release.asc'
sudo dnf install -y https://download.ceph.com/rpm-18.2.0/el9/noarch/ceph-release-1-1.el9.noarch.rpm
dnf makecache ; sudo dnf install -y cephadm
cephadm
Utility that is used to manage a Ceph cluster
- …manages the full lifecycle of a Ceph cluster
- …does not rely on external configuration tools (like Ansible)
- …builds on the orchestrator API
- …uses SSH connections to interact with hosts
- …deploys services using standard container images
- …sequence is started from the command line on the first host of the cluster
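A minimal bootstrap sketch on the first host (IP address and host name are examples):
# deploy the first MON and MGR as containers on this host
cephadm bootstrap --mon-ip 172.28.128.10
# add further hosts once the cluster SSH key is installed on them
ceph orch host add mon1 172.28.128.11
ceph orch host ls                                    # list hosts managed by the orchestrator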
Operations
ceph status                                          # summary of state
ceph -s
ceph health detail
ceph -w                                              # watch mode
ceph-deploy --overwrite-conf config push <node>      # update the configuration after changes
Authentication
cephx authentication protocol (…similar to Kerberos)
- …MONs can authenticate users and distribute keys
- …no single point of failure or bottleneck
- …returns a session key (ticket) used to obtain Ceph services
 
- …Ceph monitors and OSDs share a secret
Administrator must set up users…
# lists authentication state
ceph auth list
# create a new user 
ceph auth get-or-create client.<name> <caps>
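For example, a client restricted to a single pool (client name and pool are placeholders):
ceph auth get-or-create client.app1 mon 'allow r' osd 'allow rw pool=app1-data'
ceph auth get client.app1                            # show key and capabilities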
CephFS
CephFS — POSIX-compliant distributed file-system…
- …with Linux kernel client & FUSE client
- …the RADOS store hosts metadata & data pools
- MDSs (Metadata Server Daemons) — Coordinate clients data access
Operate on the CephFS file-systems4 in your Ceph cluster:
# list all file-system names
ceph fs ls
# destroy a file-system
ceph fs rm $name --yes-i-really-mean-it
Clients
Clients are distributed in the ceph-common package
/etc/ceph/ceph.conf                                  # default location for configuration
/etc/ceph/ceph.client.admin.keyring                  # default location for admin key
RADOS
- Objects store data and have: a name, the payload (data), and attributes
- Object namespace is flat
rados lspools                                        # list pools
rados df                                             # show pool statistics
rados mkpool <pool>                                  # create a new pool     
rados rmpool <pool>                                  # delete a pool
rados ls -p <pool>                                   # list contents of pool
rados -p <pool> put <objname> <file>                 # store object
rados -p <pool> rm <objname>                         # delete object
RBD
Block interface on top of RADOS:
- Images (single block devices) striped over multiple RADOS objects/OSDs
- The default pool is called rbd
modprobe rbd ; modinfo rbd                           # kernel module
/sys/bus/rbd/                                        # live module data
rbd ls [-l] [<pool>]                                 # list block devices
rbd info [<pool>/]<name>                             # inspect image
rbd du [<name>]                                      # list size of images
rbd create --size <MB> [<pool>/]<name>               # create image
rbd map <name>                                       # map image to device
rbd showmapped                                       # list device mappings
rbd unmap <name>                                     # unmap image from device
rbd rm [<pool>/]<name>                               # remove image
Snapshots
- Read-only copy of the state of an image at a particular point in time
- Layering allows for cloning snapshots
- Stop I/O before taking a snapshot (internal file-systems must be in a consistent state)
- Rolling back overwrites the current version of the image
- It is faster to clone from a snapshot than to rollback to return to a pre-existing state
rbd snap create <pool>/<image>@<name>                # snapshot image
rbd snap ls <pool>/<image>                           # list snapshots of image
rbd snap rollback <pool>/<image>@<name>              # rollback to snapshot
rbd snap rm <pool>/<image>@<name>                    # delete snapshot
rbd snap purge <pool>/<image>                        # delete all snapshots
rbd snap protect <pool>/<image>@<name>               # protect snapshot before cloning
rbd snap unprotect <pool>/<image>@<name>             # revert protection
rbd clone <pool>/<image>@<name> <pool>/<image>       # clone snapshot to new image
rbd children <pool>/<image>@<name>                   # list snapshot descendants
rbd flatten <pool>/<image>                           # decouple from parent snapshot
Libvirt
Configure a pool…
ceph osd pool create libvirt 128 128                 # create a pool for libvirt
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt'
                                                     # create a user for libvirt
virsh secret-define --file secret.xml                # define the secret
uuid=$(virsh secret-list | grep client.libvirt | cut -d' ' -f 2)
                                                     # uuid of the secret
key=$(ceph auth get-key client.libvirt)              # ceph client key
virsh secret-set-value --secret "$uuid" --base64 "$key" 
                                                     # set the secret value for the UUID
Libvirt secret.xml associated with a Ceph RBD
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
Usage
qemu-img create -f raw rbd:<pool>/<image> <size>     # create a new image block device
qemu-img info rbd:<pool>/<image>                     # block device states
qemu-img resize rbd:<pool>/<image> <size>            # adjust image size
# create new RBD image from source qcow2 image, and vice versa
qemu-img convert -f qcow2 <path> -O rbd rbd:<pool>/<image>
qemu-img convert -f rbd rbd:<pool>/<image> -O qcow2 <path>
RBD image as <disk> entry for a virtual machine; replace:
- source/name= with <pool>/<image>
- host/name= with the IP address of a monitor server (multiple supported)
- secret/uuid= with the corresponding libvirt secret UUID
<disk type="network" device="disk">
  <source protocol='rbd' name='libvirt/lxdev01.devops.test'>
    <host name='10.1.1.22' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='ec8a08ff-59b2-49fd-8162-532c9b0a2ed6'/>
  </auth>
  <target dev="vda" bus="virtio"/>
</disk>
ceph osd map <pool> <image>                                # map image to PGs
virsh qemu-monitor-command --hmp <instance> 'info block'   # configuration of the VM instance 
virsh dumpxml <instance> | xmlstarlet el -v | grep rbd     # get attached RBD image
Test
# work in a disposable path
pushd $(mktemp -d /tmp/$USER-vagrant-XXXXXX)
ssh-keygen -f id_rsa   # ...press enter for an empty password
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
    libvirt.cpus = 1
    libvirt.qemu_use_session = false
  end
  config.vm.provision "file", 
    source: "id_rsa", destination: "/home/vagrant/.ssh/id_rsa"
  public_key = File.read("id_rsa.pub")
  config.vm.provision "shell", privileged: false, inline: <<-SHELL
     mkdir -p /home/vagrant/.ssh
     chmod 700 /home/vagrant/.ssh
     echo '#{public_key}' >> /home/vagrant/.ssh/authorized_keys
     chmod -R 600 /home/vagrant/.ssh/authorized_keys
     echo 'Host 192.168.*.*' >> /home/vagrant/.ssh/config
     echo 'StrictHostKeyChecking no' >> /home/vagrant/.ssh/config
     echo 'UserKnownHostsFile /dev/null' >> /home/vagrant/.ssh/config
     chmod -R 600 /home/vagrant/.ssh/config
  SHELL
  { 
    'adm1' => '172.28.128.10',
    'mon1' => '172.28.128.11',
    'mon2' => '172.28.128.12',
    'mon3' => '172.28.128.13',
    'osd1' => '172.28.128.14',
    'osd2' => '172.28.128.15',
    'osd3' => '172.28.128.16',
    'osd4' => '172.28.128.17',
    'osd5' => '172.28.128.18',
'osd6' => '172.28.128.19',
'osd7' => '172.28.128.20'
  }.each_pair do |name,ip|
    config.vm.define "#{name}" do |node|
      node.vm.hostname = name
      node.vm.box = "almalinux/9"
      node.vm.network :private_network, ip: ip
    end
  end
end
Footnotes
- Pools, Ceph Documentation
  https://docs.ceph.com/en/latest/rados/operations/pools
- Ceph File-System, Ceph Documentation
  https://docs.ceph.com/en/latest/cephfs
- FS Volumes, Ceph Documentation
  https://docs.ceph.com/en/latest/cephfs/fs-volumes/#fs-volumes
- CephFS Administrative Commands
  https://docs.ceph.com/en/latest/cephfs/administration