Ceph — Distributed Fault-tolerant Storage

Storage
Published

July 10, 2015

Modified

April 11, 2025

Overview

Ceph (pronounced /ˈsɛf/) …a bird's-eye view…

  • …open source (LGPL) …website ceph.io
  • …provides object, block, and file storage (POSIX)
  • …completely distributed operation
    • …no single point of failure (highly available)
    • …replicates data with fault tolerance
    • …data durability …replication, erasure coding, snapshots and clones
    • …self-healing and self-managing (minimal administration time/cost)
  • …stable, named releases every 12 months…
    • …backports for 2 releases …upgrade up to 2 releases at a time
    • …complete release history available on Wikipedia
    • …18 major releases …since 2012/06

Governance

Governance (as of 2023)…

  • …public telemetry data …2.5k clusters …>1EB storage
  • …Ceph governance model
    • …executive council …elected by steering committee
    • …steering committee …20+ members (predominantly Red Hat, SUSE, Intel)
    • …leadership team …group effort, shared leadership, open meetings
  • …non-profit Ceph Foundation …30+ organisations
    • …neutral home for project stakeholders
    • …organized as a directed fund under the Linux Foundation

Use-Cases

Storage strategy…

  • …store data that serves a particular use case, for example
    • …store block-devices for OpenStack
    • …object data for a RESTful S3-compliant or Swift-compliant gateway
    • …container-based storage for Kubernetes
    • …providing a POSIX file-system
  • …influences storage media selection
    • …requires trade-off between performance vs. costs
    • …faster vs. bigger vs. higher durability
    • …for example…
      • …SSD-backed OSDs for a high performance pool
      • …SAS drive/SSD journal-backed OSDs …high-performance block device volumes and images
      • …SATA-backed OSDs for low cost storage

Note: Ceph’s object copies or coding chunks handle durability …making RAID obsolete

A storage strategy requires assigning Ceph OSDs to a CRUSH hierarchy

  • …select appropriate storage media for a use-case …identify appropriate OSDs
  • …define CRUSH hierarchy …select OSDs storing placement groups
  • …define CRUSH rule set …calculate placement groups (shards)
  • …create pools …select replicated or erasure-coded storage …durability (example below)
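
For example, a replicated pool and an erasure-coded pool might be created like this (pool names, PG counts, and the EC profile are placeholders):

# replicated pool with 3 copies (default CRUSH rule)
ceph osd pool create rbd-fast 128 128 replicated
ceph osd pool set rbd-fast size 3

# erasure-coded pool using a custom profile (4 data + 2 coding chunks)
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create object-cold 128 128 erasure ec-4-2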

Hardware

Identifying a price-to-performance profile suitable for targeted workload…

  • IOPS optimized
    • …block storage …for example database instances as virtual machines on OpenStack
    • …lowest cost per IOPS …highest IOPS per GB …99th percentile latency consistency
    • …SSD storage (more expensive) …improves IOPS & total throughput
    • …15k RPM SAS drives …separate SSD journals (handle frequent writes)
    • …typically 3x replication for HDD …2x replication for SSD
  • Throughput optimized
    • …fast data storage …block or object storage
    • …HDD with acceptable total throughput characteristics
    • …lowest cost per MB …highest MB/s per TB/BTU/Watt …97th percentile latency consistency
    • …SSD journals (more expensive) …improve write performance
    • …typically 3x replication
  • Capacity optimized
    • …(inexpensive) big data storage …slower and less expensive SATA drives
    • …lowest cost/BTU/Watt per TB
    • …erasure coding common for maximizing usable capacity
    • …co-locate journals …cheaper than SSD journals

Multiple performance domains can coexist in a Ceph storage cluster

OSD hardware within a pool should be identical

  • …avoid dissimilar hardware in the same pool!
  • Same…
    • …controller, drive size, RPMs, seek times, I/O bandwidth
    • …network throughput, journal configuration
  • …simplifies provisioning …streamlines troubleshooting

Network requires sufficient throughput during backfill and recovery operations

  • …consider storage density when planning for rebalancing after hardware faults
  • …prefer that a storage cluster recovers as quickly as possible
  • …avoid IOPS starvation due to network latency …be mindful of network link over-subscription
  • …segregate intra-cluster traffic from client-to-cluster traffic
    • …public network …client traffic …communication to Ceph Monitors
    • …storage network …Ceph OSD …heartbeats, replication, backfilling, and recovery traffic
    • …(public and storage cluster networks on separate network cards)
  • …performance-related problems in Ceph usually begin with a networking issue
    • …jumbo frames for a better CPU per bandwidth ratio
    • …non-blocking network switch back-plane
    • …same MTU throughout all networking devices (both public and storage networks)
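
For example, a 9000-byte MTU can be verified end-to-end with non-fragmenting pings (interface and host names assumed):

ip link show eth0 | grep mtu                         # configured MTU on the local interface
ping -M do -s 8972 -c 3 osd-node01                   # 9000 bytes minus 28 bytes IP/ICMP overhead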

Data Durability

Ceph is strongly consistent…

  • …data spread across availability zones (AZs), racks, nodes, disks
  • …replication configurable to failure domains
  • …even in extreme disasters, data can be recovered manually
  • Erasure coding support…
    • …commonly used for object storage
    • …works on block storage & shared file-systems (see sketch below)
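
A minimal sketch of an erasure-coded pool usable by RBD (pool and image names assumed); partial writes must be enabled explicitly:

ceph osd pool create ec-pool 64 64 erasure
ceph osd pool set ec-pool allow_ec_overwrites true   # required for RBD & CephFS on EC pools
rbd create --size 1G --data-pool ec-pool rbd/disk01  # image metadata in a replicated pool, data in the EC pool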

Architecture

Ceph deployment consists of three types of daemons…

  • Ceph OSD Daemon (OSD)
    • …stores data …performs data replication, erasure coding
    • …rebalancing, recovery, monitoring and reporting functions
    • …store all data as objects in a flat namespace (no directory hierarchies)
    • …objects have a cluster-wide unique identifier, binary data, and metadata
  • Ceph Monitor (MON) — Master copy of cluster map
  • Ceph Manager
    • …information about placement groups
    • …process metadata and host metadata
    • …RESTful monitoring APIs
  • Ceph Client — Read/write data from the OSDs
    • …retrieve the latest cluster map from MONs
    • …compute the object's placement group and the primary OSD for data access
    • …no intermediary server, broker or bus between the client and the OSD
    • …clients define the semantics for the data format (i.e. block device)

CRUSH Algorithm

CRUSH (Controlled Replication Under Scalable Hashing)

  • …track the location of storage objects
  • …no central authority/lookup table (distributed hashing)
  • …efficiently compute information about object location
  • …enables massive scale …data location algorithmically computed on clients (example below)
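
For example, any node with a client keyring can compute an object's location without a lookup (pool and object names are placeholders):

ceph osd map mypool myobject                         # computed, not looked up; prints something like:
# osdmap e20 pool 'mypool' (2) object 'myobject' -> pg 2.5f3a (2.3a) -> up ([0,3,7], p0) acting ([0,3,7], p0)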

CRUSH maps

  • …describe the topology of cluster resources
  • …implemented as a simple hierarchy (acyclic graph) and a ruleset
  • …the number of placement groups, and the pool interface
  • …map exists both on client nodes as well as Ceph Monitor (MON)
    • …MONs host master copy of the CRUSH maps
    • …MONs not involved in data I/O path

CRUSH Ruleset

  • …Ceph uses the CRUSH map to implement failure domains
    • …and performance domains for the storage media
    • …supports multiple hierarchies to separate one hardware performance profile from another (example below)
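
A sketch of separating performance domains with CRUSH rules bound to device classes (rule and pool names assumed; recent releases use the crush_rule pool key):

ceph osd crush rule create-replicated fast default host ssd
ceph osd crush rule create-replicated slow default host hdd
ceph osd pool set rbd-fast crush_rule fast           # bind a pool to the SSD rule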

Pools

Pools — Logical partitions …store RADOS objects

  • …administrators can create pools for particular types of data
  • Pool type
    • …erasure coding …data durability method is pool-wide
    • …completely transparent to the client
  • …contain a defined number of placement groups (PGs)
    • …sharding a pool into placement groups
    • …with a configured replication level
  • Attributes for ownership/access
  • Balance number of PGs per pool with the number of PGs per OSD
    • 50-200 PGs per OSD to balance out memory and CPU requirements and per-OSD load
    • Total PGs = (OSDs × (50–200)) / number of replicas
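
For example, 40 OSDs with 3× replication and a target of 100 PGs per OSD: (40 × 100) / 3 ≈ 1333, rounded up to the next power of two → 2048 total PGs.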
ceph df                                              # show usage
ceph osd lspools                                     # list pool numbers
ceph osd pool ls detail                              # list pools with additional information
ceph osd dump | grep ^pool                           # list replication size
ceph osd pool create <name> <pg-num> <pgp-num>       # create a pool
ceph osd pool get <pool> <key>                       # read configuration attribute
ceph osd pool set <pool> <key> <value>               # set configuration attribute

Keys:

  • size — number of replica objects
  • min_size — minimum number of replicas available for I/O
  • pg_num, pgp_num — (effective) number of PGs to use when calculating data placement
  • crush_ruleset — CRUSH rule to use for mapping object placement in the cluster

States:

  • active+clean – optimum state
  • degraded – not enough replicas to meet requirement
  • down – no OSD available storing PG
  • inconsistent – PG not consistent across different OSDs
  • repair – correcting inconsistent PGs to meet replication requirements

Delete a pool

# pool name twice
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
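
Recent releases refuse pool deletion unless the monitors permit it; a sketch:

ceph config set mon mon_allow_pool_delete true       # lift the safety guard
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false      # ...and restore it afterwards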

Monitor Servers (MONs)

Datastore for the health of the entire cluster…

  • …hosts master copy of the CRUSH, OSD & MON maps
  • minimum of three MONs for a cluster quorum in production
    • …there must be an odd number >=3 to provide consensus…
    • …for the use of distributed decision-making with the Paxos algorithm
    • …a non-quorate cluster becomes unavailable
  • …the OSD map reflects the OSD daemons operating in the cluster
systemctl status ceph-mon@$(hostname)                 # daemon state       
ceph-mon -d -i $(hostname)                            # run daemon in foreground
/var/log/ceph/ceph-mon.$(hostname).log                # default log location
ceph mon stat                                         # state of all monitors
ceph -f json-pretty quorum_status                     # quorum information
ceph-conf --name mon.$(hostname) --show-config-value log_file
                                                      # get location of the log file
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
                                                      # access a monitor's admin socket

Manipulate the MON map:

monmap=/tmp/monmap.bin
ceph mon getmap -o $monmap                            # print the MON map if a quorum exists
ceph-mon --extract-monmap $monmap -i $(hostname)      # extract the MON map while ceph-mon is stopped
monmaptool $monmap --print                            # print MON map from file
monmaptool $monmap --clobber --rm lxmon03             # remove a monitor
ceph-mon --inject-monmap $monmap -i $(hostname)       # store new map

Object Storage Devices (OSDs)

Node hosting physical storage:

  • ceph-osd is the object storage daemon, responsible for storing objects on a local file system and providing access to them over the network
  • Requires a storage device accessible via a supported file-system (xfs, btrfs) with extended attributes
  • OSD daemons are numerically identified: osd.<id>, and have following states in/out of the cluster, and up/down
  • Primary OSDs are responsible for replication, coherency, re-balancing and recovery
  • Secondary OSDs are under control of a primary OSD, and can become primary
  • Small, random I/O written sequentially to the local OSD journal (improve performance with SSD)
    • Enables atomic updates to objects, and replay of the journal on restart
    • Write ACK when minimum replica journals written
    • Journal sync to file-system periodically to reclaim space
/var/log/ceph/*.log                                  # logfiles on storage servers    
ceph osd stat                                        # show state 
ceph osd df                                          # storage utilization by OSD
ceph node ls osd                                     # list OSDs
ceph osd dump                                        # show OSD map
# OSD ID to host IP, port mapping
ceph osd dump | grep '^osd.[0-9]*' | tr -s ' ' | cut -d' ' -f1,2,14
ceph osd tree                                        # OSD/CRUSH map as tree
ceph osd getmaxosd                                   # show number of available OSDs
ceph osd map <pool> <objname>                        # identify object location
/var/run/ceph/*                                      # admin socket
ceph daemon <socket> <command>                       # use the admin socket
ceph daemon <socket> status                          # show identity
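
For example, inspecting a running OSD through its admin socket (OSD id assumed):

ceph daemon osd.0 config show                        # runtime configuration
ceph daemon osd.0 perf dump                          # internal performance counters
ceph daemon osd.0 dump_ops_in_flight                 # operations currently executing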

Manipulate the CRUSH map:

ceph osd getcrushmap -o /tmp/crushmap.bin                    # extract the CRUSH map
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt          # decompile binary CRUSH map
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.bin          # compile CRUSH map
ceph osd setcrushmap -i /tmp/crushmap.bin                    # inject CRUSH map
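
A fragment of a decompiled CRUSH map might look like this (illustrative; names and ids depend on the cluster):

rule replicated_rule {
    id 0
    type replicated
    step take default                                # start at the root of the hierarchy
    step chooseleaf firstn 0 type host               # place each replica on a distinct host
    step emit
}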

Remove an OSD from production:

ceph osd out <num>                                   # remove OSD, begin rebalancing
systemctl stop ceph-osd@<num>                        # stop the OSD daemon on the server 
ceph osd crush remove osd.<num>                      # remove the OSD from the CRUSH map
ceph auth del osd.<num>                              # remove authorization credential
ceph osd rm osd.<num>                                # remove OSD from configuration
# remove the entry from ceph.conf if present

Placement Groups (PGs)

Logical collection of objects:

  • CRUSH assigns objects to placement groups
  • CRUSH assigns a placement group to a primary OSD
  • The primary OSD uses CRUSH to replicate the PGs to the secondary OSDs
  • Re-balance/recover works on all objects in a placement group
  • Placement Group IDs (pgid) <pool_num>.<pg_id> (hexadecimal)
ceph pg stat                                         # show state
ceph pg ls-by-osd $num                               # list PGs stored on an OSD
ceph pg map $pgid                                    # map a placement group
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded
                                                     # statistics for stuck PGs
ceph pg $pgid query                                  # statistics for a particular placement group
ceph pg scrub $pgid                                  # check primary and replicas
osdmap e20 pg 2.f22 (2.22) -> up [0,1,2] acting [0,1,2]
       │      │                  │              └── acting set of OSDs responsible for a particular PG
       │      │                  └───────────────── matches "acting set" unless rebalancing is in progress
       │      └──────────────────────────────────── placement group number
       └─────────────────────────────────────────── monotonically increasing OSD map version number

Recommended Number of PGs pg_num in relation to the number of OSDs

OSDs       PGs
<5         128
5~10       512
10~50      4096
>50        50-100 per OSD
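
Newer releases can size pg_num automatically with the PG autoscaler instead of this table; a sketch (pool name assumed):

ceph mgr module enable pg_autoscaler                 # Nautilus and later
ceph osd pool set rbd-fast pg_autoscale_mode on
ceph osd pool autoscale-status                       # review recommendations/actions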

Deployment

The recommended deployment method is cephadm

Packages

RPM packages…

# ...package provided by the Fedora community
dnf -y install cephadm

# ...packages provided by the Ceph community
curl --silent --remote-name --location https://download.ceph.com/rpm-18.2.0/el9/noarch/cephadm
chmod +x cephadm ; ./cephadm add-repo --release reef ; ./cephadm install

…alternatively add the package repository manually

sudo dnf install -y epel-release
# ...for Enterprise Linux 9
sudo rpm --import 'https://download.ceph.com/keys/release.asc'
sudo dnf install -y https://download.ceph.com/rpm-18.2.0/el9/noarch/ceph-release-1-1.el9.noarch.rpm
dnf makecache ; sudo dnf install -y cephadm

cephadm

Utility that is used to manage a Ceph cluster

  • …manages the full lifecycle of a Ceph cluster
  • …does not rely on external configuration tools (like Ansible)
  • …builds on the orchestrator API
  • …uses SSH connections to interact with hosts
  • …deploys services using standard container images
  • …bootstrap sequence is started from the command line on the first host of the cluster (example below)
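
A minimal bootstrap on the first host might look like this (IP addresses and host names are placeholders):

cephadm bootstrap --mon-ip 192.168.1.10              # first MON/MGR on this host
ceph orch host add node2 192.168.1.11                # ...add further hosts over SSH
ceph orch apply osd --all-available-devices          # consume all empty disks as OSDs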

Operations

ceph status                                          # summary of state
ceph -s
ceph health detail
ceph -w                                              # watch mode
ceph-deploy --overwrite-conf config push <node>      # update the configuration after changes

Authentication

cephx authentication protocol (…similar to Kerberos)

  • …MONs can authenticate users and distribute keys
    • …no single point of failure or bottleneck
    • …returns a session key (ticket) used to obtain Ceph services
  • …Ceph monitors and OSDs share a secret

Administrators must set up users…

# lists authentication state
ceph auth list

# create a new user 
ceph auth get-or-create client.<name> <caps>
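
For example, a client limited to RBD images in a single pool (client and pool names assumed):

ceph auth get-or-create client.qemu mon 'profile rbd' osd 'profile rbd pool=libvirt'
ceph auth get client.qemu                            # inspect the generated keyring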

CephFS

CephFS — POSIX-compliant distributed file-system…

  • …with Linux kernel client & FUSE client
  • …the RADOS store hosts metadata & data pools
  • MDSs (Metadata Server Daemons) — Coordinate clients’ data access (see below)
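
A file-system needs a metadata pool, a data pool, and at least one MDS; a sketch (names and PG counts assumed):

ceph osd pool create cephfs_metadata 32
ceph osd pool create cephfs_data 64
ceph fs new myfs cephfs_metadata cephfs_data         # metadata pool comes first
ceph fs volume create myfs                           # ...or in one step (incl. MDS) on cephadm clusters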

Operate on the CephFS file-systems in your Ceph cluster:

# list all file-system names
ceph fs ls

# destroy a file-system
ceph fs rm $name --yes-i-really-mean-it
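
Mount with the kernel client or FUSE (monitor address, user, and mount point assumed):

mount -t ceph 10.1.1.22:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse -n client.admin /mnt/cephfs                # ...alternatively via FUSE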

Clients

Clients are distributed in the ceph-common package

/etc/ceph/ceph.conf                                  # default location for configuration
/etc/ceph/ceph.client.admin.keyring                  # default location for admin key 

RADOS

  • Objects store data and have: a name, the payload (data), and attributes
  • Object namespace is flat
rados lspools                                        # list pools
rados df                                             # show pool statistics
rados mkpool <pool>                                  # create a new pool     
rados rmpool <pool>                                  # delete a pool
rados ls -p <pool>                                   # list contents of pool
rados -p <pool> put <objname> <file>                 # store object
rados -p <pool> rm <objname>                         # delete object

RBD

Block interface on top of RADOS:

  • Images (single block devices) striped over multiple RADOS objects/OSDs
  • The default pool is called rbd
modprobe rbd                                         # load the kernel module
modinfo rbd                                          # kernel module details
/sys/bus/rbd/                                        # live module data
rbd ls [-l] [<pool>]                                 # list block devices
rbd info [<pool>/]<name>                             # inspect image
rbd du [<name>]                                      # list size of images
rbd create --size <MB> [<pool>/]<name>               # create image
rbd map <name>                                       # map image to device
rbd showmapped                                       # list device mappings
rbd unmap <name>                                     # unmap image from device
rbd rm [<pool>/]<name>                               # remove image

Snapshots

  • Read-only copy of the state of an image at a particular point in time
  • Layering allows for cloning snapshots
  • Stop I/O before taking a snapshot (internal file-systems must be in a consistent state)
  • Rolling back overwrites the current version of the image
  • It is faster to clone from a snapshot than to roll back to a pre-existing state
rbd snap create <pool>/<image>@<name>                # snapshot image
rbd snap ls <pool>/<image>                           # list snapshots of image
rbd snap rollback <pool>/<image>@<name>              # rollback to snapshot
rbd snap rm <pool>/<image>@<name>                    # delete snapshot
rbd snap purge <pool>/<image>                        # delete all snapshots
rbd snap protect <pool>/<image>@<name>               # protect snapshot before cloning
rbd snap unprotect <pool>/<image>@<name>             # revert protection
rbd clone <pool>/<image>@<name> <pool>/<image>       # clone snapshot to new image
rbd children <pool>/<image>@<name>                   # list snapshot descendants
rbd flatten <pool>/<image>                           # decouple from parent snapshot

Libvirt

Configure a pool…

ceph osd pool create libvirt 128 128                 # create a pool for libvirt
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt'
                                                     # create a user for libvirt
virsh secret-define --file secret.xml                # define the secret
uuid=$(virsh secret-list | grep client.libvirt | cut -d' ' -f 2)
                                                     # uuid of the secret
key=$(ceph auth get-key client.libvirt)              # ceph client key
virsh secret-set-value --secret "$uuid" --base64 "$key" 
                                                     # set the value of the secret

Libvirt secret.xml associated with a Ceph RBD

<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>

Usage

qemu-img create -f raw rbd:<pool>/<image> <size>     # create a new image block device
qemu-img info rbd:<pool>/<image>                     # block device states
qemu-img resize rbd:<pool>/<image> <size>            # adjust image size
# create new RBD image from source qcow2 image, and vice versa
qemu-img convert -f qcow2 <path> -O raw rbd:<pool>/<image>
qemu-img convert -f raw rbd:<pool>/<image> -O qcow2 <path>

RBD image as <disk> entry for a virtual machine; replace:

  • source/name= with <pool>/<image>
  • host/name= with the IP address of a monitor server (multiples supported)
  • secret/uuid= with the corresponding libvirt secret UUID
<disk type="network" device="disk">
  <source protocol='rbd' name='libvirt/lxdev01.devops.test'>
    <host name='10.1.1.22' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='ec8a08ff-59b2-49fd-8162-532c9b0a2ed6'/>
  </auth>
  <target dev="vda" bus="virtio"/>
</disk>
ceph osd map <pool> <image>                                # map image to PGs
virsh qemu-monitor-command --hmp <instance> 'info block'   # configuration of the VM instance 
virsh dumpxml <instance> | xmlstarlet el -v | grep rbd     # get attached RBD image

Test

# work in a disposable path
pushd $(mktemp -d /tmp/$USER-vagrant-XXXXXX)
ssh-keygen -f id_rsa -N ''                           # ...generate a key pair with an empty passphrase

…create the following Vagrantfile in the same directory:

# vi: set ft=ruby :
Vagrant.configure("2") do |config|

  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
    libvirt.cpus = 1
    libvirt.qemu_use_session = false
  end

  config.vm.provision "file", 
    "id_rsa", "/home/vagrant/.ssh/id_rsa"
  public_key = File.read("id_rsa.pub")

  config.vm.provision "shell", false, <<-SHELL
     mkdir -p /home/vagrant/.ssh
     chmod 700 /home/vagrant/.ssh
     echo '#{public_key}' >> /home/vagrant/.ssh/authorized_keys
     chmod 600 /home/vagrant/.ssh/authorized_keys
     echo 'Host 192.168.*.*' >> /home/vagrant/.ssh/config
     echo 'StrictHostKeyChecking no' >> /home/vagrant/.ssh/config
     echo 'UserKnownHostsFile /dev/null' >> /home/vagrant/.ssh/config
     chmod 600 /home/vagrant/.ssh/config
  SHELL

  { 
    'adm1' => '172.28.128.10',
    'mon1' => '172.28.128.11',
    'mon2' => '172.28.128.12',
    'mon3' => '172.28.128.13',
    'osd1' => '172.28.128.14',
    'osd2' => '172.28.128.15',
    'osd3' => '172.28.128.16',
    'osd4' => '172.28.128.17',
    'osd5' => '172.28.128.18',
    'osd6' => '172.28.128.19',
    'osd7' => '172.28.128.20'
  }.each_pair do |name,ip|
    config.vm.define "#{name}" do |node|
      node.vm.hostname = name
      node.vm.box = "almalinux/9"
      node.vm.network :private_network, ip: ip
    end
  end

end
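
…bring the environment up with the libvirt provider, and log in to the admin node:

vagrant up --provider=libvirt
vagrant ssh adm1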