Ceph — Distributed Fault-tolerant Storage
Overview
Ceph (pronounced /ˈsɛf/) …bird's-eye view…
- …open source LGPL …web-site ceph.io
- …provides object, block, and file storage (POSIX)
- …completely distributed operation
- …no single point of failure (highly available)
- …replicates data with fault tolerance
- …data durability …replication, erasure coding, snapshots and clones
- …self-healing and self-managing (minimal administration time/cost)
- …stable, named releases every 12 months…
- …backports for 2 releases …upgrade up to 2 releases at a time
- …complete release history available on Wikipedia
- …18 major releases …since 2012/06
Governance
Governance (state 2023)…
- …public telemetry data …2.5k clusters …>1EB storage
- …Ceph governance model…
- …executive council …elected by steering committee
- …steering committee …20+ members (predominantly Red Hat, SUSE, Intel)
- …leadership team …group effort, shared leadership, open meetings
- …non-profit Ceph Foundation …30+ organisations
- …neutral home for project stakeholders
- …organized as a directed fund under the Linux Foundation
Use-Cases
Storage strategy…
- …store data that serves a particular use case, for example
- …store block-devices for OpenStack
- …object data for a RESTful S3-compliant or Swift-compliant gateway
- …container-based storage for Kubernetes
- …providing a POSIX file-system
- …influences storage media selection…
- …requires trade-off between performance vs. costs
- …faster vs. bigger vs. higher durability
- …for example…
- …SSD-backed OSDs for a high performance pool
- …SAS drive/SSD journal-backed OSDs …high-performance block device volumes and images
- …SATA-backed OSDs for low cost storage
Note: Ceph’s object copies or coding chunks handle durability …make RAID obsolete
Storage strategy requires assigning Ceph OSDs to a CRUSH hierarchy…
- …select appropriate storage media for a use-case …identify appropriate OSDs
- …define CRUSH hierarchy …selects the OSDs storing placement groups
- …define CRUSH rule setting …calculate placement groups (shards)
- …create pools …select replicated or erasure-coded storage …durability
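The pool-creation step above can be sketched as follows; the pool name, PG counts, replica count and application tag are illustrative assumptions, not values from this document:

```
# Sketch only -- names and numbers are examples; run against a test cluster
ceph osd pool create rbd-pool 128 128 replicated   # replicated pool
ceph osd pool set rbd-pool size 3                  # three object copies
ceph osd pool application enable rbd-pool rbd      # tag the intended use-case
```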
Hardware
Identifying a price-to-performance profile suitable for targeted workload…
- IOPS optimized…
- …block storage …for example database instances as virtual machines on OpenStack
- …lowest cost per IOPS …highest IOPS per GB …99th percentile latency consistency
- …SSD storage (more expensive) …improves IOPS & total throughput
- …15k RPM SAS drives …separate SSD journals (handle frequent writes)
- …typically 3x replication for HDD …2x replication for SSD
- Throughput optimized…
- …fast data storage …block or object storage
- …HDD with acceptable total throughput characteristics
- …lowest cost per MB …highest MB/s per TB/BTU/Watt …97th percentile latency consistency
- …SSD journals (more expensive) …improve write performance
- …typically 3x replication
- Capacity optimized…
- …(inexpensive) big data storage …slower and less expensive SATA drives
- …lowest cost/BTU/Watt per TB
- …erasure coding common for maximizing usable capacity
- …co-locate journals …cheaper than SSD journals
Multiple performance domains can coexist in a Ceph storage cluster
OSD hardware within a pool should be identical
- …avoid dissimilar hardware in the same pool!
- Same…
- …controller, drive size, RPMs, seek times, I/O bandwidth
- …network throughput, journal configuration
- …simplifies provisioning …streamlines troubleshooting
Network requires sufficient throughput during backfill and recovery operations…
- …consider storage density when planning for rebalancing after hardware faults
- …prefer that a storage cluster recovers as quickly as possible
- …avoid IOPS starvation due to network latency …be mindful of network link over-subscription
- …segregate intra-cluster traffic from client-to-cluster traffic
- …public network …client traffic …communication to Ceph Monitors
- …storage network …Ceph OSD …heartbeats, replication, backfilling, and recovery traffic
- …(public and storage cluster networks on separate network cards)
- …performance-related problems in Ceph usually begin with a networking issue
- …jumbo frames for a better CPU per bandwidth ratio
- …non-blocking network switch back-plane
- …same MTU throughout all networking devices (both public and storage networks)
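The public/storage split above maps to two settings in ceph.conf; a minimal sketch written to a temporary file (the subnets are examples, not from this document):

```shell
# Hypothetical network split for /etc/ceph/ceph.conf -- clients talk on the
# public network, OSD replication/recovery stays on the cluster network
cat > /tmp/ceph-network.conf <<'EOF'
[global]
public_network  = 10.1.1.0/24
cluster_network = 10.2.2.0/24
EOF
```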
Data Durability
Ceph is fully consistent…
- …data shared across availability zones (AZs), racks, nodes, disks
- …replication configurable to failure domains
- …even in extreme disasters, data can be recovered manually
- Erasure coding support…
- …commonly used for object storage
- …works on block storage & shared file-systems
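Erasure coding is configured through a profile attached to a pool; a hedged sketch, where the profile name `ec42` and the k/m chunk counts are hypothetical examples:

```
# 4 data chunks + 2 coding chunks, spread across hosts (example values)
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd erasure-code-profile get ec42              # inspect the profile
ceph osd pool create ecpool 64 64 erasure ec42      # erasure-coded pool
```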
Architecture
Ceph deployment consists of three types of daemons…
- Ceph OSD Daemon (OSD)
- …stores data …perform data replication, erasure coding
- …rebalancing, recovery, monitoring and reporting functions
- …store all data as objects in a flat namespace (no directory hierarchies)
- …objects have a cluster-wide unique identifier, binary data, and metadata
- Ceph Monitor (MON) — Master copy of cluster map
- Ceph Manager
- …information about placement groups
- …process metadata and host metadata
- …RESTful monitoring APIs
- Ceph Client — Read/write data from the OSDs
- …retrieve the latest cluster map from MONs
- …computes the object's placement group and the primary OSD for data access
- …no intermediary server, broker or bus between the client and the OSD
- …clients define the semantics for the data format (i.e. block device)
CRUSH Algorithm
CRUSH (Controlled Replication Under Scalable Hashing)…
- …track the location of storage objects
- …no central authority/lookup table (distributed hashing)
- …efficiently compute information about object location
- …enables massive scale …data location algorithmically computed on clients
CRUSH maps…
- …describes the topology of cluster resources
- …implemented as a simple hierarchy (acyclic graph) and a ruleset
- …the number of placement groups, and the pool interface
- …map exists both on client nodes as well as Ceph Monitor (MON)
- …MONs host master copy of the CRUSH maps
- …MONs not involved in data I/O path
CRUSH Ruleset…
- …Ceph uses the CRUSH map to implement failure domains…
- …and performance domains for the storage media
- …supports multiple hierarchies to separate one hardware performance profile from another
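Separating performance domains can be done with device-class CRUSH rules; a sketch in which the rule names, pool name and device classes are assumptions (recent releases use the `crush_rule` pool key, older ones `crush_ruleset`):

```
# one rule per hardware performance profile (example names)
ceph osd crush rule create-replicated fast default host ssd
ceph osd crush rule create-replicated big  default host hdd
ceph osd pool set rbd-pool crush_rule fast   # pin a pool to the fast rule
```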
Pools
Pools¹ — Logical partitions …store RADOS objects
- …administrators can create pools for particular types of data
- Pool type
- …erasure coding …data durability method is pool-wide
- …completely transparent to the client
- …contain a defined number of placement groups (PGs)
- …sharding a pool into placement groups
- …with a configured replication level
- Attributes for ownership/access
- Balance number of PGs per pool with the number of PGs per OSD
- 50-200 PGs per OSD to balance out memory and CPU requirements and per-OSD load
- Total Placement Groups = (OSDs × (50–200)) / number of replicas
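A worked instance of this formula as a small shell calculation; the OSD count and per-OSD target are illustrative:

```shell
# 12 OSDs, target ~100 PGs per OSD, replication size 3 (example numbers)
osds=12; pgs_per_osd=100; replicas=3
total_pgs=$(( osds * pgs_per_osd / replicas ))
# pg_num is commonly rounded up to the next power of two
pg_num=1
while [ "$pg_num" -lt "$total_pgs" ]; do pg_num=$(( pg_num * 2 )); done
echo "total=$total_pgs pg_num=$pg_num"
```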
ceph df # show usage
ceph osd lspools # list pool numbers
ceph osd pool ls detail # list pools with additional information
ceph osd dump | grep ^pool # list replication size
ceph osd pool create <name> <pg_num> <pgp_num> # create a pool
ceph osd pool get <pool> <key> # read configuration attribute
ceph osd pool set <pool> <key> <value> # set configuration attribute
Keys:
- size — number of object replicas
- min_size — minimum number of replicas available for I/O
- pg_num, pgp_num — (effective) number of PGs to use when calculating data placement
- crush_ruleset — ruleset to use for mapping object placement in the cluster
States:
- active+clean — optimum state
- degraded — not enough replicas to meet requirement
- down — no OSD available storing the PG
- inconsistent — PG not consistent across different OSDs
- repair — correcting inconsistent PGs to meet replication requirements
Delete a pool
# pool name twice
ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
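On recent releases pool deletion is additionally guarded by a monitor flag; a hedged sketch (the pool name is an example):

```
# deletion is refused until the monitors allow it (recent releases)
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false   # re-enable the guard
```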
Monitor Servers (MONs)
Datastore for the health of the entire cluster…
- …hosts master copy of the CRUSH, OSD & MON maps
- …minimum of three MONs for a cluster quorum in production
- …there must be an odd number >=3 to provide consensus…
- …for the use of distributed decision-making with the Paxos algorithm
- …non-quorate pools become unavailable
- OSD map reflects the OSD daemons operating in the cluster
systemctl status ceph-mon@$(hostname) # daemon state
ceph-mon -d -i $(hostname) # run daemon in foreground
/var/log/ceph/ceph-mon.$(hostname).log # default log location
ceph mon stat # state of all monitors
ceph -f json-pretty quorum_status # quorum information
ceph-conf --name mon.$(hostname) --show-config-value log_file
# get location of the log file
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
# access a monitor's admin socket
Manipulate the MON map:
monmap=/tmp/monmap.bin
ceph mon getmap -o $monmap # print the MON map if a quorum exists
ceph-mon --extract-monmap $monmap -i $(hostname) # extract the MON map while ceph-mon is stopped
monmaptool $monmap --print # print MON map from file
monmaptool $monmap --clobber --rm lxmon03 # remove a monitor
ceph-mon --inject-monmap $monmap -i $(hostname) # store new map
Object Storage Devices (OSDs)
Node hosting physical storage:
ceph-osd is the object storage daemon, responsible for storing objects on a local file-system and providing access to them over the network
- Requires a storage device accessible via a supported file-system (xfs, btrfs) with extended attributes
- OSD daemons are numerically identified: osd.<id>, and have the following states: in/out of the cluster, and up/down
- Primary OSDs are responsible for replication, coherency, re-balancing and recovery
- Secondary OSDs are under control of a primary OSD, and can become primary
- Small, random I/O written sequentially to the local OSD journal (improve performance with an SSD)
  - Enables atomic updates to objects, and replay of the journal on restart
  - Write ACK when the minimum number of replica journals is written
  - Journal syncs to the file-system periodically to reclaim space
/var/log/ceph/*.log # logfiles on storage servers
ceph osd stat # show state
ceph osd df # storage utilization by OSD
ceph node ls osd # list OSDs
ceph osd dump # show OSD map
# OSD ID to host IP, port mapping
ceph osd dump | grep '^osd.[0-9]*' | tr -s ' ' | cut -d' ' -f1,2,14
ceph osd tree # OSD/CRUSH map as tree
ceph osd getmaxosd # show number of available OSDs
ceph osd map <pool> <objname> # identify object location
/var/run/ceph/* # admin socket
ceph daemon <socket> <command> # use the admin socket
ceph daemon <socket> status # show identity
Manipulate the CRUSH map:
ceph osd getcrushmap -o /tmp/crushmap.bin # extract the CRUSH map
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt # decompile binary CRUSH map
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.bin # compile CRUSH map
ceph osd setcrushmap -i /tmp/crushmap.bin # inject CRUSH map
Remove an OSD from production:
ceph osd out <num> # remove OSD, begin rebalancing
systemctl stop ceph-osd@<num> # stop the OSD daemon on the server
ceph osd crush remove osd.<num> # remove the OSD from the CRUSH map
ceph auth del osd.<num> # remove authorization credential
ceph osd rm osd.<num> # remove OSD from configuration
# remove from clean ceph.conf if required
Placement Groups (PGs)
Logical collection of objects:
- CRUSH assigns objects to placement groups
- CRUSH assigns a placement group to a primary OSD
- The primary OSD uses CRUSH to replicate the PGs to the secondary OSDs
- Re-balance/recover works on all objects in a placement group
- Placement Group IDs (pgid): <pool_num>.<pg_id> (hexadecimal)
ceph pg stat # show state
ceph pg ls-by-osd $num # list PGs stored on an OSD
ceph pg map $pgid # map a placement group
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded
# statistics for stuck PGs
ceph pg $pgid query # statistics for a particular placement group
ceph pg scrub $pgid # check primary and replicas
osdmap e20 pg 2.f22 (2.22) -> up [0,1,2] acting [0,1,2]
│ │ │ └── acting set of OSDs responsible for a particular PG
│ │ └───────────── matches "acting set" unless rebalancing in progress
│ └──────────────────────────────── placement group number
└──────────────────────────────────── monotonically increasing OSD map version number
Recommended number of PGs (pg_num) in relation to the number of OSDs:
OSDs PGs
<5 128
5~10 512
10~50 4096
>50 50-100 per OSD
Deployment
Recommended deployment method is cephadm…
- …preferred bare-metal installation and management method
- cephadm-ansible …simplify workflows that are not covered by cephadm
- Alternatives …manual installation
- …ceph-ansible still maintained …not integrated with the orchestrator APIs
- …Rook for deployment in Kubernetes
- Deploying a new Ceph cluster…
- …requirements …Python 3, Systemd, Podman (or Docker), NTP, LVM2
- …list of supported versions at latest releases
Packages
RPM packages…
- …download.ceph.com host RPM package for Enterprise Linux
- …Fedora cephadm package
# ...package provided by the Fedora community
dnf -y install cephadm
# ...packages provided by the Ceph community
curl --silent --remote-name --location https://download.ceph.com/rpm-18.2.0/el9/noarch/cephadm
chmod +x cephadm ; ./cephadm add-repo --release reef ; ./cephadm install
…alternatively add the package repository manually…
sudo dnf install -y epel-release
# ...for Enterprise Linux 9
sudo rpm --import 'https://download.ceph.com/keys/release.asc'
sudo dnf install -y https://download.ceph.com/rpm-18.2.0/el9/noarch/ceph-release-1-1.el9.noarch.rpm
dnf makecache ; sudo dnf install -y cephadm
cephadm
Utility used to manage a Ceph cluster…
- …manages the full lifecycle of a Ceph cluster
- …does not rely on external configuration tools (like Ansible)
- …builds on the orchestrator API
- …uses SSH connections to interact with hosts
- …deploys services using standard container images
- …sequence is started from the command line on the first host of the cluster
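A minimal bootstrap sequence might look as follows; the IP addresses and hostname are placeholders, not values from this document:

```
cephadm bootstrap --mon-ip 10.1.1.10       # first MON/MGR on this host
ceph orch host add node2 10.1.1.11         # add further hosts over SSH
ceph orch apply mon 3                      # scale the monitors
ceph orch apply osd --all-available-devices  # consume unused disks as OSDs
```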
Operations
ceph status # summary of state
ceph -s
ceph health detail
ceph -w # watch mode
ceph-deploy --overwrite-conf config push <node> # update the configuration after changes
Authentication
cephx authentication protocol (…similar to Kerberos)
- …MONs can authenticate users and distribute keys
- …no single point of failure or bottleneck
- …returns a session key (ticket) used to obtain Ceph services
- …Ceph monitors and OSDs share a secret
Administrator must set up users…
# lists authentication state
ceph auth list
# create a new user
ceph auth get-or-create client.<name> <caps>
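A hedged example of a user with restricted capabilities; the user name, pool and keyring path are assumptions:

```
# RBD-only client, limited to one pool (example names)
ceph auth get-or-create client.rbduser \
    mon 'profile rbd' osd 'profile rbd pool=rbd-pool' \
    -o /etc/ceph/ceph.client.rbduser.keyring
ceph auth get client.rbduser   # verify the capabilities
```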
CephFS
CephFS — POSIX-compliant distributed file-system…
- …with Linux kernel client & FUSE client
- …the RADOS store hosts metadata & data pools
- MDSs (Metadata Server Daemons) — Coordinate client data access
Operate on the CephFS file-systems² in your Ceph cluster:
# list all file-system names
ceph fs ls
# destroy a file-system
ceph fs rm $name --yes-i-really-mean-it
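Creating and mounting a file-system might look as follows; the volume name and mount options are assumptions (on recent releases the `fs volume` interface creates the data/metadata pools and MDS automatically):

```
ceph fs volume create cephfs               # pools + MDS in one step
ceph fs status cephfs                      # MDS state, pool usage
mount -t ceph :/ /mnt/cephfs -o name=admin # kernel client mount (example)
```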
Clients
Clients are distributed in the ceph-common package
/etc/ceph/ceph.conf # default location for configuration
/etc/ceph/ceph.client.admin.keyring # default location for admin key
RADOS
- Objects store data and have: a name, the payload (data), and attributes
- Object namespace is flat
rados lspools # list pools
rados df # show pool statistics
rados mkpool <pool> # create a new pool
rados rmpool <pool> # delete a pool
rados ls -p <pool> # list contents of pool
rados -p <pool> put <objname> <file> # store object
rados -p <pool> rm <objname> # delete object
RBD
Block interface on top of RADOS:
- Images (single block devices) striped over multiple RADOS objects/OSDs
- The default pool is called rbd
mod(probe|info) rbd # kernel module
/sys/bus/rbd/ # live module data
rbd ls [-l] [<pool>] # list block devices
rbd info [<pool>/]<name> # introspect image
rbd du [<name>] # list size of images
rbd create --size <MB> [<pool>/]<name> # create image
rbd map <name> # map image to device
rbd showmapped # list device mappings
rbd unmap <name> # unmap image device
rbd rm [<pool>/]<name> # remove image
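The commands above combine into a typical image lifecycle; the pool/image names and mount point are hypothetical:

```
# create, map, format, use, and release an image (example names)
rbd create --size 4096 rbd-pool/vol01
dev=$(rbd map rbd-pool/vol01)    # prints the block device, e.g. /dev/rbd0
mkfs.xfs "$dev"
mount "$dev" /mnt/vol01
umount /mnt/vol01 && rbd unmap "$dev"
```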
Snapshots
- Read-only copy of the state of an image at a particular point in time
- Layering allows for cloning snapshots
- Stop I/O before taking a snapshot (internal file-systems must be in a consistent state)
- Rolling back overwrites the current version of the image
- It is faster to clone from a snapshot than to rollback to return to a pre-existing state
rbd snap create <pool>/<image>@<name> # snapshot image
rbd snap ls <pool>/<image> # list snapshots of image
rbd snap rollback <pool>/<image>@<name> # rollback to snapshot
rbd snap rm <pool>/<image>@<name> # delete snapshot
rbd snap purge <pool>/<image> # delete all snapshots
rbd snap protect <pool>/<image>@<name> # protect snapshot before cloning
rbd snap unprotect <pool>/<image>@<name> # revert protection
rbd clone <pool>/<image>@<name> <pool>/<image> # clone snapshot to new image
rbd children <pool>/<image>@<name> # list snapshot descendants
rbd flatten <pool>/<image> # decouple from parent snapshot
Libvirt
Configure a pool…
ceph osd pool create libvirt 128 128 # create a pool for libvirt
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt'
# create a user for libvirt
virsh secret-define --file secret.xml # define the secret
uuid=$(virsh secret-list | grep client.libvirt | cut -d' ' -f 2)
# uuid of the secret
key=$(ceph auth get-key client.libvirt) # ceph client key
virsh secret-set-value --secret "$uuid" --base64 "$key"
# set the UUID of the secret
Libvirt secret.xml associated with a Ceph RBD:
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
Usage
qemu-img create -f raw rbd:<pool>/<image> <size> # create a new image block device
qemu-img info rbd:<pool>/<image> # block device states
qemu-img resize rbd:<pool>/<image> <size> # adjust image size
# create new RBD image from source qcow2 image, and vice versa
qemu-img convert -f qcow2 <path> -O rbd rbd:<pool>/<image>
qemu-img convert -f rbd rbd:<pool>/<image> -O qcow2 <path>
RBD image as <disk> entry for a virtual machine; replace:
- source/name= with <pool>/<image>
- host/name= with the IP address of a monitor server (multiples supported)
- secret/uuid= with the corresponding libvirt secret UUID
<disk type="network" device="disk">
  <source protocol='rbd' name='libvirt/lxdev01.devops.test'>
    <host name='10.1.1.22' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='ec8a08ff-59b2-49fd-8162-532c9b0a2ed6'/>
  </auth>
  <target dev="vda" bus="virtio"/>
</disk>
ceph osd map <pool> <image> # map image to PGs
virsh qemu-monitor-command --hmp <instance> 'info block' # configuration of the VM instance
virsh dumpxml <instance> | xmlstarlet el -v | grep rbd # get attached RBD image
Test
# work in a disposable path
pushd $(mktemp -d /tmp/$USER-vagrant-XXXXXX)
ssh-keygen -f id_rsa # ...press enter for an empty password
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
    libvirt.cpus = 1
    libvirt.qemu_use_session = false
  end
  config.vm.provision "file",
    source: "id_rsa", destination: "/home/vagrant/.ssh/id_rsa"
  public_key = File.read("id_rsa.pub")
  config.vm.provision "shell", privileged: false, inline: <<-SHELL
    mkdir -p /home/vagrant/.ssh
    chmod 700 /home/vagrant/.ssh
    echo '#{public_key}' >> /home/vagrant/.ssh/authorized_keys
    chmod -R 600 /home/vagrant/.ssh/authorized_keys
    echo 'Host 192.168.*.*' >> /home/vagrant/.ssh/config
    echo 'StrictHostKeyChecking no' >> /home/vagrant/.ssh/config
    echo 'UserKnownHostsFile /dev/null' >> /home/vagrant/.ssh/config
    chmod -R 600 /home/vagrant/.ssh/config
  SHELL
  {
    'adm1' => '172.28.128.10',
    'mon1' => '172.28.128.11',
    'mon2' => '172.28.128.12',
    'mon3' => '172.28.128.13',
    'osd1' => '172.28.128.14',
    'osd2' => '172.28.128.15',
    'osd3' => '172.28.128.16',
    'osd4' => '172.28.128.17',
    'osd5' => '172.28.128.18',
    'osd6' => '172.28.128.19',
    'osd7' => '172.28.128.20'
  }.each_pair do |name,ip|
    config.vm.define "#{name}" do |node|
      node.vm.hostname = name
      node.vm.box = "almalinux/9"
      node.vm.network :private_network, ip: ip
    end
  end
end
Footnotes
1. Pools, Ceph Documentation — https://docs.ceph.com/en/latest/rados/operations/pools ↩︎
2. CephFS Administrative Commands — https://docs.ceph.com/en/latest/cephfs/administration ↩︎