Container Orchestration

Containers

Published

March 12, 2018

Modified

January 4, 2024

Automated arrangement, coordination, and management of software containers

Containerized workloads: portable, insulated, independent dependencies

COE - Container Orchestration Engines, aka. Container as a Service (CaaS)
WMS - HPC Workload Management Systems with container support

Open-Source for container orchestration:

Name	Family	Cf.
PBS (Portable Batch System)	WMS	http://pbspro.org
HT (High Throughput) Condor	WMS	http://research.cs.wisc.edu/htcondor
Slurm	WMS	https://slurm.schedmd.com
Mesos¹ (with Marathon)	COE	http://mesos.apache.org
Kubernetes	COE	https://kubernetes.io
Docker Swarm	COE	https://docs.docker.com/engine/swarm

¹Mesos is a common resource management system hosting multiple distributed computing (workload management) frameworks (2-level scheduling).

Slurm and/or/vs Kubernetes, Tim Wickberg, SchedMD, 2023/11
https://slurm.schedmd.com/SC23/Slurm-and-or-vs-Kubernetes.pdf

User Workloads

Users Submit a wide variety of computational applications (jobs, tasks) for processing

Different parallel execution paradigms (levels of parallelism)
Different execution requirements (i.e. run-time, required hardware (CPUs, RAM,etc.))

Workload	Description
service	Service (daemon) process (long running, persistent execution)
periodical	Processes executed in a defined interval
batch	Single (independent) processes (sequential execution)
array	Pleasantly parallel processes (asynchronously executed)
parallel	Synchronously parallel processes (simultaneous execution)
analytics	Combination of the above categories

Comparing workloads

Traditional HPC - Strict resource constrains, highly parallel, performance oriented (fast storage, low latency interconnect)
Data analytics - Malleable requirements, load-balanced, fault-tolerant (replication to avoid data loss)

Workload Execution

Match/execute user workloads efficently on available computing resources:

Classification of workload schedulers:

Type x	Description
monolithic	Single central scheduling for all tasks (i.e. all HPC WMS, Kubernetes, Swarm)
two-level	Separate resources allocation from task placement (i.e. Mesos)
shared state	Shared state (protocol) for multiple (application specific) schedulers (no examples)

Many possible resource metrics and tunable settings for the scheduling algorithm

Resource - amount, type, duration, cost/power, topology…
Account - user, group, project, tags, limits…
Policies - job size/age, priority, fair share, reservation,…

Methods to optimize mapping of requested resources to the available resources:

Method	Description
time sharing	Allocate multiple workloads from one or more user on a single node
backfilling	Schedule pending workloads out of order to maximize utilization
job chunking	Run similar workloads of multiple users simultaneously
bin packing	Group workloads of multiple user to optimize utilization
gang scheduling	Allow users to submit multiple process within a single workload
job dependencies	Allow user to define workflows for execution (direct acyclic graphs (DAGs))

User defined factors contributing to scheduling decisions:

Factor	Description
fair-share	Uses historic resource utilization as a factor to balance target shares
quality of service	Special treatment of jobs based on specified criteria
reservation	All to reserve resources in advance
job dependencies	Task workflows/sequencing defined by users

Multi-factor priority is the use of multiple factors to determine the execution of workloads.

Workload Management

Responsibilities/components of a workload management system:

Component	Description
scheduling	Allocate resources and assign a workload from the queue
lifecycle management	Receive and queue workloads, prioritize/sort candidate workloads
resource management	Collect resource capabilities and state information for the scheduler
execution/monitoring	Launch the workload, track its state and collect performance metrics

Resource Management

Typically involves following capabilities:

Capability	Description
heterogeneous resources	The scheduler can accommodate different hardware configurations
allocation policy	Prioritization of jobs according to specific resource requirements (i.e. run-time)
consumable resources	Enforced static (CPUs, RAM) and dynamic (load, bandwidth) resource constrains
network-aware scheduling	Consideration of network topology by the scheduler for job allocation
accelerator support	Specialized hardware GPUs, FPGAs, PHIs (Intel)
power capping	Limit the power consumption of workloads

At minimum resources are managed at the granularity of a node
Typically fine-grained management of CPU cores & memory
Eventually includes software licenses, network bandwidth, shared storage space, accelerators

Workload placement

Usually highly configurable:

Replacement/reordering - Dynamically manged job order in the queue to react to state changes (resource, jobs, etc)
Power-aware scheduling - Drain, shutdown unused nodes to minimize power consumption
Prolog/epilog support - Allow scripts to be executed before/after a job
Data staging - Support for copying files to local storage before job execution
Checkpointing - Allow applications to save execution state in intervals (fault-tolerance)
Job migration - Move jobs during execution to mitigate failure states (rebalancing of long-running or service-oriented jobs)
Job restarting - Automatic restart of aborted of failed jobs
Job preemption - Suspend low-priority jobs to free resources of required

--- title: Container Orchestration categories: - Containers date: 2018/03/12 date-modified: 2024/01/04 --- _Automated arrangement, coordination, and management of software containers_ Containerized workloads: portable, insulated, independent dependencies * **COE** - Container Orchestration Engines, aka. Container as a Service (CaaS) * **WMS** - HPC Workload Management Systems with container support Open-Source for container orchestration: Name | Family | Cf. ------------------------------------|--------|-------------------------------- PBS (Portable Batch System) | WMS | <http://pbspro.org> HT (High Throughput) Condor | WMS | <http://research.cs.wisc.edu/htcondor> Slurm | WMS | <https://slurm.schedmd.com> Mesos¹ (with Marathon) | COE | <http://mesos.apache.org> Kubernetes | COE | <https://kubernetes.io> Docker Swarm | COE | <https://docs.docker.com/engine/swarm> ¹Mesos is a common resource management system hosting multiple distributed computing (workload management) frameworks (2-level scheduling). Slurm and/or/vs Kubernetes, Tim Wickberg, SchedMD, 2023/11 <https://slurm.schedmd.com/SC23/Slurm-and-or-vs-Kubernetes.pdf> ## User Workloads Users Submit a wide variety of **computational applications** (jobs, tasks) for processing * Different parallel **execution paradigms** (levels of parallelism) * Different **execution requirements** (i.e. run-time, required hardware (CPUs, RAM,etc.)) Workload | Description --------------|------------------------------------ service | Service (daemon) process (long running, persistent execution) periodical | Processes executed in a defined interval batch | Single (independent) processes (sequential execution) array | Pleasantly parallel processes (asynchronously executed) parallel | Synchronously parallel processes (simultaneous execution) analytics | Combination of the above categories Comparing workloads - Traditional HPC - Strict resource constrains, highly parallel, performance oriented (fast storage, low latency interconnect) - Data analytics - Malleable requirements, load-balanced, fault-tolerant (replication to avoid data loss) ## Workload Execution **Match/execute** user workloads **efficently** on available computing resources: Classification of workload schedulers: Type x | Description -------------|-------------------------------- monolithic | Single central scheduling for all tasks (i.e. all HPC WMS, Kubernetes, Swarm) two-level | Separate resources allocation from task placement (i.e. Mesos) shared state | Shared state (protocol) for multiple (application specific) schedulers (no examples) Many possible **resource metrics** and **tunable settings** for the scheduling algorithm * Resource - amount, type, duration, cost/power, topology... * Account - user, group, project, tags, limits... * Policies - job size/age, priority, fair share, reservation,... Methods to **optimize mapping** of requested resources to the available resources: Method | Description ------------------|------------------------------------------------- time sharing | Allocate multiple workloads from one or more user on a single node backfilling | Schedule pending workloads out of order to maximize utilization job chunking | Run similar workloads of multiple users simultaneously bin packing | Group workloads of multiple user to optimize utilization gang scheduling | Allow users to submit multiple process within a single workload job dependencies | Allow user to define workflows for execution (direct acyclic graphs (DAGs)) **User defined factors** contributing to scheduling decisions: Factor | Description -------------------|--------------------------------------------------- fair-share | Uses historic resource utilization as a factor to balance target shares quality of service | Special treatment of jobs based on specified criteria reservation | All to reserve resources in advance job dependencies | Task workflows/sequencing defined by users **Multi-factor priority** is the use of multiple factors to determine the execution of workloads. ## Workload Management Responsibilities/components of a workload management system: Component | Description ---------------------|---------------------------------------- scheduling | Allocate resources and assign a workload from the queue lifecycle management | Receive and queue workloads, prioritize/sort candidate workloads resource management | Collect resource capabilities and state information for the scheduler execution/monitoring | Launch the workload, track its state and collect performance metrics ### Resource Management Typically involves following capabilities: Capability | Description -------------------------|------------------------------------------- heterogeneous resources | The scheduler can accommodate different hardware configurations allocation policy | Prioritization of jobs according to specific resource requirements (i.e. run-time) consumable resources | Enforced static (CPUs, RAM) and dynamic (load, bandwidth) resource constrains network-aware scheduling | Consideration of network topology by the scheduler for job allocation accelerator support | Specialized hardware GPUs, FPGAs, PHIs (Intel) power capping | Limit the power consumption of workloads * At minimum resources are managed at the granularity of a node * Typically fine-grained management of CPU cores & memory * Eventually includes software licenses, network bandwidth, shared storage space, accelerators ### Workload placement Usually highly configurable: * Replacement/reordering - Dynamically manged job order in the queue to react to state changes (resource, jobs, etc) * Power-aware scheduling - Drain, shutdown unused nodes to minimize power consumption * Prolog/epilog support - Allow scripts to be executed before/after a job * Data staging - Support for copying files to local storage before job execution * Checkpointing - Allow applications to save execution state in intervals (fault-tolerance) * Job migration - Move jobs during execution to mitigate failure states (rebalancing of long-running or service-oriented jobs) * Job restarting - Automatic restart of aborted of failed jobs * Job preemption - Suspend low-priority jobs to free resources of required