Container Orchestration
Automated arrangement, coordination, and management of software containers
Containerized workloads: portable, insulated, independent dependencies
- COE - Container Orchestration Engines, aka. Container as a Service (CaaS)
- WMS - HPC Workload Management Systems with container support
Open-Source for container orchestration:
| Name | Family | Cf. | 
|---|---|---|
| PBS (Portable Batch System) | WMS | http://pbspro.org | 
| HT (High Throughput) Condor | WMS | http://research.cs.wisc.edu/htcondor | 
| Slurm | WMS | https://slurm.schedmd.com | 
| Mesos¹ (with Marathon) | COE | http://mesos.apache.org | 
| Kubernetes | COE | https://kubernetes.io | 
| Docker Swarm | COE | https://docs.docker.com/engine/swarm | 
¹Mesos is a common resource management system hosting multiple distributed computing (workload management) frameworks (2-level scheduling).
Slurm and/or/vs Kubernetes, Tim Wickberg, SchedMD, 2023/11
https://slurm.schedmd.com/SC23/Slurm-and-or-vs-Kubernetes.pdf
User Workloads
Users Submit a wide variety of computational applications (jobs, tasks) for processing
- Different parallel execution paradigms (levels of parallelism)
- Different execution requirements (i.e. run-time, required hardware (CPUs, RAM,etc.))
| Workload | Description | 
|---|---|
| service | Service (daemon) process (long running, persistent execution) | 
| periodical | Processes executed in a defined interval | 
| batch | Single (independent) processes (sequential execution) | 
| array | Pleasantly parallel processes (asynchronously executed) | 
| parallel | Synchronously parallel processes (simultaneous execution) | 
| analytics | Combination of the above categories | 
Comparing workloads
- Traditional HPC - Strict resource constrains, highly parallel, performance oriented (fast storage, low latency interconnect)
- Data analytics - Malleable requirements, load-balanced, fault-tolerant (replication to avoid data loss)
Workload Execution
Match/execute user workloads efficently on available computing resources:
Classification of workload schedulers:
| Type x | Description | 
|---|---|
| monolithic | Single central scheduling for all tasks (i.e. all HPC WMS, Kubernetes, Swarm) | 
| two-level | Separate resources allocation from task placement (i.e. Mesos) | 
| shared state | Shared state (protocol) for multiple (application specific) schedulers (no examples) | 
Many possible resource metrics and tunable settings for the scheduling algorithm
- Resource - amount, type, duration, cost/power, topology…
- Account - user, group, project, tags, limits…
- Policies - job size/age, priority, fair share, reservation,…
Methods to optimize mapping of requested resources to the available resources:
| Method | Description | 
|---|---|
| time sharing | Allocate multiple workloads from one or more user on a single node | 
| backfilling | Schedule pending workloads out of order to maximize utilization | 
| job chunking | Run similar workloads of multiple users simultaneously | 
| bin packing | Group workloads of multiple user to optimize utilization | 
| gang scheduling | Allow users to submit multiple process within a single workload | 
| job dependencies | Allow user to define workflows for execution (direct acyclic graphs (DAGs)) | 
User defined factors contributing to scheduling decisions:
| Factor | Description | 
|---|---|
| fair-share | Uses historic resource utilization as a factor to balance target shares | 
| quality of service | Special treatment of jobs based on specified criteria | 
| reservation | All to reserve resources in advance | 
| job dependencies | Task workflows/sequencing defined by users | 
Multi-factor priority is the use of multiple factors to determine the execution of workloads.
Workload Management
Responsibilities/components of a workload management system:
| Component | Description | 
|---|---|
| scheduling | Allocate resources and assign a workload from the queue | 
| lifecycle management | Receive and queue workloads, prioritize/sort candidate workloads | 
| resource management | Collect resource capabilities and state information for the scheduler | 
| execution/monitoring | Launch the workload, track its state and collect performance metrics | 
Resource Management
Typically involves following capabilities:
| Capability | Description | 
|---|---|
| heterogeneous resources | The scheduler can accommodate different hardware configurations | 
| allocation policy | Prioritization of jobs according to specific resource requirements (i.e. run-time) | 
| consumable resources | Enforced static (CPUs, RAM) and dynamic (load, bandwidth) resource constrains | 
| network-aware scheduling | Consideration of network topology by the scheduler for job allocation | 
| accelerator support | Specialized hardware GPUs, FPGAs, PHIs (Intel) | 
| power capping | Limit the power consumption of workloads | 
- At minimum resources are managed at the granularity of a node
- Typically fine-grained management of CPU cores & memory
- Eventually includes software licenses, network bandwidth, shared storage space, accelerators
Workload placement
Usually highly configurable:
- Replacement/reordering - Dynamically manged job order in the queue to react to state changes (resource, jobs, etc)
- Power-aware scheduling - Drain, shutdown unused nodes to minimize power consumption
- Prolog/epilog support - Allow scripts to be executed before/after a job
- Data staging - Support for copying files to local storage before job execution
- Checkpointing - Allow applications to save execution state in intervals (fault-tolerance)
- Job migration - Move jobs during execution to mitigate failure states (rebalancing of long-running or service-oriented jobs)
- Job restarting - Automatic restart of aborted of failed jobs
- Job preemption - Suspend low-priority jobs to free resources of required