Central Processing Unit (CPU)
Architecture
A computer is a machine to process information (data):
- A processor (typically an integrated circuit) performs operations on data
- It processes input information to generate the desired output information
- Input/output data is loaded/stored from/to memory, a data storage area
- Electronic devices operate on binary signals (electricity on/off)
- Binary is expressed by two symbols, 0 and 1 ➜ binary digits (bits)
- Numbers and text are represented as binary patterns ➜ combinations of zeros and ones
Example binary patterns (8-bit ASCII encoding):
Binary | Character |
---|---|
0011 0000 | 0 (zero) |
0011 0001 | 1 (one) |
0011 0010 | 2 (two) |
0100 0001 | A |
0100 0010 | B |
0100 0011 | C |
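The same encoding can be verified on a Linux system, e.g. with the xxd tool (assuming it is installed):
printf '012ABC' | xxd -b    # bit-level dump of the ASCII characters 0, 1, 2, A, B, C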
Machine Code
Machine code (machine language) ➜ instructions executed by a processor
- Instruction ➜ operation code (opcode), operand
- Operand ➜ (memory address of the) data to operate on
- Opcodes & operands encoded as binary code
- Machine code program ➜ sequence of instructions (opcodes & operands)
- Assembly language
- Symbolic representation of machine instructions
- Symbolic names for opcodes ➜ mnemonic codes
- Assembler ➜ translates assembly language into machine code
Example machine operation codes (Intel 8085):
Opcode (binary) | Mnemonic | Description |
---|---|---|
1000 0111 | ADD | Add contents of register to accumulator |
0011 1010 | LDA | Load data from memory address |
0011 0010 | STA | Store data to memory address |
0111 1001 | MOV | Move data between registers |
1100 0011 | JMP | Jump to memory address |
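A comparable mapping between mnemonics and opcode bytes can be observed with modern tools, e.g. GNU binutils on x86-64 (file names are illustrative):
as -o prog.o prog.s     # assembler: translate mnemonic assembly into machine code
objdump -d prog.o       # disassembler: print opcode bytes next to their mnemonics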
Object Code
Object code is a sequence of instructions in machine code generated by an assembler
- Executable programs are typically built from reusable code fragments (sub-programs, functions, modules)
- Fragments are individually compiled (translated) into object code
- A complete program is then built by combining various object code fragments
- Individual fragments are referenced using a symbol (e.g. a function name)
- Object file (relocatable format machine code)
- File format used to store object code and related data (e.g. ELF)
- Structured as separate segments/sections for different types of data
- A linker program combines object code to generate executable machine code
- Relocation assigns load addresses to various object code fragments
- A linker resolves symbols using the assigned memory locations and patches the calling object code to those locations (call instruction references)
- A loader places executable machine code into (main) memory and prepares it for execution
- Allocates regions in memory corresponding to segments in the machine code
- A program loader is part of a computer operating system (starts the program once it is loaded)
- Microcontrollers typically do not have a loader; instead the executable machine code is started directly from memory
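A minimal sketch of this tool chain on a Linux system (source file names are illustrative):
gcc -c main.c util.c           # compile each fragment into relocatable object code (main.o, util.o)
gcc -o program main.o util.o   # linker combines the fragments, relocates and resolves symbols
./program                      # the OS program loader maps the executable into memory and starts it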
Object Files
Object files contain five kinds of information
- Header: Metadata, code size, format specification, etc.
- Object code: Binary instructions and data generated by an assembler(/compiler)
- Relocation: List of places in the object code where a linker needs to patch addresses
- Symbols: Global symbols defined in the module, symbols to be imported from other modules
- Debugging: Information about the object code needed for debugging
ELF (Executable and Linking Format) files come in three flavors:
- Relocatable: Created by assemblers(/compilers), needs to be processed by a linker
- Executable: Address relocation done, symbols resolved (except for shared library symbols), ready for execution
- Shared object: Shared libraries including symbol information for linkers and executable code for run-time
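The flavor of an ELF file and its contents can be inspected with binutils (file names are illustrative):
readelf -h main.o     # ELF header, Type: REL (relocatable)
readelf -r main.o     # relocation entries the linker has to patch
nm main.o             # symbols defined in and imported by the module
readelf -h program    # Type: EXEC (executable) or DYN (shared object)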
Processor
A processing unit, aka CPU (Central Processing Unit):
- Active part of a computer ➜ datapath and control
- Control ➜ Commands the datapath, memory, and input/output devices according to machine instructions
- Datapath ➜ Components of a processor that perform arithmetic operations
- Processors ➜ fetch (read) instructions from memory before executing them
The separation of processor and memory distinguishes programmable computers
Control-flow Architecture
Stored program computer: instructions and data stored in memory
- Harvard architecture: Separate memory for data and instructions
- Two sets of address/data buses between processor and memory
- Allows simultaneous instruction and data fetches
- Modified Harvard architecture: Separate memory for data and instructions
- Instruction memory can be used to store data
- Two pieces of data can be loaded in parallel
- Von Neumann architecture: Single memory holds data and instructions
- Single set of address/data buses between processor and memory
- Values in memory interpreted depending on a control signal
- Current instruction identified by the instruction pointer (program counter)
- Sequential instruction processing (fetch, execute, and complete) one at a time
- The instruction pointer is advanced sequentially except for control transfer
- Instructions executed in control flow order
Data-flow Architecture
- Instructions are executed based on the availability of input arguments, i.e. in data-flow order
- Conceptually no instruction pointer is required since execution is driven by data dependencies
- Inherently more parallel with the potential to execute many instructions at the same time
Control- vs data-flow trade-offs:
- Ease of programming
- Ease of compilation
- Extraction of parallelism (performance)
- Hardware complexity
Instruction Set Architecture
The Instruction Set Architecture (ISA) specifies how a programmer sees instructions to be executed:
- Defines an interface between software and hardware enabling the implementation of programs
- Modern ISAs are mostly control-flow architectures: x86, ARM, MIPS, SPARC, POWER
- ISAs have a very long lifetime (compared to µarch) staying backwards-compatible while being extended with additional instructions
The ISA includes all functionality exposed to the programmer:
- Instructions: Opcodes, addressing modes, data types, registers, condition code…
- Memory: Address space, alignment, virtual memory…
- Interrupt/exception handling, access control, priority/privileges
- Task/thread management, power & thermal management
- Multi-threading & multi-processing support
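On Linux the ISA and its extensions exposed by the processor can be listed, for example:
lscpu                         # architecture and supported ISA extensions (flags)
grep -m1 flags /proc/cpuinfo  # instruction set extensions (sse4_2, avx2, …)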
ISA Types:
- Reduced Instruction Set Computer (RISC)
- Compact, uniform instruction size ➜ easier to decode ➜ facilitates pipelines
- Complexity implemented as a series of simpler instructions
- More lines of code ➜ bigger memory footprint
- Allows effective compiler optimization
- Complex Instruction Set Computer (CISC)
- Extremely specific instructions (doing as much work as possible)
- Instructions not uniform in size ➜ difficult to decode
- Pipelining requires breaking instructions down into smaller components at the processor level
- High code density
- Complex processor hardware
- Very long instruction word (VLIW)
- Execute multiple instructions concurrently, in parallel
- Instruction Level Parallelism (ILP)
- Compiler bundles multiple instructions that can be executed in parallel into a single long instruction
Microarchitecture
The Microarchitecture (µarch) is the implementation of the ISA under specific design constraints and goals:
- The microprocessor is the physical representation (circuits) of the ISA and µarch
- Example: add instruction (ISA) vs adder implementation (µarch) [bit serial, ripple carry, carry lookahead, etc.]
- Example: x86 ISA has many implementations - Intel [2,3,4]86, Intel Pentium [Pro, 4], Intel Core, AMD…
- Design points: cost, performance, power consumption, reliability, time to market…
The µarch defines anything done in hardware and can execute instructions in any order (e.g. data-flow order) as long as it obeys the semantics specified by the ISA:
- Pipelined instruction execution (Intel 486)
- Multiple instructions at a time (Intel Pentium)
- Out-of-order execution (Intel Pentium Pro)
- Speculative execution, branch prediction, prefetching
- Memory access scheduling policy, cache (levels, size, associativity, replacement policy)
- Clock gating, dynamic voltage and frequency scaling (energy efficiency)
- Error handling, correction
- Superscalar processing, multiple instructions (VLIW architecture, Intel Itanium)
- SIMD processing (vector/array processors, GPUs)
- Systolic arrays (Google Tensor Processing Unit)
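Several µarch parameters, e.g. the cache hierarchy, can be queried on a running Linux system:
lscpu -C                      # cache levels, sizes and associativity (ways)
getconf -a | grep -i cache    # cache line sizes and associativity reported by the C library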
Manufacturer
Intel
Xeon generations…
Date | Codename | Cores | Socket | Features |
---|---|---|---|---|
2017 | Skylake | 4-22 | LGA 3647 | 6xDDR4-2666 |
2019 | Cascade Lake | 4-28 | LGA 3647 | 6xDDR4-2933 |
2020 | Cooper Lake | 16-28 | LGA 4189 | 6xDDR4-3200 |
2021 | Ice Lake | 8-40 | LGA 4189 | 8xDDR4-3200, PCIe 4.0 |
2022 | Sapphire Rapids | -56 | LGA 4677 | HBM2e, DDR5, PCIe 5.0, CXL 1.1 |
2023 | Emerald Rapids | -60 | LGA 4677 | |
2024 | Sierra Forest | -144E | | |
^ | Granite Rapids | ?P | | DDR5-8800 |
2025 | Clearwater Forest | ?E | | |
- Sierra Forest & Granite Rapids introduce E-cores & P-cores (efficiency & performance)
- …as distinct product lines
Hybrid CPU-GPU…
- …code-named “Falcon Shores”
- XPU…X is a variable…denotes multiple kinds of compute
- …first half of 2024
- …20A process from Intel Foundry Services
- …Xeon SP socket (like “Granite Rapids” CPUs)
Intel on Demand (introduction with Sapphire Rapids)…
- …software-defined silicon (SDSi) service
- …optional service
- …act as a “try-before-you-buy program”
- …option to…
- …select fully featured
- …pick and choose features
- …two modes…
- …activation model…enable features through a one-time activation
- …state information…shared with Intel…SDSi-enabled data-center
- …consumption model…through as-a-service offerings
AMD
x86 processors…
Date | CPU-Family | Architecture |
---|---|---|
1996-1997 | K5 | x86 |
1997-1998 | K6 | x86 |
1999-2002 | K7 | x86 |
2003-2014 | K8 | x86-64 |
2007-2013 | K10 | x86-64 |
2011-2017 | Bulldozer | x86-64 |
2017-present | Zen | x86-64 |
Brand names…
- Desktop/Workstation…
- Athlon (2001-2019)
- Ryzen (2017-present)…high-end Ryzen Threadripper
- Server…
- Opteron (2003-2012)
- Epyc (2017-present)
Ryzen
Ryzen (desktop-grade) CPU generations…
Date | Series | Arch. | Gen. | Features |
---|---|---|---|---|
2017 | 1000 | Zen | 1 | |
2018 | 2000 | Zen+ | 1 | |
2019 | 3000 | Zen 2 | 2 | |
2020 | 4000 | Zen 2 | 2 | AM4 |
2021 | 5000G | Zen 3 | 3 | AM4, DDR4-3200, PCIe 3.0 |
2022 | 5000 | Zen 3 | 3 | AM4, DDR4-3200 |
2022 | 6000 | Zen 3+ | 3 | |
2022 | 7000 | Zen 4 | 4 | AM5, DDR5-5600, PCIe 5.0 |
2024 | 8000G | Zen 4 | 4 | AM5, DDR5-5600, PCIe 4.0 |
Sockets…
- AM4 for Ryzen 4000,5000
- AM5 for Ryzen 7000
Chipsets…
- Chipset 300-series… 1st, 2nd, 3rd Gen CPUs
- Chipset 400-series… 1st, 2nd, 3rd Gen CPUs
- Chipset 500-series… 2nd, 3rd, 4th Gen CPUs
Chipset classes, X & B support overclocking
- Premium X{3,4,5,6}70
- Midrange B{3,4,5,6}50
- Entry-level A{3,4,5,6}20
Epyc
Epyc (server-grade) CPU generations
Date | Series | Arch. | Cores | Socket | Features |
---|---|---|---|---|---|
2017 | 7001 Naples | Zen | 32 | SP3 | DDR4-2666, PCIe3 |
2019 | 7002 Rome | Zen 2 | 64 | SP3 | DDR4-3200, PCIe4 |
2021 | 7003 Milan | Zen 3 | 64 | SP3 | DDR4-3200, PCIe4 |
2023 | 8004 Siena | Zen 4c | 64 | SP6 | DDR5-4800, PCIe5, CXL 1.1 |
2022 | 9004 Genoa | Zen 4/4c | 128 | SP5 | DDR5-4800, PCIe5, CXL1.1 |
2024 | Turin | Zen 5 | 128 | | |
2025 | Venice | | | | |
Naming Convention
EPYC 9554P
     ||||`---- feature modifier
     |||`----- generation
     ||`------ performance
     |`------- core count
     `-------- product series
Fabrication
- Process node…
- …manufacturing process and design of a CPU made through lithography
- …nm (nanometer) used to measure the size of the transistors
- Lower nm…
- …more power efficient
- …lower cooling requirements
- …faster transistor switching
- …higher transistor density
- 14nm, 7nm, etc. …primarily marketing terms …refer to improved generations of chips
Foundries…
- In order of market share in 2023…
- 55% TSMC (Taiwan)
- 13% Samsung (Korea)
- 7% Globalfoundries (USA)
- 5% SMIC (China)
- EU Chips Act (2022/02) …strengthen semiconductor production in the EU
Chiplets …small, modular chips …combined into a system-on-chip (SoC)
- …used in a chiplet-based architecture
- …increases design flexibility …reduce production cost
- …can improve performance …reduce power consumption
- UCIe (Universal Chiplet Interconnect Express)
- …standard pushed by Intel, AMD & Samsung
- …integration of chiplets from different manufacturers
- Modern CPUs composed of separate modules…
- …compute core, memory controller, PCIe bus…
- …modules typically built in different fabrication technologies
Configuration
NUMA
Non-Uniform Memory Access (NUMA)
- Multiple processors, collectively called a node (aka cell, zone), are physically grouped on a socket.
- Each node has high speed access to a local dedicated memory bank.
- An interconnect bus provides connections between nodes, so that all CPUs can still access all memory
- There is a performance penalty for processors accessing non-local memory.
/sys/devices/system/node contains information about the NUMA nodes in the system and the relative distances between those nodes
dnf install -y hwloc numactl # install NUMA inspection tools (Fedora/RHEL)
numactl --hardware # examine the NUMA layout
lstopo # show memory and CPU topology
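A process can be pinned to a single node to avoid the remote-access penalty (node number and program name are illustrative):
numactl --cpunodebind=0 --membind=0 ./program   # run on node-0 CPUs with node-0 memory only
numastat -p $(pidof program)                    # per-node memory usage of the running process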
Frequency Scaling
Dynamic Frequency Scaling (aka CPU throttling)
- CPU support: Intel SpeedStep, AMD Cool’n’Quiet
- Note that firmware may configure frequency and thermal management
- Lower clock speed results in a slower CPU consuming less energy
- Frequency scaling governors in the kernel support:
- CPU frequency/voltage mappings
- Upper/lower frequency limits
- Strategies to switch between mappings
watch 'grep "cpu MHz" /proc/cpuinfo' # monitor cpu speed
cpupower frequency-info # show throttling configuration
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor # current power scheme for the CPU
cpupower frequency-set -g <governor> # activate a particular power scheme
cpufreq-info # show throttling configuration
/etc/default/cpufrequtils # power scheme configuration
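The per-CPU cpufreq interface in sysfs provides the same information directly, e.g. for the first CPU:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors  # governors offered by the driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq             # current frequency in kHz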
### deprecated with Linux kernel <2.3.36
grep throttling /proc/acpi/processor/CPU*/info # show state of throttling control
grep -e ^active -e \*T /proc/acpi/processor/CPU*/throttling # active configuration if enabled