Central Processing Unit (CPU)

Hardware
Published

April 29, 2014

Modified

February 20, 2024

Architecture

A computer is a machine to process information (data):

  • A processor (typically an integrated circuit) performs operations on data
  • It processes input information to generate the desired output information
  • Input/output data is loaded from and stored to memory, a data storage area
  • Electronic devices operate on binary signals (electricity on/off)
  • Binary expressed by two symbols, 0 and 1 ➜ binary digits (bits)
  • Numbers, text represented as binary patterns ➜ combinations of zeros and ones

Example binary patterns (ASCII codes stored in 8 bits):

Binary Character
0011 0000 0 (zero)
0011 0001 1 (one)
0011 0010 2 (two)
0100 0001 A
0100 0010 B
0100 0011 C
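
The bit patterns of ASCII characters can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `bit_pattern` is made up for illustration.

```python
def bit_pattern(char: str) -> str:
    """Return the 8-bit ASCII pattern of a single character, grouped in nibbles."""
    code = ord(char)              # numeric character code, e.g. 65 for "A"
    if code > 127:
        raise ValueError("not an ASCII character")
    bits = format(code, "08b")    # zero-padded to 8 binary digits
    return f"{bits[:4]} {bits[4:]}"

for ch in "012ABC":
    print(bit_pattern(ch), ch)
```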

Machine Code

Machine code (machine language) ➜ instructions executed by a processor

  • Instruction ➜ operation code (opcode), operand
  • Operand ➜ (memory address to the) data to operate on
  • Opcodes & operands encoded as binary code
  • Machine code program ➜ sequence of instructions (opcodes & operands)
  • Assembly language
    • Symbolic representation of machine instructions
    • Symbolic names for opcodes ➜ mnemonic codes
  • Assembler ➜ translates assembly language into machine code

Example machine operation codes (Intel 8085):

Opcode (binary) Mnemonic Description
1000 0111 ADD Add contents of register to accumulator
0011 1010 LDA Load data from memory address
0011 0010 STA Store data to memory address
0111 1001 MOV Move data between registers
1100 0011 JMP Jump to memory address
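
What an assembler does with these opcodes can be sketched as a toy translator. It covers only the five table entries and assumes that 1000 0111 encodes `ADD A`, that 0111 1001 encodes `MOV A,C`, and that the address instructions store the low address byte first. The function name `assemble` and the sample program are illustrative only.

```python
# Opcodes from the table above. On the 8085, 0x87 encodes "ADD A" and
# 0x79 encodes "MOV A,C"; LDA, STA and JMP are followed by a 16-bit address.
OPCODES = {"ADD A": 0x87, "MOV A,C": 0x79, "LDA": 0x3A, "STA": 0x32, "JMP": 0xC3}

def assemble(line: str) -> bytes:
    """Translate one assembly line into machine code bytes (toy subset)."""
    line = line.strip().upper()
    if line in OPCODES:                      # one-byte register instructions
        return bytes([OPCODES[line]])
    mnemonic, operand = line.split()
    address = int(operand.rstrip("H"), 16)   # "2050H" -> 0x2050
    # opcode first, then the address with its low byte first (little-endian)
    return bytes([OPCODES[mnemonic], address & 0xFF, address >> 8])

program = ["LDA 2050H", "ADD A", "STA 2051H"]
machine_code = b"".join(assemble(line) for line in program)
print(machine_code.hex(" "))
```

The sample program loads a byte, doubles it by adding the accumulator to itself, and stores the result at the next address.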

Object Code

Object code is a sequence of instructions in machine code generated by an assembler

  • Executable programs are typically built from reusable code fragments (sub-programs, functions, modules)
    • Fragments get individually compiled (translated) into object code
    • A complete program is then built by combining various object code fragments
    • Individual fragments are referenced using a symbol (function name)
  • Object file (relocatable format machine code)
    • File format used to store object code and related data (e.g. ELF)
    • Structured as separated segments/sections of different types of data
  • A linker program combines object code to generate executable machine code
    • Relocation assigns load addresses to various object code fragments
    • A linker resolves symbols using the assigned memory locations and patches the calling object code to point to that location (call instruction reference)
  • A loader places executable machine code into (main) memory and prepares it for execution
    • Allocates regions in memory corresponding to segments in the machine code
    • A program loader is part of a computer operating system (starts the program once it is loaded)
    • Microcontrollers typically do not have a loader; instead the executable machine code is started directly from memory
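
Relocation and symbol resolution can be sketched as a toy linker. This is an illustration under simplifying assumptions, not how a real linker is structured: fragments are placed back to back, addresses are 16 bits wide, and the class `ObjectFragment`, the function `link` and the base address are invented for the example. The code bytes use 8085 opcodes (CD = CALL, 76 = HLT, 87 = ADD A, C9 = RET).

```python
from dataclasses import dataclass, field

@dataclass
class ObjectFragment:
    name: str
    code: bytearray                                   # machine code with placeholder addresses
    defines: dict = field(default_factory=dict)       # symbol -> offset inside this fragment
    relocations: list = field(default_factory=list)   # (offset of 2-byte address slot, symbol)

def link(fragments, base=0x1000):
    """Assign load addresses, resolve symbols, and patch call sites."""
    symbols = {}
    address = base
    for frag in fragments:                   # relocation: place fragments back to back
        for sym, offset in frag.defines.items():
            symbols[sym] = address + offset
        address += len(frag.code)
    image = bytearray()
    for frag in fragments:                   # symbol resolution: patch each reference
        code = bytearray(frag.code)
        for offset, sym in frag.relocations:
            code[offset:offset + 2] = symbols[sym].to_bytes(2, "little")
        image += code
    return bytes(image), symbols

main = ObjectFragment("main", bytearray([0xCD, 0x00, 0x00, 0x76]),
                      defines={"main": 0}, relocations=[(1, "double")])
lib = ObjectFragment("lib", bytearray([0x87, 0xC9]), defines={"double": 0})
image, symbols = link([main, lib])
print(hex(symbols["double"]), image.hex(" "))
```

The placeholder address after the CALL opcode in `main` is overwritten with the load address assigned to `double`.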

Object Files

Object files contain five kinds of information

  • Header: Metadata, code size, format specification, etc.
  • Object code: Binary instructions and data generated by an assembler(/compiler)
  • Relocation: List of places in the object code a linker needs to change the address of
  • Symbols: Global symbols defined in the module, symbols to be imported from other modules
  • Debugging: Information about the object code not needed for linking but useful for debugging

ELF (Executable and Linkable Format) files come in three flavors:

  • Relocatable: Created by assemblers (or compilers), needs to be processed by a linker
  • Executable: Address relocation done, symbols resolved (except for shared library symbols), ready for execution
  • Shared object: Shared libraries including symbol information for linkers and executable code for run-time
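
The flavor of an ELF file is recorded in the `e_type` field of its header. The sketch below reads that field from the first 18 bytes; the function name `elf_flavor` and the hand-built sample header are assumptions made for the example.

```python
import struct

ELF_TYPES = {1: "relocatable", 2: "executable", 3: "shared object"}

def elf_flavor(header: bytes) -> str:
    """Classify an ELF file from the first 18 bytes of its header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field directly after the 16-byte e_ident array
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ELF_TYPES.get(e_type, f"other ({e_type})")

# Hand-built header prefix of a 64-bit little-endian relocatable file
sample = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9) + struct.pack("<H", 1)
print(elf_flavor(sample))
```

For a real file, pass `open(path, "rb").read(18)`. Note that position-independent executables also carry the shared object type, so they are reported as "shared object" here.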

Processor

A processing unit, aka CPU (Central Processing Unit):

  • Active part of a computer ➜ datapath and control
    • Control ➜ Commands the datapath, memory, and input/output devices according to machine instructions
    • Datapath ➜ Components of a processor that perform arithmetic operations
  • Processors ➜ fetch (read) an instruction from memory before executing it

Separation of processor and memory distinguishes programmable computers

Control-flow Architecture

Stored program computer: instructions and data stored in memory

  • Harvard architecture: Separate memory for data and instructions
    • Two sets of address/data buses between processor and memory
    • Allows simultaneous memory fetches
  • Modified Harvard architecture: Separate memory for data and instructions
    • Instruction memory can be used to store data
    • Two pieces of data can be loaded in parallel
  • Von Neumann architecture: Single memory holds data and instructions
    • Single set of address/data buses between processor and memory
    • Values in memory interpreted depending on a control signal
  • Current instruction identified by the instruction pointer (program counter)
  • Sequential instruction processing (fetch, execute, and complete) one at a time
  • The instruction pointer is advanced sequentially except for control transfer
  • Instructions executed in control flow order
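
Sequential processing with an instruction pointer can be sketched as a small interpreter loop. The instruction set here is hypothetical (mnemonics borrowed from the 8085 table, but every operand is a memory index), and the function name `run` is made up for the example.

```python
def run(program, memory):
    """Execute instructions one at a time in control-flow order."""
    pc, acc = 0, 0                      # instruction pointer and accumulator
    while True:
        opcode, operand = program[pc]   # fetch
        pc += 1                         # advance sequentially by default
        if opcode == "LDA":             # execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STA":
            memory[operand] = acc
        elif opcode == "JMP":           # control transfer overrides the pointer
            pc = operand
        elif opcode == "HLT":
            return memory

program = [("LDA", 0), ("ADD", 1), ("JMP", 4), ("STA", 3), ("STA", 2), ("HLT", None)]
print(run(program, [2, 3, 0, 0]))
```

The jump skips the instruction at index 3, so only memory cell 2 receives the sum.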

Data-flow Architecture

  • Instructions executed based on the availability of input arguments, data flow order
  • Conceptually no instruction pointer required since execution is based on data dependencies
  • Inherently more parallel with the potential to execute many instructions at the same time
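
Data-flow order can be illustrated with a toy scheduler that fires every instruction whose inputs are available. This is a sketch only: the function `run_dataflow` and the graph encoding are invented, and each "round" stands in for instructions that could execute in parallel.

```python
import operator

def run_dataflow(instructions, values):
    """Fire every instruction whose inputs are available; return firing rounds."""
    pending = dict(instructions)         # result name -> (function, input names)
    rounds = []
    while pending:
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("unresolvable data dependency")
        # compute this round's results from values available before the round
        results = {name: pending[name][0](*(values[i] for i in pending[name][1]))
                   for name in ready}
        values.update(results)
        for name in ready:
            del pending[name]
        rounds.append(ready)
    return rounds

# (a + b) * (c - d): the addition and subtraction do not depend on each other
graph = {
    "product": (operator.mul, ("sum", "diff")),
    "sum": (operator.add, ("a", "b")),
    "diff": (operator.sub, ("c", "d")),
}
values = {"a": 1, "b": 2, "c": 7, "d": 3}
print(run_dataflow(graph, values), values["product"])
```

The listing order of the instructions does not matter: `product` is written first but fires last, once `sum` and `diff` exist.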

Control- vs data-flow trade-offs:

  • Ease of programming
  • Ease of compilation
  • Extraction of parallelism (performance)
  • Hardware complexity

Instruction Set Architecture

The Instruction Set Architecture (ISA) specifies how a programmer sees instructions to be executed:

  • Defines an interface between software and hardware enabling the implementation of programs
  • Modern ISAs are mostly control-flow architectures: x86, ARM, MIPS, SPARC, POWER
  • ISAs have a very long lifetime (compared to µarch) staying backwards-compatible while being extended with additional instructions

The ISA includes all functionality exposed to the programmer:

  • Instructions: Opcodes, addressing modes, data types, registers, condition code…
  • Memory: Address space, alignment, virtual memory…
  • Interrupt/exception handling, access control, priority/privileges
  • Task/thread management, power & thermal management
  • Multi-threading & multi-processing support

ISA Types:

  • Reduced Instruction Set Computer (RISC)
    • Compact, uniform instruction size ➜ easier to decode ➜ facilitates pipelines
    • Complexity implemented as series of smaller instructions
    • More lines of code ➜ bigger memory footprint
    • Allows effective compiler optimization
  • Complex Instruction Set Computer (CISC)
    • Extremely specific instructions (doing as much work as possible)
    • Instructions not uniform in size ➜ difficult to decode
    • Pipelining requires breaking instructions down into smaller components at the processor level
    • High code density
    • Complex processor hardware
  • Very long instruction word (VLIW)
    • Execute multiple instructions concurrently, in parallel
    • Instruction Level Parallelism (ILP)
    • Compiler bundles multiple instructions that can be executed in parallel into a single long instruction

Microarchitecture

The Microarchitecture (µarch) is the implementation of the ISA under specific design constraints and goals:

  • The microprocessor is the physical representation (circuits) of the ISA and µarch
  • Example: add instruction (ISA) vs adder implementation (µarch) [bit serial, ripple carry, carry lookahead, etc.]
  • Example: x86 ISA has many implementations - Intel [2,3,4]86, Intel Pentium [Pro, 4], Intel Core, AMD…
  • Design points: cost, performance, power consumption, reliability, time to market…
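
The add example can be made concrete: the ISA fixes what an addition returns, while the µarch chooses how to compute it. The sketch below models a ripple-carry adder in software and compares it with the wrap-around sum that a hypothetical 8-bit `add` instruction would specify. Both function names are invented for the example.

```python
def ripple_carry_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers bit by bit, the way a ripple-carry adder does."""
    result, carry = 0, 0
    for i in range(width):                     # the carry ripples from bit 0 upwards
        x, y = (a >> i) & 1, (b >> i) & 1
        total = x ^ y ^ carry                  # full adder: sum bit
        carry = (x & y) | (carry & (x ^ y))    # full adder: carry out
        result |= total << i
    return result                              # carry out of the top bit is dropped

def isa_add(a: int, b: int, width: int = 8) -> int:
    """Reference semantics of a hypothetical `add` instruction: wrap-around sum."""
    return (a + b) % (1 << width)

print(ripple_carry_add(200, 100), isa_add(200, 100))
```

A carry-lookahead adder would compute the carries differently (and faster in hardware) but must return the same results.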

The µarch defines anything done in hardware and can execute instructions in any order (e.g. data-flow order) as long as it obeys the semantics specified by the ISA:

  • Pipeline instruction execution (Intel 486)
  • Multiple instructions at a time (Intel Pentium)
  • Out-of-order execution (Intel Pentium Pro)
  • Speculative execution, branch prediction, prefetching
  • Memory access scheduling policy, cache (levels, size, associativity, replacement policy)
  • Clock gating, dynamic voltage and frequency scaling (energy efficiency)
  • Error handling, correction
  • Superscalar processing, multiple instructions (VLIW architecture, Intel Itanium)
  • SIMD processing (vector/array processors, GPUs)
  • Systolic arrays (Google tensor-processor)

Manufacturer

Intel

Xeon generations…

Date  Codename           Cores   Socket    Features
2017  Skylake            4-22    LGA 3647  6xDDR4-2666
2019  Cascade Lake       4-28    LGA 3647  6xDDR4-2933
2020  Cooper Lake        16-28   LGA 4189  6xDDR4-3200
2021  Ice Lake           8-40    LGA 4189  8xDDR4-3200, PCIe 4.0
2022  Sapphire Rapids    -56     LGA 4677  HBM2e, DDR5, PCIe 5.0, CXL 1.1
2023  Emerald Rapids     -60     LGA 4677
2024  Sierra Forest      -144E
2024  Granite Rapids     ?P                DDR5-8800
2025  Clearwater Forest  ?E
  • Sierra Forest introduces E- & P-cores (efficiency & performance)
    • …distinct product lines

Hybrid CPU-GPU…

  • …code-named “Falcon Shores”
  • XPU…X is a variable…denotes multiple kinds of compute
  • …first half of 2024
  • …20A processes from Intel Foundry Services
  • …Xeon SP socket (like “Granite Rapids” CPUs)

Intel on Demand (introduction with Sapphire Rapids)…

  • …software-defined silicon (SDSi) service
  • …optional service
    • …act as a “try-before-you-buy program”
    • …option to…
      • …select fully featured
      • …pick and choose features
  • …two modes…
    • …activation model…enable features through a one-time activation
      • …state information…shared with Intel…SDSi-enabled data-center
    • …consumption model…through as-a-service offerings

AMD

x86 processors….

Date CPU-Family Architecture
1996-1997 K5 x86
1997-1998 K6 x86
1999-2002 K7 x86
2003-2014 K8 x86-64
2007-2013 K10 x86-64
2011-2017 Bulldozer x86-64
2017-present Zen x86-64

Brand names…

  • Desktop/Workstation…
    • Athlon (2001-2019)
    • Ryzen (2017-present)…high-end Ryzen Threadripper
  • Server…
    • Opteron (2003-2012)
    • Epyc (2017-present)

Ryzen

Ryzen (desktop-grade) CPU generations…

Date  Series  Arch.   Gen.  Features
2017  1000    Zen     1
2018  2000    Zen+    1
2019  3000    Zen 2   2
2020  4000    Zen 2   2     AM4
2021  5000G   Zen 3   3     AM4, DDR4-3200, PCIe 3.0
2022  5000    Zen 3   3     AM4, DDR4-3200
2022  6000    Zen 3+  3
2022  7000    Zen 4   4     AM5, DDR5-5600, PCIe 5.0
2024  8000G   Zen 4   4     AM5, DDR5-5600, PCIe 4.0

Sockets…

  • AM4 for Ryzen 4000,5000
  • AM5 for Ryzen 7000

Chipsets…

  • Chipset 300-series, 1st,2nd,3rd Gen CPUs
  • Chipset 400-series… 1st,2nd,3rd Gen CPUs
  • Chipset 500-series, 2nd,3rd,4th Gen CPUs

Chipset classes, X & B support overclocking

  • Premium X{3,4,5,6}70
  • Midrange B{3,4,5,6}50
  • A{3,4,5,6}20

Epyc

Epyc (server-grade) CPU generations

Date  Series  Codename  Arch.     Cores  Socket  Features
2017  7001    Naples    Zen       32     SP3     DDR4-2666, PCIe3
2019  7002    Rome      Zen 2     64     SP3     DDR4-3200, PCIe4
2021  7003    Milan     Zen 3     64     SP3     DDR4-3200, PCIe4
2022  9004    Genoa     Zen 4/4c  128    SP5     DDR5-4800, PCIe5, CXL 1.1
2023  8004    Siena     Zen 4c    64     SP6     DDR5-4800, PCIe5, CXL 1.1
2024          Turin     Zen 5     128
2025          Venice

Naming Convention

EPYC  9554P
      ||||`---- feature modifier
      |||`----- generation
      ||`------ performance
      |`------- core count
      `-------- product series
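
The naming scheme can be expressed as a small parser. This sketch follows the diagram literally and assumes a four-digit model number with an optional letter suffix; the function name `parse_epyc_model` is made up, and the digits are returned as raw codes without decoding them.

```python
def parse_epyc_model(model: str) -> dict:
    """Split an EPYC model number such as "9554P" into the fields shown above."""
    digits, modifier = model[:4], model[4:]
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError(f"unexpected model number: {model!r}")
    return {
        "product series": digits[0],
        "core count": digits[1],       # a code digit, not the literal number of cores
        "performance": digits[2],
        "generation": digits[3],
        "feature modifier": modifier,  # empty string if the model has no suffix
    }

print(parse_epyc_model("9554P"))
```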

Fabrication

  • Process node…
    • …manufacturing process and design of a CPU made through lithography
    • …nm (nanometer) used to measure the size of the transistor
  • Lower nm…
    • …more power efficient
    • …less cooling requirements
    • …faster transistor switching
    • …higher transistor density
  • 14nm, 7nm, etc …primarily marketing terms …refer to improved generation of chips

Foundries…

  • In order of market share in 2023…
    • 55% TSMC (Taiwan)
    • 13% Samsung (Korea)
    • 7% GlobalFoundries (USA)
    • 5% SMIC (China)
  • EU Chips Act (2022/02) …strengthen semiconductor production in the EU

Chiplets …small, modular chips …combined to system-on-chip (SoC)

  • …used in a chiplet-based architecture
  • …increases design flexibility …reduce production cost
  • …can improve performance …reduce power consumption
  • UCIe (Universal Chiplet Interconnect Express)
    • …standard pushed by Intel, AMD & Samsung
    • …integration of chiplets from different manufacturers
  • Modern CPUs composed of separate modules…
    • …compute core, memory controller, PCIe bus…
    • …modules typically built in different fabrication technologies

Configuration

NUMA

Non-Uniform Memory Access (NUMA)

  • Multiple processors, collectively called a node (aka cell, zone), are physically grouped on a socket.
  • Each node has high speed access to a local dedicated memory bank.
  • An interconnect bus provides connections between nodes, so that all CPUs can still access all memory
  • There is a performance penalty for processors accessing non-local memory.
  • /sys/devices/system/node contains information about NUMA nodes in the system, and the relative distances between those nodes
dnf install -y hwloc numactl
numactl --hardware               # examine the NUMA layout 
lstopo                           # show memory and CPU topology
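
The relative distances can also be read directly from sysfs. The sketch below assumes the usual Linux layout, in which each `nodeN` directory holds a `distance` file with one integer per node; the function name `numa_distances` is made up for the example.

```python
from pathlib import Path

NODE_DIR = Path("/sys/devices/system/node")

def numa_distances(node_dir: Path = NODE_DIR) -> dict:
    """Map each NUMA node number to its list of distances to all nodes."""
    distances = {}
    for path in node_dir.glob("node[0-9]*"):
        distance_file = path / "distance"
        if distance_file.is_file():
            node = int(path.name[len("node"):])
            # one space-separated integer per node, the local node comes out smallest
            distances[node] = [int(d) for d in distance_file.read_text().split()]
    return dict(sorted(distances.items()))

if NODE_DIR.is_dir():
    for node, row in numa_distances().items():
        print(f"node{node}: {row}")
```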

Frequency Scaling

Dynamic Frequency Scaling (aka CPU throttling)

  • CPU support: Intel SpeedStep, AMD Cool’n’Quiet
  • Note that firmware may configure frequency and thermal management
  • Lower clock speed results in a slower CPU consuming less energy
  • Frequency scaling governors in the kernel support:
    • CPU frequency/voltage mappings
    • Upper/lower frequency limits
    • Strategies to switch between mappings
watch grep \"cpu MHz\" /proc/cpuinfo                         # monitor cpu speed
cpupower frequency-info                                      # show throttling configuration
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor    # current power scheme for the CPU
cpupower frequency-set -g <governor>                         # activate a particular power scheme
cpufreq-info                                                 # show throttling configuration
/etc/default/cpufrequtils                                    # power scheme configuration
### deprecated with linux kernel <2.3.36
grep throttling /proc/acpi/processor/CPU*/info               # show state of throttling control
grep -e ^active -e \*T /proc/acpi/processor/CPU*/throttling  # active configuration if enabled
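
The per-CPU governor files can be summarized with a short script. This is a sketch that assumes the cpufreq sysfs layout used in the commands above; the function name `governor_summary` is made up, and on systems without cpufreq support it simply reports nothing.

```python
from collections import Counter
from pathlib import Path

def governor_summary(cpu_dir: Path = Path("/sys/devices/system/cpu")) -> Counter:
    """Count how many CPUs use each frequency scaling governor."""
    files = cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_governor")
    return Counter(f.read_text().strip() for f in files)

print(dict(governor_summary()))
```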