Computer Memory Basics
Memory systems are a major factor in determining the performance of a computer:
- Programs exhibit temporal locality (tendency to reuse data accessed recently)…
- …and spatial locality (tendency to reference data close to recently used data)
- Memory hierarchies take advantage of temporal/spatial locality while moving data…
- …between fast/small upper level memory and big/slow lower level memory
Addresses
- Memory cell (electronic circuit)…stores one bit of binary data
- Memory address…reference to a specific memory location
- Usually several memory cells share a single address (e.g. 8 bits/1 byte)
- The address width limits the maximum addressable memory
- The address width is typically a multiple of eight (8,16,32,64 bits)
| Address width | Address locations |
|---|---|
| 8bit | 256 (2^8) |
| 16bit | 65536 (2^16) |
| 32bit | 4294967296 (2^32) |
| 64bit | 1.844674407×10^19 (2^64) |
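The table follows from the rule that an address of n bits can name 2^n distinct locations. A minimal Python sketch, assuming byte-addressable memory (one address per byte); the function names are illustrative only:

```python
def addressable_locations(width_bits: int) -> int:
    """Number of distinct addresses an address of the given width can encode."""
    return 2 ** width_bits


def max_memory_gib(width_bits: int, bytes_per_location: int = 1) -> float:
    """Maximum addressable memory in GiB.

    Assumes each address names `bytes_per_location` bytes
    (1 = byte-addressable memory, an assumption of this sketch).
    """
    return addressable_locations(width_bits) * bytes_per_location / 2 ** 30


for width in (8, 16, 32, 64):
    print(f"{width:2d}-bit: {addressable_locations(width)} locations")
```

Under that assumption `max_memory_gib(32)` gives 4 GiB, the familiar limit of 32-bit addressing.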
- Memory is a collection of various memory locations
- Each location has a unique address and can be accessed in any order (in an equal amount of time)
- Memory access means selection and data read/write from a specific memory location
- Memory controller…manages data flow (read/write) between main memory and processor
- Memory address bus…connects the main memory to the memory controller
Programmers see virtual memory provided by the system (OS + hardware):
- Simplified abstraction of memory for the program providing the illusion of “infinite” memory
- The system manages the physical memory space transparently to the programmer by mapping virtual memory addresses to the limited physical memory
- Example for the programmer/(micro) architecture trade-off
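The mapping from virtual to physical addresses can be illustrated with a toy page table. This is only a sketch: the page size, the table contents, and the `translate` helper are assumptions made for illustration, and real systems do this in hardware (MMU) with multi-level tables.

```python
PAGE_SIZE = 4096  # bytes per page; a common size, assumed here

# Hypothetical page table: virtual page number -> physical frame number
page_table = {0: 5, 1: 2, 2: 9}


def translate(virtual_address: int) -> int:
    """Map a virtual address to a physical address via the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # A real system would raise a page fault and let the OS map the page
        raise LookupError(f"page fault: virtual page {page_number} not mapped")
    return page_table[page_number] * PAGE_SIZE + offset


print(hex(translate(0x1234)))  # virtual page 1 maps to frame 2
```

The offset within the page is unchanged by the translation; only the page number is replaced by a frame number.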
CPU Cache
…used to avoid repeated access to main memory (typically DRAM):
- Automatically managed memory hierarchy (Level 1,2,3) (typically SRAM)
- Stores frequently used data and is commonly on-die with an associated CPU
Blocks
Memory logically divided into fixed-size blocks…
- …block (or line)
- …minimum unit of information
- …either present or not present in a cache
- …block maps to a location in cache
- …determined by the index bits in the address
- Cache…
- …hit …use cached data instead of accessing next level memory
- …miss …data not cached, read block from next level memory
- Hit rate …fraction of memory accesses that result in cache hits
- Miss rate …(1 - hit rate) …fraction of memory accesses not found in the cache
- Hit time…
- …time to access a level of the memory hierarchy
- …includes time to determine hit/miss
- Miss penalty…
- …time to replace a block in the upper level…
- …with a block from the lower level
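The index bits mentioned above come from splitting an address into tag, index, and block offset. A minimal sketch, assuming the block size and the number of sets are powers of two; the cache geometry used in the example is hypothetical:

```python
def split_address(address: int, block_size: int, num_sets: int):
    """Split an address into (tag, index, offset).

    Assumes block_size and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = address & (block_size - 1)                # byte within the block
    index = (address >> offset_bits) & (num_sets - 1)  # selects the cache set
    tag = address >> (offset_bits + index_bits)        # identifies the block
    return tag, index, offset


# Hypothetical cache geometry: 64-byte blocks, 128 sets
tag, index, offset = split_address(0x12345, block_size=64, num_sets=128)
print(tag, index, offset)
```

Two addresses with the same index but different tags compete for the same cache location.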
AMAT (Average Memory Access Time) …metric to analyze memory system performance
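AMAT is commonly computed as hit time + miss rate × miss penalty, using the terms defined above. A small sketch of that formula; the numbers in the example are made up for illustration, not measurements:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time for a single cache level."""
    return hit_time + miss_rate * miss_penalty


def amat_multilevel(levels, memory_time: float) -> float:
    """AMAT for several cache levels.

    `levels` is a list of (hit_time, miss_rate) tuples ordered from the
    level closest to the processor outward; the miss penalty of each
    level is the AMAT of the level below it.
    """
    penalty = memory_time
    for hit_time, miss_rate in reversed(levels):
        penalty = amat(hit_time, miss_rate, penalty)
    return penalty


# Illustrative values in nanoseconds
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))
print(amat_multilevel([(1, 0.1), (10, 0.5)], memory_time=100))
```

The example shows why even a small miss rate matters: the miss penalty is usually far larger than the hit time.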
Locality
- …ensures data required by processor kept in fast(er) level(s):
- …recently accessed and adjacent data
- …automatically in fast memory close to processor
- …temporal locality…
- …based on repetitive computations
- …e.g. loops …referencing the same memory
- …spatial locality…
- …based on a probability of related computations
- …referencing a cluster of memory (e.g. array)
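The effect of spatial and temporal locality can be sketched by counting how often an access pattern moves to a different block. This is a deliberately simplified model (only the most recently used block is kept); the block size and element size are assumptions:

```python
def block_fetches(addresses, block_size=64):
    """Count how often consecutive accesses move to a different block.

    Simplified model: only the most recently used block is kept, so every
    change of block counts as one fetch from the next memory level.
    """
    fetches = 0
    current_block = None
    for address in addresses:
        block = address // block_size
        if block != current_block:
            fetches += 1
            current_block = block
    return fetches


ELEMENT_SIZE = 8  # bytes per array element (assumed)
N = 1024

sequential = [i * ELEMENT_SIZE for i in range(N)]    # a[0], a[1], a[2], ...
strided = [i * 64 * ELEMENT_SIZE for i in range(N)]  # every 64th element
repeated = [0] * N                                   # same location in a loop

print(block_fetches(sequential), block_fetches(strided), block_fetches(repeated))
```

Sequential access reuses each fetched block for eight elements, the strided pattern needs a new block for every access, and the repeated access needs only one fetch.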
Associativity
Caches fall into three categories:
- Direct-mapped
- Each memory location maps into one and only one cache block
- Fast, simple, inefficient
- Most prone to conflict misses
- Fully associative
- Any memory location can map to anywhere in the cache
- Slow, complex, efficient
- No conflict misses (compulsory and capacity misses remain)
- N-way set associative
- Groups of blocks (“sets”) form associative pools
- A compromise between simplicity and efficiency
- Reduces conflict misses compared to direct-mapped caches
Types of cache misses:
- Compulsory (start miss): First access to a block, must be brought into the cache
- Capacity: The cache cannot hold all blocks the program needs, so blocks are discarded to free space and must be fetched again later
- Conflict (collision/interference miss): Occurs when several memory locations are mapped to the same cache block
Replacement policy: Heuristic used to select the entry to be replaced by uncached data, e.g. LRU (Least Recently Used)
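The difference between the three cache categories, and the role of LRU replacement, can be illustrated with a tiny simulator. This is a sketch under simplifying assumptions (LRU only, no write handling); the function name and cache geometry are hypothetical:

```python
from collections import OrderedDict


def count_misses(addresses, num_blocks, ways, block_size=64):
    """Simulate an LRU set-associative cache and return the miss count.

    ways=1 models a direct-mapped cache, ways=num_blocks a fully
    associative one.
    """
    num_sets = num_blocks // ways
    # One OrderedDict per set; key order tracks recency (oldest first)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block = address // block_size
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)       # hit: mark most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[block] = True
    return misses


# Blocks 0 and 4 map to the same location in a 4-block direct-mapped cache
trace = [0, 4 * 64] * 8

for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses(trace, num_blocks=4, ways=ways)} misses")
```

With this trace the direct-mapped configuration misses on every access (conflict misses), while the 2-way and fully associative configurations only incur the two compulsory misses.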
lshw
Display the cache/memory hierarchy with lshw:
>>> sudo lshw -C memory -short
H/W path Device Class Description
======================================================
/0/0 memory 64KiB BIOS
/0/400/700 memory 256KiB L1 cache
/0/400/701 memory 1MiB L2 cache
/0/400/704 memory 8MiB L3 cache
/0/1000 memory 12GiB System Memory
/0/1000/0 memory 4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/1000/1 memory 4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/1000/2 memory 4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/1000/3 memory DIMM DDR3 Synchronous [empty]
/0/1000/4 memory DIMM DDR3 Synchronous [empty]
/0/1000/5 memory DIMM DDR3 Synchronous [empty]
lstopo
Show the topology of the system…
- …examine the NUMA topology
- …provide details about processor, caches and memory
- Documentation …https://www.open-mpi.org/projects/hwloc/doc/
Textual rendering…
lstopo-no-graphics -.ascii
Types
Volatile Memory
- SRAM (Static Random Access Memory) - Two cross-coupled inverters store a bit for as long as power is applied
- Faster access (no capacitor), no refresh needed, access time close to cycle time
- Lower density (6-8 transistors per bit), higher cost
- Minimal power to retain data in standby mode
- Manufacturing compatible with logic process, typically integrated with the processor chip
- DRAM (Dynamic Random Access Memory) - Capacitor charge state indicates stored value, cells lose charge over time requiring a refresh
- Slower access (capacitor)
- Higher density (1 transistor + 1 capacitor per bit), lower cost
- Requires periodic refresh (read + write), (costs power, performance, circuitry)
- Manufacturing requires capacitors and logic
- SDRAM (Synchronous DRAM) - Uses a clock to eliminate the time memory and processor need to synchronize
- Bandwidth improved by internal organization into multiple banks each with its own row buffer
- Banks allow simultaneous read/write, called address interleaving
- Fastest version called DDR (Double Data Rate SDRAM), data transfer at rising & falling edges of the clock
Persistent Memory
Persistent Memory (PM, pmem), aka SCM (Storage Class Memory):
- Bridge the access-time gap between DRAM and NAND based flash-storage
- Introducing a third tier in the memory hierarchy
- Connected to the system memory bus (like DRAM DIMMs) via NVDIMMs
- Accessed like volatile memory (processor load/store instructions)
- Change in computing architecture…
| Access time | Description |
|---|---|
| 1ns | processor operation |
| <5ns | read L2 cache |
| 60ns | access volatile memory (DRAM) |
| <<1us | access persistent memory (NVM) |
| 20us | read from flash memory (NAND) |
| 1ms | random write to flash memory |
| <10ms | read/write disk |
| 40s | read tape |
- Requires an [NVM Programming Model][pmem]
- New block and file semantics to applications
- Exposed as memory-mapped file by the operating system
- Persistent memory aware file-system allows DAX (Direct Access) without using (bypass) the system page cache (unlike normal storage-based files)
- Application has direct load/store access to persistence via the MMU
- No interrupts or kernel context switches
- OS (only) flushes CPU caches to get data into the persistence domain
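The memory-mapped access model described above can be sketched with Python's `mmap` module. The example uses an ordinary temporary file as a stand-in; on a real system the file would live on a DAX-mounted persistent-memory file system, which is an assumption this sketch does not verify. On regular storage the mapping still goes through the page cache.

```python
import mmap
import os
import tempfile

SIZE = 4096  # one page; the mapping length must be greater than zero

# Stand-in for a file on a DAX-mounted persistent-memory file system
path = os.path.join(tempfile.mkdtemp(), "pmem-demo.bin")

with open(path, "wb") as f:
    f.truncate(SIZE)              # allocate the backing space

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as region:
        region[0:5] = b"hello"    # plain store into the mapped region
        region.flush()            # ask for the data to reach the backing file

with open(path, "rb") as f:
    stored = f.read(5)
print(stored)
```

After the mapping is set up, reads and writes are ordinary memory operations; no read/write system call is needed per access.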
NVM (Non-Volatile Memory), NVRAM (Non-Volatile RAM):
- RRAM/ReRAM (Resistive Random-Access Memory)
- Uses a dielectric solid-state material aka memristor
- In development by multiple companies…
- Scalable below 30nm, cycle time <10ns
- Others…
- CBRAM (Conductive-Bridging RAM)
- PRAM (Phase-Change Memory)
- MRAM (Magnetoresistive RAM)
- FeRAM (Ferroelectric RAM)
- STTRAM (Spin Torque Transfer RAM)
- SHERAM (Spin Hall Effect RAM)
- CNTRAM (Carbon-nanotube RAM)
- Products:
- 3D XPoint (Intel, Micron), called Intel Optane
DIMM Modules
Memory sold in small boards called DIMM (Dual Inline Memory Module)…
- …typically contains 4-16 DRAM chips
- …normally organized to be 8 bytes wide
- …variants of DIMM slots (e.g. DDR3 or DDR4) have different pin counts
- …ECC (Error-Correcting code) DIMMs have extra circuitry to detect/correct errors
The following table lists common RAM chips and their peak throughput (transfer rate × 8 bytes per module):
| Standard | Chip | GB/s |
|---|---|---|
| SDRAM (1993) | SDR-66 | 0.53 |
| | SDR-133 | 1.07 |
| DDR (1996) | DDR-200 | 1.6 |
| | DDR-266 | 2.13 |
| DDR2 (2003) | DDR2-400 | 3.2 |
| | DDR2-800 | 6.4 |
| DDR3 (2007) | DDR3-1600 | 12.8 |
| | DDR3-1866 | 14.93 |
| DDR4 (2012) | DDR4-2133 | 17 |
| | DDR4-3200 | 25.6 |
| DDR5 (2020) | DDR5-4800 | 38.4 |
| | DDR5-5200 | 41.6 |
| | DDR5-6400 | 51.2 |
| | DDR5-6800 | 54.4 |
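The throughput figures can be derived from the chip name: the number is the transfer rate in MT/s, and a module is normally 8 bytes wide. A minimal sketch of that calculation, assuming decimal gigabytes (1 GB = 10^9 bytes); the function name is illustrative:

```python
def peak_bandwidth_gbs(transfers_mt_s: float, bus_width_bytes: int = 8) -> float:
    """Peak transfer rate in GB/s (1 GB = 10^9 bytes).

    transfers_mt_s: transfer rate in mega-transfers per second,
    e.g. 1600 for DDR3-1600. The default bus width of 8 bytes
    corresponds to a 64-bit module.
    """
    return transfers_mt_s * bus_width_bytes / 1000


for name, rate in [("DDR2-800", 800), ("DDR3-1600", 1600)]:
    print(f"{name}: {peak_bandwidth_gbs(rate)} GB/s")
```

These are theoretical peak values per module; sustained bandwidth in practice is lower.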
NVDIMM
NVDIMM types:
- NVDIMM-F - Flash only paired with DRAM DIMM
- NVDIMM-N - Flash and DRAM together in the same DIMM
- NVDIMM-P - True persistent memory (no DRAM/flash)
Supported modes (use ndctl for management):
- Raw: /dev/pmemN (block device)
  - Default mode after installation
  - Supports file-systems with or without DAX (ext4,xfs)
- Sector: /dev/pmemNs (block device with sector atomicity)
  - Implemented with BTT (Block Translation Table)
  - Guarantees power-fail write atomicity
  - Only supports file-systems without DAX
- Memory: /dev/pmemN (block device supporting device DMA)
  - Supports file-system DAX, recommended over raw mode
  - Requires storing extra “struct page” entries on regular system memory (or persistent memory)
- DAX: /dev/daxN.M (character device supporting DAX)
  - Allows memory allocation/mapping (without the need of a file-system)
  - No interactions with the kernel page cache
  - Character device (does not support a file-system)
  - Requires storing extra “struct page” entries on persistent memory
dmidecode
Display the memory vendor, identification numbers, and type
>>> dmidecode --type memory | egrep "Manufacturer|Serial|Part|Type"
Error Correction Type: Multi-bit ECC
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Manufacturer: Samsung
Serial Number: 35244B2E
Part Number: M393B2G70BH0-YK0
...
Maximum RAM capacity can be checked with dmidecode. The “Maximum Capacity” is the maximum RAM supported by your system, while “Number of Devices” is the number of memory (DIMM) slots available on your computer.
>>> dmidecode -t 16 | egrep "Capacity|Devices"
Maximum Capacity: 384 GB
Number Of Devices: 32
Check the memory support matrix for the system board to understand the correct DIMM distribution and their corresponding memory frequencies.
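As a rough cross-check, the two dmidecode fields can be combined into an average capacity per slot. The sketch below parses the sample output shown above; the helper name is hypothetical, only the "GB" unit is handled, and the board's support matrix remains the authoritative source for the largest supported DIMM.

```python
import re

# Sample text copied from the dmidecode output shown above
sample = """\
Maximum Capacity: 384 GB
Number Of Devices: 32
"""


def max_capacity_per_slot(dmidecode_output: str) -> float:
    """Return the average maximum capacity per DIMM slot in GB."""
    capacity = re.search(r"Maximum Capacity:\s*(\d+)\s*GB", dmidecode_output)
    devices = re.search(r"Number Of Devices:\s*(\d+)", dmidecode_output)
    if capacity is None or devices is None:
        raise ValueError("expected fields not found in dmidecode output")
    return int(capacity.group(1)) / int(devices.group(1))


print(max_capacity_per_slot(sample), "GB per slot")
```

For the sample system this gives 12 GB per slot on average across the 32 slots.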
Frequency & Voltage
Check the memory speed with lshw (package lshw):
>>> lshw -short -C memory | grep DIMM
/0/1b/0 memory 16GiB DIMM DDR3 Synchronous 800 MHz (1.2 ns)
...
Details about voltage and maximum memory frequency with decode-dimms from the Debian package i2c-tools:
>>> modprobe eeprom
>>> decode-dimms
[…]
Fundamental Memory type DDR3 SDRAM
Module Type RDIMM
[…]
Maximum module speed 1600MHz (PC3-12800)
Size 16384 MB
[…]
Operable voltages 1.5V, 1.35V
HBM
HBM (High-Bandwidth Memory) …standardized stacked memory technology
| Standard | Date | Bandwidth¹ | Stack² | Size³ |
|---|---|---|---|---|
| HBM | 2013 | 256 GB/s | | |
| HBM2 | 2016 | 307 GB/s | 4 | 8 GB |
| HBM2e | 2020 | 460 GB/s | 8 | 16 GB |
| HBM3 | 2022 | 819 GB/s | 12 | 24 GB |
¹ max bandwidth per package
² max number of memory dies in stack
³ max capacity per package
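The per-package bandwidth figures for HBM2 and later can be reproduced from the interface width and the per-pin data rate. The sketch assumes a 1024-bit interface per stack and the pin rates listed in the code; neither value appears in the table above, so treat both as assumptions.

```python
HBM_BUS_WIDTH_BITS = 1024  # interface width per stack (assumption)

# Per-pin data rates in Gbit/s (assumptions, not taken from the table)
PIN_RATES = {"HBM2": 2.4, "HBM2e": 3.6, "HBM3": 6.4}


def hbm_bandwidth_gbs(pin_rate_gbit_s: float,
                      bus_width_bits: int = HBM_BUS_WIDTH_BITS) -> float:
    """Peak bandwidth per package in GB/s: pins × rate per pin / 8 bits."""
    return pin_rate_gbit_s * bus_width_bits / 8


for name, rate in PIN_RATES.items():
    print(f"{name}: {hbm_bandwidth_gbs(rate):.1f} GB/s")
```

The results land within a gigabyte per second of the rounded values in the table, which shows that the bandwidth gain comes from the very wide interface rather than from a faster memory cell.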
Increase memory interface performance…
- …improves bandwidth…access times…transfer rates
- …more power-efficient in terms of bits per watt
- …GDDR5…10.66GB/s per watt
- …HBM2e…35GB/s per watt
- …no fundamental change in the underlying memory technology
Utilizes 3D manufacturing technology…
- …2.5D packaging solution
- …stacks of DRAM chips on top of a bus interface
- …placed side-by-side on top of a silicon interposer
- …interposer acts as the bridge between the chips and a board
- …requires the fabrication of what is basically a PCB in silicon
- …brings logic closer to the memory, enabling more bandwidth
- …comes with thermal management challenges
Capacity…
- …limited compared to DRAM accessed through DDR
- …memory is defined in cubes…
- …defined height…4,8,12 or 16 (with HBM3)
- …defined number of data channels (64/128 bits)
- …limited number of HBM dies can fit around the SoC
- …HBM capacities cannot rival the capacity of DDR