AI — LLM Inference Service
Overview
LLM inference1 is the process of performing inference (generating predictions) on a trained LLM
- Large Language Models (LLMs)2, pretrained as next-word predictors…
- …take a series of tokens as input
- …generate (predict) subsequent tokens autoregressively
- …until a stopping criterion is met (typically an output length limit)
- Inference requests vary in the number of input and output tokens
- …inference is embarrassingly parallel at the level of requests
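A minimal end-to-end sketch of such a request, using the Hugging Face transformers library (assumed to be installed along with torch; the model name is only an example, any causal LM works):

```python
# Minimal sketch: tokens in, autoregressively generated tokens out,
# stopped by an output-length limit (max_new_tokens).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input: a series of tokens
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Output: tokens predicted one after another until the stopping criterion
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```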
Phases
The process involves two phases…
- Prefill — Processing the input
- Process input tokens …compute intermediate states (keys and values)
- …used to generate the “first” new token
- Resource intensive …highly parallel matrix-matrix operations
- Each new token depends on all the previous tokens
- Decode — Generate the output
- …generates output tokens autoregressively one at a time
- …needs to know all the previous iterations’ output states (keys and values)
- Less resource intensive (compared to prefill) …matrix-vector operation
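A sketch of the two phases with the Hugging Face transformers API, driving the KV cache (past_key_values) by hand; the model name and token count are only examples:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one pass over ALL prompt tokens (parallel matrix-matrix work);
    # yields the keys/values of every input token and the first new token.
    out = model(input_ids=input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Decode: one token per step, reusing the cached keys/values
    # (mostly matrix-vector work per step).
    generated = [next_id]
    for _ in range(10):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```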
Components
Components in an inference engine…
- Batching — Groups multiple queued requests into single execution batch
- …all transferred at once …reduces overhead from individual GPU calls
- …requests share a model …memory cost of weights is spread out
- …improves parallel processing efficiency
- …static batching is suboptimal for variable-length outputs …all requests wait for the longest
- KV Cache (decode phase)
- …autoregressive text generation outputs one token at each step
- …each token depends on the key/value tensors of all input + previously generated tokens
- …use of a dynamic key-value (KV) store …avoid recomputing
- …the KV cache scales linearly with the sequence length
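Toy sketch of a KV cache for a single attention head (plain PyTorch, not any particular engine's implementation): each decode step appends the new token's key/value instead of recomputing them for the whole sequence, so the cache grows by one row per generated token.

```python
import torch

d = 64                           # head dimension (example value)
k_cache = torch.empty(0, d)      # cached keys,   shape (seq_len, d)
v_cache = torch.empty(0, d)      # cached values, shape (seq_len, d)

def decode_step(q, k_new, v_new):
    """One decode step: extend the cache, attend over all cached tokens."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])     # (seq_len + 1, d)
    v_cache = torch.cat([v_cache, v_new])     # (seq_len + 1, d)
    scores = (q @ k_cache.T) / d ** 0.5       # (1, seq_len + 1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                  # (1, d) attention output

for _ in range(5):
    # Only the NEW token's query/key/value are computed each step
    q, k_new, v_new = (torch.randn(1, d) for _ in range(3))
    out = decode_step(q, k_new, v_new)

print(k_cache.shape)  # cache scales linearly with the number of tokens
```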
Optimization3
- Quantization — Reduce numerical precision
- …reduces memory footprint & accelerates inference compared to FP32 (full precision)
- Lower-precision formats: FP16 (half-precision), FP8, FP4 (for edge computing)
- Continuous Batching — Dynamically adds new requests to a running GPU batch
- Server maintains a single batch of requests from various users
- New requests are introduced into the batch from the queue…
- …as completed requests finish and leave the batch
- Small requests can be processed quickly, without waiting for larger requests
- PagedAttention — Sharing of KV cache blocks for requests with similar prompts
- Related to continuous batching …inspired by OS virtual memory
- Fixed size block for KV cache …lookup table to map KV blocks
- Eliminates excessive memory fragmentation in static KV cache allocation
- Speculative Decoding — 2-model workflow to reduce inference latency
- Draft model (smaller/quicker) generates prediction tokens before main model
- Main model validates draft tokens until mismatch …continues generation from there
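A minimal greedy speculative-decoding sketch with two Hugging Face models; the model names are placeholders (they only need to share a tokenizer/vocabulary), and production engines use more elaborate, sampling-aware verification:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()    # small/fast
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()   # main model
tok = AutoTokenizer.from_pretrained("gpt2")                          # shared vocab

@torch.no_grad()
def speculative_step(input_ids, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft_ids = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
    proposed = draft_ids[:, input_ids.shape[1]:]
    # 2. Target model scores prompt + draft tokens in ONE forward pass;
    #    logits at position i predict the token at position i + 1.
    logits = target(draft_ids).logits
    preds = logits[:, input_ids.shape[1] - 1:-1, :].argmax(-1)
    # 3. Accept the longest prefix where draft and target agree.
    match = (preds == proposed)[0].long()
    n_accept = int(match.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    # 4. Target supplies the next token: a correction on mismatch,
    #    or a "bonus" token if all draft tokens were accepted.
    next_tok = logits[:, input_ids.shape[1] - 1 + n_accept, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, accepted, next_tok], dim=-1)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
for _ in range(5):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```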
Models
ONNX4 (Open Neural Network Exchange)
- Open source standard to represent LLMs in a serialized file
Memory
LLM memory (VRAM) requirements are defined by…
- Model Parameters (weights)
- e.g. Llama 2 7B: 7B parameters * sizeof(FP16) ~= 14 GB memory
- KV cache occupied by the self-attention tensors
- …allocates dedicated memory for each request
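Back-of-the-envelope estimate of both contributions; the constants are assumptions based on the published Llama 2 7B configuration (32 layers, 32 KV heads, head dimension 128, FP16 weights and cache):

```python
PARAMS       = 7e9    # model parameters
BYTES_FP16   = 2
NUM_LAYERS   = 32
NUM_KV_HEADS = 32
HEAD_DIM     = 128

weight_bytes = PARAMS * BYTES_FP16   # model weights

def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> float:
    # 2 tensors (K and V) per layer, per token, per request
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * seq_len * batch_size

print(f"weights:  {weight_bytes / 1e9:.1f} GB")                        # ~14 GB
print(f"KV cache: {kv_cache_bytes(4096) / 1e9:.1f} GB per 4096-token request")
```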
Architecture
| Component | Description |
| --- | --- |
| UI | Front-end for user input (web/mobile) |
| API Gateway | Access point to the backend service |
| Load Balancer | Distributes incoming requests from the API gateway |
| Inference Server | Processes input data via the LLM and generates output |
| Model Repository | Stores trained LLMs |
| Pre Processing | Prepares input for the inference model |
| Post Processing | Formats output for the user |
| Database | Stores user interactions, model metadata |
Inference Server
Inference Server — Acts as a bridge between the LLM and users' inference requests
Inference Engine — Executes a trained LLM efficiently on the hardware
- Query queue — Manages incoming requests in a queue
- …to avoid overwhelming the model
- …group multiple requests together into a batch
- Model execution — Runs the LLM model
- …allocates required hardware (typically a GPU)
- …trained models in .pt (PyTorch) or .tf2 (TensorFlow) format
- …model takes the batched input and generates output tokens
- Query response — Gathers model outputs
- …splits into individual responses for each original request
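A minimal sketch of the queue → batch → split-responses flow above, using only the Python standard library; run_model_on_batch is a hypothetical stand-in for the actual model execution:

```python
import queue
import threading

request_queue = queue.Queue()   # holds (prompt, reply_queue) pairs
MAX_BATCH_SIZE = 8

def run_model_on_batch(prompts):
    # Hypothetical placeholder for LLM execution on the GPU.
    return [f"<generated text for: {p}>" for p in prompts]

def batching_worker():
    while True:
        # Query queue: block for the first request, then drain up to a batch.
        batch = [request_queue.get()]
        while len(batch) < MAX_BATCH_SIZE:
            try:
                batch.append(request_queue.get_nowait())
            except queue.Empty:
                break
        prompts = [prompt for prompt, _ in batch]
        # Model execution: one call for the whole batch.
        outputs = run_model_on_batch(prompts)
        # Query response: split the batch output into individual responses.
        for (_, reply_q), output in zip(batch, outputs):
            reply_q.put(output)

threading.Thread(target=batching_worker, daemon=True).start()

def infer(prompt: str) -> str:
    reply_q = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()   # blocks until the batch containing this request ran

print(infer("Hello"))
```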
Related projects: vLLM5, TensorRT-LLM6, Hugging Face TGI7, SGLang8, DeepSpeed9, Triton Inference Server10
Load Balancing
Orchestration of multiple LLM services…
- Llumnix11
Metrics
Performance metrics:
- TTFT (Time to First Token)
- Responsiveness from the user perspective
- Time until a model begins to generate output
- TPOT (Time per Output Token) aka ITL (Inter Token Latency)
- Average time to generate each subsequent token
- Overall fluidity of the generation experience for the user
- Total Time to Generation
- Measures total time to process input and generate all tokens
- ISL (Input Sequence Length)
- OSL (Output Sequence Length)
- Throughput (tokens per second)
- Total output generation capacity
- Measure of hardware utilization related to output generation
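A simple way to derive TTFT, TPOT, total time and throughput from timestamps, assuming the service exposes a token stream (stream below is any iterable that yields output tokens as they are generated):

```python
import time

def measure(stream):
    """Compute latency metrics from a stream of generated tokens."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream]       # one timestamp per token
    assert len(stamps) >= 2, "need at least two tokens"

    ttft = stamps[0] - start                              # Time to First Token
    tpot = (stamps[-1] - stamps[0]) / (len(stamps) - 1)   # Time per Output Token
    total = stamps[-1] - start                            # Total Time to Generation
    throughput = len(stamps) / total                      # tokens per second
    return ttft, tpot, total, throughput
```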
Benchmarking needs to be aligned with workload profile…
- Low-concurrency …few simultaneous users …evaluate baseline performance
- High-concurrency …many simultaneous users …evaluate throughput scaling
- Variable-length workloads …different prompt lengths to stress memory & scheduling
Maximum Batch Weight12 — Maximum allowed ‘volume’ of the batch…
- …total number of input and output tokens of all requests processed
- …the larger the memory space, the more requests can be processed in parallel
- …improves the throughput of the inference service
Questions
What is the difference between LLM training and inference?
- Training — Teach a model to recognize patterns …make accurate predictions
- …computationally intensive …requires expensive GPU/TPU clusters
- …initial training cost high …one-time expense to create the model
- …periodical retraining required to update/improve the model
- Inference — Apply a trained model to make predictions
- …continuously …responds in real-time to user input
- …smaller resource requirements than training
- …resource requirements cumulative with the number of user requests
What is the difference between serverless vs. self-hosted LLM inference?
- Serverless — Use a provider like OpenAI13, Anthropic14
- As simple as: send prompt …get response …pay by the token
- Providers typically charge based on the number of tokens processed
- Benefits …no infrastructure required, rapid prototyping, hardware abstraction
- Potential limits …cost, latency, customization, privacy, compliance
- Self-hosted — On-premises deployment of an LLM inference service
- Requires hardware infrastructure and deployment/operations expertise
- Benefits …full control over models, deployment, costs, privacy, compliance
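Illustrative sketch: the same client code against a serverless provider and a self-hosted, OpenAI-compatible endpoint (e.g. a local vLLM server); the model names, URL and API keys below are placeholders:

```python
from openai import OpenAI

# Serverless: managed provider, pay per token
serverless = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = serverless.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

# Self-hosted: your own inference server exposing the same API shape
self_hosted = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = self_hosted.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server loaded
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```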
What is the difference between proprietary and open-source models?
- Proprietary — GPT-4 (OpenAI), Claude (Anthropic)
- …developed by private companies
- …often with state-of-the-art performance
- …usually through API calls or cloud services
- Open Source — DeepSeek-R1, Llama 4, Mistral
- …publicly available code and weights
- …community-driven improvements
- …download …run on local infrastructure
- LMArena15 comparison of proprietary and open-source models
References
Footnotes
1. LLM Inference Handbook, https://bentoml.com/llm
2. Hugging Face Models, https://huggingface.co/models
3. Mastering LLM Techniques: Inference Optimization, NVIDIA (2025/09/24), https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
4. ONNX Runtime Documentation, https://onnxruntime.ai/docs/
5. vLLM, GitHub, https://github.com/vllm-project/vllm
6. TensorRT-LLM, NVIDIA, GitHub, https://github.com/NVIDIA/TensorRT-LLM
7. Hugging Face TGI, GitHub, https://github.com/huggingface/text-generation-inference
8. SGLang, GitHub, https://github.com/sgl-project/sglang
9. DeepSpeed, GitHub, https://github.com/deepspeedai/DeepSpeed
10. Triton Inference Server, NVIDIA, GitHub, https://github.com/triton-inference-server/server
11. Llumnix, Alibaba, GitHub, https://github.com/AlibabaPAI/llumnix
12. LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/html/2410.02425
13. OpenAI, https://openai.com
14. Anthropic, https://www.anthropic.com
15. LMArena, https://lmarena.ai/leaderboard