AI — LLM Inference Service

AI
Published

September 23, 2025

Modified

September 24, 2025

Overview

LLM inference1 is the process of performing inference (generating predictions) with a trained LLM

  • Large Language Model (LLM)2, pretrained as a next-word predictor
    • …take a series of tokens as input
    • …generate (predict) subsequent tokens autoregressively
    • …until a stopping criterion is met (typically an output length limit)
  • Inference requests vary in the number of input and output tokens
    • …inference is embarrassingly parallel at the level of requests

Phases

The process involves two phases (a minimal sketch follows the list)…

  • Prefill — Processing the input
    • Process input tokens …compute intermediate states (keys and values)
    • …used to generate the “first” new token
    • Resource-intensive …highly parallel matrix-matrix operations
  • Decode — Generate the output
    • …generates output tokens autoregressively, one at a time
    • …each new token depends on all the previous tokens
    • …needs to know all the previous iterations’ output states (keys and values)
    • Less resource-intensive (compared to prefill) …matrix-vector operations
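
A minimal sketch of the two phases in Python, assuming a hypothetical decoder-only model object whose forward() call returns next-token logits plus the key/value states it produced (not a specific library API):

    def generate(model, prompt_tokens, max_new_tokens, eos_token):
        # Prefill: process the whole prompt in one parallel pass (matrix-matrix work)
        logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
        token = logits.argmax()              # the "first" new token
        output = [token]

        # Decode: one token per step, reusing the cached keys/values (matrix-vector work)
        while len(output) < max_new_tokens and token != eos_token:
            logits, kv_cache = model.forward([token], kv_cache=kv_cache)
            token = logits.argmax()
            output.append(token)
        return output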

Components

Components in an inference engine…

  • Batching — Groups multiple queued requests into single execution batch
    • …all transferred at once …reduces overhead from individual GPU calls
    • …requests share a model …memory cost of weights is spread out
    • …improves parallel processing efficiency
    • …static batching is suboptimal for variable-length outputs …all requests wait for the longest one
  • KV Cache (decode phase)
    • …autoregressive text generation outputs a token at each step
    • …each token depends on the key/value tensors of all input + previously generated tokens
    • …use of a dynamic key-value (KV) store …avoids recomputing them
    • …the KV cache scales linearly with the sequence length (see the sketch after this list)
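
A minimal sketch of the KV-cache idea, with plain Python lists standing in for per-token key/value tensors; without the cache, every decode step would recompute the keys/values of all previous tokens:

    class KVCache:
        """Toy KV cache: stores one key/value pair per processed token."""

        def __init__(self):
            self.keys = []      # grow linearly with the sequence length
            self.values = []

        def append(self, key, value):
            # Called once per new token; earlier entries are reused, not recomputed
            self.keys.append(key)
            self.values.append(value)

        def __len__(self):
            return len(self.keys)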

Optimization3

  • Quantization — Reduce numerical precision
    • …reduces memory footprint & accelerates inference compared to FP32 (full precision)
    • Lower-precision formats: FP16 (half precision), FP8, FP4 (for edge computing)
  • Continuous Batching — Dynamically adds new requests to a running GPU batch
    • Server maintains a single batch of requests from various users
    • Finished requests leave the batch as they complete…
    • …while new requests are introduced into the batch from the queue
    • Small requests can be processed quickly, without waiting for larger requests
  • PagedAttention — Sharing of KV cache blocks for requests with similar prompts
    • Related to continuous batching …inspired by OS virtual memory
    • Fixed-size blocks for the KV cache …lookup table to map KV blocks
    • Eliminates excessive memory fragmentation in static KV cache allocation
  • Speculative Decoding — Two-model workflow to reduce inference latency (sketch after this list)
    • Draft model (smaller/quicker) generates prediction tokens before main model
    • Main model validates draft tokens until mismatch …continues generation from there
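
A minimal sketch of the speculative-decoding loop described above, assuming hypothetical draft_model/main_model objects with greedy next_token()/predict_positions() helpers (real implementations verify against probabilities rather than exact tokens):

    def speculative_step(draft_model, main_model, context, k=4):
        # Draft model proposes k candidate tokens autoregressively (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(context + draft))

        # Main model scores the context plus all drafted tokens in one parallel pass,
        # returning the token it would itself have chosen at each drafted position
        predictions = main_model.predict_positions(context, draft)

        # Accept draft tokens until the first mismatch; keep the main model's token
        # at the mismatching position and continue generation from there
        accepted = []
        for proposed, predicted in zip(draft, predictions):
            if proposed == predicted:
                accepted.append(proposed)
            else:
                accepted.append(predicted)
                break
        return context + accepted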

Models

ONNX4 (Open Neural Network Exchange)

  • Open-source standard to represent LLMs in a serialized file (see the export sketch below)
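
A minimal export sketch using PyTorch's torch.onnx.export; the model, input shape, and file name below are placeholders, and a real LLM export would also declare dynamic axes for batch size and sequence length:

    import torch

    model = torch.nn.Linear(16, 16)        # placeholder model (any torch.nn.Module)
    example_input = torch.randn(1, 16)     # placeholder example input
    torch.onnx.export(model, example_input, "model.onnx")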

Memory

LLM memory (VRAM) requirements are defined by… (a worked example follows the list)

  • Model Parameters (weights)
    • e.g. Llama 2 7B parameters * sizeof(FP16) = 7B * 2 bytes ~= 14 GB memory
  • KV cache occupied by the self-attention tensors
    • …allocates dedicated memory for each request
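
A back-of-the-envelope calculation combining both contributions; the Llama 2 7B shapes (32 layers, hidden size 4096) are the published ones, while the sequence length and batch size are example values:

    # Model weights: parameters * bytes per parameter
    params = 7e9
    bytes_fp16 = 2
    weights_gb = params * bytes_fp16 / 1e9                  # ~14 GB

    # KV cache per token: 2 (keys and values) * layers * hidden_size * bytes
    layers, hidden = 32, 4096                                # Llama 2 7B shapes
    kv_per_token = 2 * layers * hidden * bytes_fp16          # ~0.5 MB per token

    # Example: 4096-token sequences, 8 concurrent requests
    seq_len, batch = 4096, 8
    kv_cache_gb = kv_per_token * seq_len * batch / 1e9       # ~17 GB

    print(f"weights: {weights_gb:.1f} GB, KV cache: {kv_cache_gb:.1f} GB")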

Architecture

Component           Description
UI                  Front-end for user input (web/mobile)
API Gateway         Access point to the backend service
Load Balancer       Distributes incoming requests from the API gateway
Inference Server    Processes input data via the LLM and generates output
Model Repository    Stores trained LLMs
Pre-Processing      Prepares input for the inference model
Post-Processing     Formats output for the user
Database            Stores user interactions and model metadata

Inference Server

Inference Server — Acts as a bridge between the LLM and users’ inference requests

Inference Engine — Executes a trained LLM efficiently on hardware (a minimal serving-loop sketch follows the list)

  • Query queue — Manages incoming requests in a queue
    • …to avoid overwhelming the model
    • …group multiple requests together into a batch
  • Model execution — Runs the LLM model
    • …allocates required hardware (typically a GPU)
    • …trained models, e.g. in PyTorch (.pt) or TensorFlow format
    • …model takes the batched input and generates output tokens
  • Query response — Gathers model outputs
    • …splits into individual responses for each original request
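
A minimal sketch of the queue → batch → execute → split flow described above; run_model() is a placeholder for the actual engine call:

    import queue

    request_queue = queue.Queue()        # incoming (request_id, prompt) pairs
    MAX_BATCH_SIZE = 8

    def serve_once(run_model):
        # Query queue: drain up to MAX_BATCH_SIZE pending requests into one batch
        batch = []
        while not request_queue.empty() and len(batch) < MAX_BATCH_SIZE:
            batch.append(request_queue.get())
        if not batch:
            return {}

        # Model execution: one forward pass over the whole batch
        request_ids = [request_id for request_id, _ in batch]
        prompts = [prompt for _, prompt in batch]
        outputs = run_model(prompts)     # placeholder for the engine call

        # Query response: split the batched output back into per-request results
        return dict(zip(request_ids, outputs))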

Related projects:

  • Inference Engine
    • vLLM5
    • TensorRT-LLM6, NVIDIA
  • Inference Server
    • HuggingFace TGI7
    • SGLang8
    • DeepSpeed9
    • Triton10, NVIDIA

Load Balancing

Orchestration of multiple LLM services…

  • Llumnix11

Metrics

Performance metrics (a small computation sketch follows the list):

  • TTFT (Time to First Token)
    • Responsiveness from the user perspective
    • Time until a model begins to generate output
  • TPOT (Time per Output Token) aka ITL (Inter Token Latency)
    • Average time to generate each subsequent token
    • Overall fluidity of the generation experience for the user
  • Total Time to Generation
    • Measures total time to process the input and generate all output tokens
    • …depends on ISL (Input Sequence Length) and OSL (Output Sequence Length)
  • Throughput (tokens per second)
    • Total output generation capacity
    • Measure of hardware utilization related to output generation
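
A small sketch of how these metrics relate, computed from per-request timestamps (variable names are illustrative):

    def latency_metrics(t_request, t_first_token, t_last_token, output_tokens):
        ttft = t_first_token - t_request                     # Time to First Token
        total_time = t_last_token - t_request                # total time to generation
        # TPOT / ITL: average gap between the tokens generated after the first one
        tpot = (t_last_token - t_first_token) / max(output_tokens - 1, 1)
        throughput = output_tokens / total_time              # tokens per second (one request)
        return ttft, tpot, total_time, throughput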

Benchmarking needs to be aligned with the workload profile (a simple concurrency sweep is sketched after the list)…

  • Low-concurrency …few simultaneous users …evaluate baseline performance
  • High-concurrency …many simultaneous users …evaluate throughput scaling
  • Variable-length workloads …different prompt length to stress memory & scheduling
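
A minimal sketch of such a concurrency sweep using a thread pool; send_request() is a placeholder for the actual client call:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def benchmark(send_request, prompts, concurrency):
        # Issue `concurrency` requests simultaneously and measure wall-clock time;
        # run with low and high values to compare baseline latency vs. throughput scaling
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            results = list(pool.map(send_request, prompts[:concurrency]))
        return time.time() - start, results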

Maximum Batch Weight12 — Maximum allowed ‘volume’ of the batch…

  • …total number of input and output tokens across all requests processed together
  • …the larger the memory space, the more requests can be processed in parallel
  • …improves the throughput of the inference service (an illustrative calculation follows)
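
An illustrative calculation, assuming a hypothetical limit of 16,384 tokens and a fixed request shape:

    # Each request 'weighs' its input plus output tokens (ISL + OSL)
    max_batch_weight = 16_384                 # assumed engine limit, in tokens
    isl, osl = 1_024, 1_024                   # example request shape
    parallel_requests = max_batch_weight // (isl + osl)   # 8 requests fit in the batch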

Questions

What is the difference between LLM training and inference?

  • Training — Teach a model to recognize patterns …make accurate predictions
    • …computationally intensive …requires expensive GPU/TPU clusters
    • …initial training cost high …one-time expense to create the model
    • …periodic retraining required to update/improve the model
  • Inference — Apply a trained model to make predictions
    • …continuously …responds in real-time to user input
    • …smaller resource requirements than training
    • …resource requirements cumulative with the number of user requests

What is the difference between serverless vs. self-hosted LLM inference?

  • Serverless — Use a provider like OpenAI13, Anthropic14 (a rough cost calculation follows this list)
    • As simple as: send prompt …get response …pay by the token
    • Providers typically charge based on the number of tokens processed
    • Benefits …no infrastructure required, rapid prototyping, hardware abstraction
    • Potential limits …cost, latency, customization, privacy, compliance
  • Self-hosted — On-premises deployment of an LLM inference service
    • Requires hardware infrastructure and deployment/operations expertise
    • Benefits …full control over models, deployment, costs, privacy, compliance
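
A rough token-cost calculation for the serverless option; the per-token prices below are placeholders, not any provider's actual pricing:

    # Placeholder prices per 1M tokens (check the provider's current price list)
    price_input_per_1m = 3.00      # USD
    price_output_per_1m = 15.00    # USD

    input_tokens, output_tokens = 2_000, 500        # one example request
    requests_per_month = 100_000

    monthly_cost = requests_per_month * (
        input_tokens * price_input_per_1m + output_tokens * price_output_per_1m
    ) / 1e6                                          # ~1,350 USD per month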

What is the difference between proprietary and open-source models?

  • Proprietary — GPT-4 (OpenAI), Claude (Anthropic)
    • …developed by private companies
    • …often with state-of-the-art performance
    • …usually through API calls or cloud services
  • Open Source — DeepSeek-R1, Llama 4, Mistral
    • …publicly available code and weights
    • …community-driven improvements
    • …download …run on local infrastructure
  • LMArena15 comparison of proprietary and open-source models

References