Welcome to LLMServingSim

LLMServingSim is your sandbox for heterogeneous and disaggregated LLM serving infrastructure. Want to model a brand-new GPU? Sweep parallelism strategies? Throw exotic memory tiers (CXL, PIM) at a workload? Try a 32-GPU cluster you don't have? It's all within reach.

Setup is genuinely quick: clone, launch a container, compile, run. About 10 minutes in total. Once you're in, the simulator gets out of your way.
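A minimal sketch of those four steps, assuming a Docker-based workflow. The repository URL, image name, and build/run steps below are placeholders rather than the project's actual commands; the Simulator setup page has the real sequence.

```bash
# Illustrative only: the URL, image tag, and later steps are placeholders.
# See the Simulator setup page for the actual commands.

# 1. Clone, pulling submodules along with the main repo
git clone --recurse-submodules https://github.com/<org>/LLMServingSim.git
cd LLMServingSim

# 2. Launch a container with the source tree mounted
docker run -it --rm -v "$PWD":/workspace <image-name> bash

# 3. Compile inside the container (project-specific build step)
# 4. Run your first simulation (project-specific run command)
```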

What you can do

Any topology, instantly

TP / PP / EP / DP+EP across multiple instances. Chunked prefill, prefix caching, MoE expert routing, KV-cache offloading: mix and match them all in a config file, with no code changes.
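To make the mix-and-match idea concrete, here is a hypothetical config fragment. Every key name is invented for illustration and is not the simulator's actual schema.

```yaml
# Hypothetical schema, for illustration only; consult the docs for real keys.
parallelism:
  tensor: 4          # TP degree
  pipeline: 2        # PP stages
  expert: 8          # EP for MoE layers
features:
  chunked_prefill: true
  prefix_caching: true
  kv_cache_offload: cxl   # e.g., spill KV cache to a CXL memory tier
```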

Any hardware, profiled

Plug in a new GPU, CXL tier, or PIM device. The vLLM-based layerwise profiler captures real CUDA timings and feeds them straight into the simulator.
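To give a sense of what layerwise profiling involves, the sketch below times each leaf module of a PyTorch model with CUDA events. It illustrates the general technique only; the project's vLLM-based profiler is separate tooling.

```python
import torch

def profile_layerwise(model, inputs):
    """Illustrative per-layer timing with CUDA events (not the project's profiler)."""
    timings = {}
    handles = []

    def make_hooks(name):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        def pre_hook(module, args):
            start.record()

        def post_hook(module, args, output):
            end.record()
            end.synchronize()                        # wait so elapsed_time is valid
            timings[name] = start.elapsed_time(end)  # milliseconds

        return pre_hook, post_hook

    # Hook only leaf modules so each timing covers one layer, not its children
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:
            pre, post = make_hooks(name)
            handles.append(module.register_forward_pre_hook(pre))
            handles.append(module.register_forward_hook(post))

    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return timings
```

A real profiler would also discard warm-up iterations and average over many runs; this one-shot version just keeps the idea visible.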

Validated against vLLM

Sub-3% end-to-end error on TTFT, TPOT, and throughput: the numbers reflect what production serving actually delivers.
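For reference, all three metrics reduce to simple timestamp arithmetic. The definitions below follow common usage; they are illustrative, not quoted from the simulator's source.

```python
def serving_metrics(arrival, first_token, finish, n_output_tokens):
    """Illustrative metric definitions, per request; times in seconds."""
    ttft = first_token - arrival                        # time to first token
    # TPOT: mean gap between output tokens after the first one
    tpot = (finish - first_token) / max(n_output_tokens - 1, 1)
    throughput = n_output_tokens / (finish - arrival)   # tokens per second
    return ttft, tpot, throughput
```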

Wild scenarios welcome

ShareGPT traces, agentic sessions, 10× clusters, unreleased GPUs, exotic memory tiers: run the experiments you cannot easily run on real hardware.

Three steps to your first simulation

Prerequisites at a glance

A Linux host with Docker is all you need to get started.

Required:
- Linux (Ubuntu 22.04+ tested)
- Docker
- Git with submodule support
- ~12 GB free disk

Optional but recommended:
- NVIDIA GPU + Container Toolkit (only for profiling new HW)
- Hugging Face token (for gated model configs)
- 32 GB RAM (for the vLLM benchmark side)
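Want a quick sanity check before cloning? The required pieces can be verified with standard commands (nothing project-specific here):

```bash
docker --version    # Docker installed?
git --version       # Git available; any modern version handles submodules
df -h .             # roughly 12 GB free?
nvidia-smi          # optional: is a GPU visible (needed only for profiling)
```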

Already have everything? Jump straight to Simulator setup and run your first sim in 10 minutes.

Need a hand?