Validation

LLMServingSim is validated end-to-end against real vLLM on the bundled (hardware, model) combos. The numbers below come from running a 300-request ShareGPT replay through both vLLM v0.19.0 and the simulator on RTXPRO6000, then comparing the per-request and per-tick metrics with `python -m bench validate`.

Want to validate your own change? See For Contributors → Validating your changes for the regression workflow.

Setup

| Knob | Value |
| --- | --- |
| Workload | 300 ShareGPT-derived requests, ~10 req/s Poisson arrivals |
| Hardware | RTXPRO6000 (single node, profile bundle in `profiler/perf/RTXPRO6000/`) |
| vLLM version | v0.19.0 (the pin used by the bench container) |
| Block size | 16 |
| Engine flags | Defaults except where the cluster config dictates otherwise |
| Cluster configs | `bench/examples/configs/<model>.json` |
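The ~10 req/s Poisson arrival rate in the workload row above means the inter-arrival gaps are i.i.d. exponential. A minimal sketch of how such a trace can be generated (the function and seed are illustrative, not bench's actual trace generator):

```python
import random

def poisson_arrivals(n_requests: int, rate_per_s: float, seed: int = 0) -> list[float]:
    """Arrival timestamps for a Poisson process: i.i.d. exponential gaps."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_s)  # mean gap = 1 / rate_per_s
        out.append(t)
    return out

arrivals = poisson_arrivals(300, 10.0)  # 300 requests at ~10 req/s ≈ a 30 s trace
```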

Inputs and outputs (vLLM token IDs, sampling params, per-request timings) are pinned via bench's strict-replay path so both runs process exactly the same prompts in the same order.
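Concretely, a pinned record carries enough to make both runs deterministic; a hypothetical entry might look like the dict below (field names are illustrative assumptions, not bench's actual schema):

```python
# Illustrative shape of one strict-replay record; bench's real schema may differ.
pinned_request = {
    "request_id": 17,
    "arrival_s": 1.732,                        # offset from trace start
    "prompt_token_ids": [128000, 9906, 1917],  # exact vLLM token IDs, not text
    "output_token_ids": [271, 9906, 13],       # replayed verbatim so lengths match
    "sampling_params": {"temperature": 0.0, "max_tokens": 256},
}
```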

Headline numbers

Mean error vs. real vLLM, per metric, on the three currently bundled configurations:

| Model | Parallelism | TTFT mean | TPOT mean | Latency mean |
| --- | --- | --- | --- | --- |
| Llama-3.1-8B | TP=1 dense | -2.8% | -0.3% | -1.0% |
| Qwen3-32B | TP=2 dense | -0.7% | -0.3% | -0.4% |
| Qwen3-30B-A3B-Instruct-2507 | DP=2 × EP=2 MoE | -2.9% | +0.6% | +0.4% |

Across all three, the TTFT / TPOT / latency means stay within 3% of vLLM, and the DP+EP MoE path tracks vLLM as tightly as the dense TP path. Per-percentile numbers (P50 / P90 / P95 / P99) are in the per-model `summary.txt` files under `bench/examples/`.
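For reading these tables: TTFT, TPOT, and latency follow the usual serving definitions, and the signed diff is the simulator's mean relative to vLLM's. A sketch of the convention (helper names are mine, not bench's):

```python
import statistics

def ttft(first_token_s: float, arrival_s: float) -> float:
    """Time to first token: request arrival until the first output token."""
    return first_token_s - arrival_s

def tpot(finish_s: float, first_token_s: float, n_output_tokens: int) -> float:
    """Time per output token: decode time averaged over tokens after the first."""
    return (finish_s - first_token_s) / max(n_output_tokens - 1, 1)

def signed_diff(sim: list[float], vllm: list[float]) -> float:
    """Headline diff: (mean_sim - mean_vllm) / mean_vllm, e.g. -0.028 -> -2.8%."""
    m_sim, m_vllm = statistics.mean(sim), statistics.mean(vllm)
    return (m_sim - m_vllm) / m_vllm
```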

Per-model results

Llama-3.1-8B (TP=1 dense)

Throughput timeline, vLLM (orange) vs. simulator (blue):

*[Figure: Llama-3.1-8B throughput timeline]*

Headline error vs. vLLM:

| Metric | vLLM | Sim | Diff |
| --- | --- | --- | --- |
| TTFT mean | 7.10 s | 6.90 s | -2.8% |
| TTFT P99 | 19.76 s | 19.48 s | -1.4% |
| TPOT mean | 32.5 ms | 32.3 ms | -0.3% |
| TPOT P99 | 37.3 ms | 37.6 ms | +0.6% |
| Latency mean | 28.20 s | 27.92 s | -1.0% |
| Latency P99 | 37.64 s | 37.41 s | -0.6% |

Single-instance dense Llama is the simplest configuration. The simulator slightly under-predicts TTFT (-2.8%) and tracks TPOT and end-to-end latency to within about one percent.

Qwen3-32B (TP=2 dense)

Throughput timeline:

*[Figure: Qwen3-32B throughput timeline]*

Headline error vs. vLLM:

| Metric | vLLM | Sim | Diff |
| --- | --- | --- | --- |
| TTFT mean | 36.91 s | 36.66 s | -0.7% |
| TTFT P99 | 93.35 s | 92.45 s | -1.0% |
| TPOT mean | 80.3 ms | 80.1 ms | -0.3% |
| TPOT P99 | 97.1 ms | 97.0 ms | -0.1% |
| Latency mean | 90.41 s | 90.02 s | -0.4% |
| Latency P99 | 126.34 s | 126.44 s | +0.1% |

TP=2 exercises the dense ALLREDUCE collective on `o_proj` / `down_proj`. All means stay under 1%, and even the P99s land within ~1%.
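For context, this is the standard Megatron-style sharding: the QKV and gate/up projections are column-parallel, `o_proj` and `down_proj` are row-parallel, so each layer ends its attention and MLP blocks with one ALLREDUCE over the TP group. A schematic of what the simulator has to cost per layer (placeholder cost callbacks, not LLMServingSim's actual model):

```python
# Schematic per-layer cost under tensor parallelism; gemm_time and
# allreduce_time are placeholder callbacks, not LLMServingSim internals.
def layer_time_tp(gemm_time, allreduce_time, tokens: int, hidden: int,
                  ffn: int, tp: int) -> float:
    t = gemm_time(tokens, hidden, 3 * hidden // tp)   # qkv_proj (column-parallel)
    t += gemm_time(tokens, hidden // tp, hidden)      # o_proj (row-parallel)
    t += allreduce_time(tokens * hidden, tp)          # ALLREDUCE after o_proj
    t += gemm_time(tokens, hidden, 2 * ffn // tp)     # gate/up (column-parallel)
    t += gemm_time(tokens, ffn // tp, hidden)         # down_proj (row-parallel)
    t += allreduce_time(tokens * hidden, tp)          # ALLREDUCE after down_proj
    return t
```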

Qwen3-30B-A3B-Instruct-2507 (DP=2 × EP=2 MoE)

Throughput timeline:

*[Figure: Qwen3-30B-A3B-Instruct-2507 throughput timeline]*

Headline error vs. vLLM:

| Metric | vLLM | Sim | Diff |
| --- | --- | --- | --- |
| TTFT mean | 1.09 s | 1.05 s | -2.9% |
| TTFT P99 | 9.59 s | 10.03 s | +4.6% |
| TPOT mean | 47.3 ms | 47.6 ms | +0.6% |
| TPOT P99 | 53.3 ms | 54.3 ms | +1.9% |
| Latency mean | 32.34 s | 32.47 s | +0.4% |
| Latency P99 | 43.90 s | 44.12 s | +0.5% |

This is the disaggregated path: data-parallel across two instances, expert-parallel within each instance, with wave-synchronized ALLTOALL on the 2D ASTRA-Sim topology. TTFT is the noisiest metric here (the simulator finishes very short prefills slightly faster, pulling the mean down -2.9%, while the P99 drifts +4.6%), but TPOT and end-to-end latency align with vLLM within ~2% at both the mean and the tail.
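As a mental model for the MoE timing: each wave routes every token's top-k expert assignments, which may live on the other EP rank, so each MoE layer contributes a dispatch ALLTOALL, the local expert GEMMs, and a combine ALLTOALL, with the two DP instances synchronizing at wave boundaries. A hedged sketch (placeholder cost callbacks, not the simulator's scheduler):

```python
# Schematic MoE layer under expert parallelism: dispatch -> compute -> combine.
# alltoall_time and expert_gemm_time are placeholder cost callbacks.
def moe_layer_time(alltoall_time, expert_gemm_time, tokens: int,
                   hidden: int, top_k: int, ep: int) -> float:
    routed = tokens * top_k                  # each token goes to its top_k experts
    t = alltoall_time(routed * hidden, ep)   # dispatch tokens to expert owners
    t += expert_gemm_time(routed, hidden)    # local expert FFNs on each rank
    t += alltoall_time(routed * hidden, ep)  # combine expert outputs back
    return t
```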

Reproducing locally

The bench module ships with reproduction scripts that re-run the simulator side and regenerate the comparison against the committed vLLM artifacts:

```bash
# Sim side: writes bench/examples/<model>/outputs/sim.csv
./bench/examples/run.sh Llama-3.1-8B
./bench/examples/run.sh Qwen3-32B
./bench/examples/run.sh Qwen3-30B-A3B-Instruct-2507

# Compare: writes bench/examples/<model>/validation/{summary.txt, *.png}
./bench/examples/validate.sh Llama-3.1-8B
./bench/examples/validate.sh Qwen3-32B
./bench/examples/validate.sh Qwen3-30B-A3B-Instruct-2507
```

The validation step regenerates the throughput / latency / requests plots and the headline summary. To rerun vLLM itself (instead of reusing the committed artifacts under `bench/examples/<model>/vllm/`), use `python -m bench run` from inside the vLLM container; see `bench/README.md` for the full layout.
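If you just want to poke at the numbers without rerunning validate.sh, the artifacts are plain CSVs. A minimal comparison sketch, assuming per-request `ttft_s` / `tpot_ms` / `latency_s` columns and a `vllm.csv` file name (both are assumptions; check `bench/README.md` for the real layout):

```python
import csv
import statistics

def mean_col(path: str, col: str) -> float:
    with open(path) as f:
        return statistics.mean(float(row[col]) for row in csv.DictReader(f))

base = "bench/examples/Llama-3.1-8B"
for col in ("ttft_s", "tpot_ms", "latency_s"):     # assumed column names
    sim = mean_col(f"{base}/outputs/sim.csv", col)
    ref = mean_col(f"{base}/vllm/vllm.csv", col)   # hypothetical file name
    print(f"{col}: vLLM={ref:.3f} sim={sim:.3f} diff={100 * (sim - ref) / ref:+.1f}%")
```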
