Validation

LLMServingSim is validated end-to-end against real vLLM on the bundled (hardware, model) combos. The numbers below come from running a 300-request ShareGPT replay through both vLLM v0.19.0 and the simulator on RTXPRO6000, then comparing the per-request and per-tick metrics with python -m bench validate.

Want to validate your own change? See For Contributors → Validating your changes for the regression workflow.

Setup

Knob	Value
Workload	300 ShareGPT-derived requests, ~10 sps Poisson arrivals
Hardware	RTXPRO6000 (single node, profile bundle in `profiler/perf/RTXPRO6000/`)
vLLM version	`v0.19.0` (the pin used by the bench container)
Block size	16
Engine flags	Defaults except where the cluster config dictates otherwise
Cluster configs	`bench/examples/configs/<model>.json`

Inputs and outputs (vLLM token IDs, sampling params, per-request timings) are pinned via bench's strict-replay path so both runs process exactly the same prompts in the same order.
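As a rough illustration of the arrival pattern in the table above, the sketch below generates Poisson arrival timestamps by summing exponential inter-arrival gaps. It assumes the "~10 sps" rate means roughly ten arrivals per second; the function name and the seed are hypothetical and are not part of bench, which replays pinned per-request timings rather than sampling new ones.

```python
import random


def poisson_arrival_times(n_requests, rate_per_s, seed=0):
    """Return n_requests arrival timestamps (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_requests):
        # Inter-arrival gaps of a Poisson process are exponential with mean 1/rate.
        t += rng.expovariate(rate_per_s)
        arrivals.append(t)
    return arrivals


# Assumed workload shape: 300 requests at about 10 arrivals per second.
arrivals = poisson_arrival_times(n_requests=300, rate_per_s=10.0)
print(f"{len(arrivals)} requests, last arrival at {arrivals[-1]:.1f} s")
```

Fixing the seed makes the schedule reproducible, which serves the same purpose as the strict-replay pinning described above: both runs see identical arrivals.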

Headline numbers

Mean error vs. real vLLM, per metric, on the three currently bundled configurations:

Model	Parallelism	TTFT mean	TPOT mean	Latency mean
Llama-3.1-8B	TP=1 dense	-2.8%	-0.3%	-1.0%
Qwen3-32B	TP=2 dense	-0.7%	-0.3%	-0.4%
Qwen3-30B-A3B-Instruct-2507	DP=2 × EP=2 MoE	-2.9%	+0.6%	+0.4%

Across all three configurations, the TTFT, TPOT, and latency means stay within 3% of vLLM, and the DP+EP MoE path tracks vLLM about as tightly as the dense TP paths. Per-percentile numbers (P50 / P90 / P95 / P99) are in each model's summary.txt under bench/examples/<model>/validation/.
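The percentages on this page are signed relative errors of the simulator against vLLM. The sketch below shows one way to compute them for the mean and the tail percentiles from two lists of per-request values. The helper names are hypothetical, and the nearest-rank percentile is an assumption: `python -m bench validate` may use a different interpolation rule.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def pct_diff(sim, ref):
    """Signed relative error of sim against ref, in percent."""
    return (sim - ref) / ref * 100.0


def compare(vllm_values, sim_values):
    """Return {stat: (vllm, sim, diff_percent)} for mean and P50/P90/P95/P99."""
    stats = {"mean": mean}
    for p in (50, 90, 95, 99):
        stats[f"P{p}"] = lambda xs, p=p: percentile(xs, p)
    result = {}
    for name, fn in stats.items():
        ref, sim = fn(vllm_values), fn(sim_values)
        result[name] = (ref, sim, pct_diff(sim, ref))
    return result


# Sanity check against the Llama-3.1-8B TTFT mean reported on this page
# (vLLM 7.10 s, simulator 6.90 s).
print(f"{pct_diff(6.90, 7.10):+.1f}%")  # prints -2.8%
```

A negative value means the simulator predicts a smaller number than vLLM measured; a positive value means it over-predicts.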

Per-model results

Llama-3.1-8B (TP=1 dense)

Throughput timeline, vLLM (orange) vs. simulator (blue):

[Figure: Llama-3.1-8B throughput]

Headline error vs. vLLM:

Metric	vLLM	Sim	Diff
TTFT mean	7.10 s	6.90 s	-2.8%
TTFT P99	19.76 s	19.48 s	-1.4%
TPOT mean	32.5 ms	32.3 ms	-0.3%
TPOT P99	37.3 ms	37.6 ms	+0.6%
Latency mean	28.20 s	27.92 s	-1.0%
Latency P99	37.64 s	37.41 s	-0.6%

Single-instance dense Llama is the simplest configuration. The simulator slightly under-predicts mean TTFT (-2.8%) and tracks TPOT and end-to-end latency to within about 1%.
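For readers less familiar with the three metrics in the table above, the sketch below derives them from per-request timestamps using the definitions common in serving benchmarks. This page does not spell out the exact formulas bench uses, so treat this as an assumption; the `RequestTiming` field names are hypothetical and do not reflect the actual CSV schema.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Hypothetical field names; the bench CSV schema may differ.
    arrival_s: float       # when the request entered the system
    first_token_s: float   # when the first output token was produced
    finish_s: float        # when the last output token was produced
    output_tokens: int     # number of generated tokens


def ttft(r):
    """Time to first token: queueing plus prefill."""
    return r.first_token_s - r.arrival_s


def latency(r):
    """End-to-end latency from arrival to the last token."""
    return r.finish_s - r.arrival_s


def tpot(r):
    """Mean time per output token after the first; None for single-token outputs."""
    if r.output_tokens < 2:
        return None
    return (r.finish_s - r.first_token_s) / (r.output_tokens - 1)


r = RequestTiming(arrival_s=0.0, first_token_s=0.5, finish_s=2.5, output_tokens=101)
print(ttft(r), tpot(r), latency(r))  # prints 0.5 0.02 2.5
```

Under these definitions TTFT is dominated by queueing and prefill while TPOT reflects decode speed, so the two can diverge independently between simulator and vLLM.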

Qwen3-32B (TP=2 dense)

Throughput timeline:

[Figure: Qwen3-32B throughput]

Headline error vs. vLLM:

Metric	vLLM	Sim	Diff
TTFT mean	36.91 s	36.66 s	-0.7%
TTFT P99	93.35 s	92.45 s	-1.0%
TPOT mean	80.3 ms	80.1 ms	-0.3%
TPOT P99	97.1 ms	97.0 ms	-0.1%
Latency mean	90.41 s	90.02 s	-0.4%
Latency P99	126.34 s	126.44 s	+0.1%

TP=2 exercises the dense ALLREDUCE collective on o_proj / down_proj. All mean errors stay under 1%, and even the P99 errors land within about 1%.

Qwen3-30B-A3B-Instruct-2507 (DP=2 × EP=2 MoE)

Throughput timeline:

[Figure: Qwen3-30B-A3B-Instruct-2507 throughput]

Headline error vs. vLLM:

Metric	vLLM	Sim	Diff
TTFT mean	1.09 s	1.05 s	-2.9%
TTFT P99	9.59 s	10.03 s	+4.6%
TPOT mean	47.3 ms	47.6 ms	+0.6%
TPOT P99	53.3 ms	54.3 ms	+1.9%
Latency mean	32.34 s	32.47 s	+0.4%
Latency P99	43.90 s	44.12 s	+0.5%

This is the disaggregated path: data-parallel across two instances, expert-parallel within each instance, with wave-synchronized ALLTOALL on the 2D ASTRA-Sim topology. TTFT P50 is noisier (the simulator finishes very short prefills slightly faster), but the means align with vLLM within ~3% and the P99 tails within ~5% (the largest gap is TTFT P99 at +4.6%).

Reproducing locally

The bench module ships with reproduction scripts that re-run the simulator side and re-run the comparison against the committed vLLM artifacts:

```bash
# Sim side: writes bench/examples/<model>/outputs/sim.csv
./bench/examples/run.sh Llama-3.1-8B
./bench/examples/run.sh Qwen3-32B
./bench/examples/run.sh Qwen3-30B-A3B-Instruct-2507

# Compare: writes bench/examples/<model>/validation/{summary.txt, *.png}
./bench/examples/validate.sh Llama-3.1-8B
./bench/examples/validate.sh Qwen3-32B
./bench/examples/validate.sh Qwen3-30B-A3B-Instruct-2507
```

The validation step regenerates the throughput / latency / requests plots and the headline summary. To rerun vLLM itself (instead of reusing the committed artifacts under bench/examples/<model>/vllm/), use python -m bench run from inside the vLLM container; see bench/README.md for the full layout.
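If you prefer to drive all three models from one place, a small wrapper like the sketch below may help. It is not part of bench: the function names are hypothetical, and it only assumes the two scripts and three model names listed above. It defaults to a dry run that prints the commands in the same order as the listing (all simulator runs, then all comparisons).

```python
import subprocess

MODELS = ["Llama-3.1-8B", "Qwen3-32B", "Qwen3-30B-A3B-Instruct-2507"]


def build_commands(models=MODELS):
    """Return simulator runs for every model, followed by the comparisons."""
    commands = [["./bench/examples/run.sh", m] for m in models]
    commands += [["./bench/examples/validate.sh", m] for m in models]
    return commands


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing script.
            subprocess.run(cmd, check=True)


# Dry run only prints the six commands. Call run_all(dry_run=False) from the
# repository root to actually execute them.
run_all(dry_run=True)
```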

What's next

For Contributors → Validating your changes: the three-tier check (smoke → scenario → bench validate) you run before opening a PR, plus which regressions to flag.
Simulator → Reading the output: what every column in the per-request CSV means and how to derive your own metrics from it.
