Quickstart
Run your first end-to-end simulation in under a minute.
This walkthrough assumes you've finished
Installation → Simulator setup and you're
inside the simulator container at /app/LLMServingSim.
Run the example
python -m serving \
--cluster-config 'configs/cluster/single_node_single_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/example_single_run.csv' \
--log-interval 1.0
That's the whole thing. The simulator will:
- Load the cluster topology from configs/cluster/single_node_single_instance.json (a single RTXPRO6000 GPU running Llama-3.1-8B at TP=1).
- Stream requests from workloads/example_trace.jsonl according to their arrival times.
- Step ASTRA-Sim each scheduling iteration to get cycle counts.
- Write per-request latency metrics to outputs/example_single_run.csv.
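If you would rather launch the run from a Python script (for example, to sweep over configs), the command above can be assembled as an argument list. This is a minimal sketch: `build_command` and `run_simulation` are hypothetical helper names, not part of the simulator, and only the flags shown in this quickstart are used.

```python
import subprocess
import sys


def build_command(cluster_config, dataset, output,
                  dtype="float16", block_size=16, log_interval=1.0):
    """Assemble the quickstart command as an argv list.

    Hypothetical helper: it only mirrors the flags shown in this quickstart.
    """
    return [
        sys.executable, "-m", "serving",
        "--cluster-config", cluster_config,
        "--dtype", dtype,
        "--block-size", str(block_size),
        "--dataset", dataset,
        "--output", output,
        "--log-interval", str(log_interval),
    ]


def run_simulation(cmd, cwd="/app/LLMServingSim"):
    """Run the simulator and raise CalledProcessError on a non-zero exit code.

    The default cwd is the container path this walkthrough assumes.
    """
    return subprocess.run(cmd, cwd=cwd, check=True)


example_cmd = build_command(
    "configs/cluster/single_node_single_instance.json",
    "workloads/example_trace.jsonl",
    "outputs/example_single_run.csv",
)
```

Calling `run_simulation(example_cmd)` inside the simulator container should behave like the shell command above. Passing the arguments as a list avoids shell quoting issues with paths.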
You should see throughput, memory, and power lines printed roughly once per second. After the run finishes:
head outputs/example_single_run.csv
shows the per-request output (request id, prompt and decode tokens, TTFT, TPOT, end-to-end latency, …).
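For a quick look beyond head, a short script can summarize every numeric column in the output. The exact column names are not listed here, so the sketch below makes no assumption about them: it reads the header and summarizes any column whose values all parse as numbers. `summarize_numeric_columns` is a hypothetical helper, not part of the simulator.

```python
import csv
import statistics


def summarize_numeric_columns(lines):
    """Return {column: {"mean", "median", "max"}} for all-numeric columns.

    `lines` is any iterable of CSV text lines, such as an open file.
    Columns containing a non-numeric or missing value are skipped.
    """
    rows = list(csv.DictReader(lines))
    summary = {}
    if not rows:
        return summary
    for column in rows[0].keys():
        values = []
        for row in rows:
            try:
                values.append(float(row[column]))
            except (TypeError, ValueError):
                values = None  # not a numeric column; skip it entirely
                break
        if values:
            summary[column] = {
                "mean": statistics.fmean(values),
                "median": statistics.median(values),
                "max": max(values),
            }
    return summary
```

To use it on the quickstart output, open the file with `open("outputs/example_single_run.csv", newline="")` and pass the file object in. Identifier columns such as a numeric request id will be summarized too, so ignore those entries.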
What the flags mean
| Flag | What it does |
|---|---|
| --cluster-config | Cluster topology + hardware. Generates ASTRA-Sim input files automatically. |
| --dtype | Model weight precision (float16, bfloat16, float32, int8). Picks the matching profile bundle. |
| --block-size | KV-cache block size in tokens. Default 16. |
| --dataset | JSONL file of requests (or agentic sessions). |
| --output | Where to write per-request metrics. |
| --log-interval | How often to print the throughput / memory / power summary line (seconds). |
The full flag list lives at Reference → CLI flags.
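To build intuition for --block-size, the arithmetic below shows how a block size in tokens translates into blocks per request. It assumes the KV cache is allocated in whole fixed-size blocks (paged-attention style), which this page does not confirm for the simulator. Both function names are hypothetical.

```python
def kv_blocks_needed(num_tokens, block_size=16):
    """Blocks needed to hold num_tokens, assuming whole-block allocation."""
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    # Integer ceiling division: round up to the next whole block.
    return -(-num_tokens // block_size)


def unused_slots(num_tokens, block_size=16):
    """Token slots allocated but left empty in the last block."""
    return kv_blocks_needed(num_tokens, block_size) * block_size - num_tokens


# A 100-token sequence with the default block size of 16:
print(kv_blocks_needed(100))  # 7 blocks
print(unused_slots(100))      # 12 empty slots in the last block
```

Under this assumption, a larger block size means fewer blocks to track but more unused slots per sequence on average.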
Try a different scenario
serving/run.sh ships a few worked examples (multi-instance,
prefill/decode disaggregation, MoE with EP, prefix caching, CXL
memory, PIM offload, and sub-batch interleaving):
./serving/run.sh
Each block in that script is self-contained and ready to copy into your own scripts. Browse the cluster configs that drive them:
ls configs/cluster/
What's next
- Simulator → Architecture overview to understand how the simulator runs internally.
- Simulator → Reading the output to understand the metrics in *.csv.
- Workloads → JSONL format to drive the simulator with your own traces.
- Profiler overview if you want to add new hardware or models.