Profiler

The profiler is a vLLM-based layerwise profiler. It drives a real vLLM engine with synthetic batches and records per-layer CUDA kernel latencies into per-category CSV files. Those CSVs are exactly what the simulator's trace_generator reads at run time: the profiler's output is the simulator's input.

When you need to run it

You don't need to run the profiler if your hardware × model combination is already in the bundled profile data. Otherwise:

| Scenario | Profile? |
| --- | --- |
| Running a bundled (hardware, model) combo (e.g., RTXPRO6000 + Llama-3.1-8B) | No, just simulate |
| New GPU (e.g., H100, A100) with a bundled model | Yes, see Adding new hardware |
| Bundled GPU with a new model (Mistral-7B, Phi-3.5-MoE, …) | Maybe, see Adding model architecture |
| Non-GPU accelerator (TPU, custom NPU) | Yes, but with a different workflow; see Adding non-GPU hardware |

What it produces

For each (hardware, model, variant) profiled, the profiler writes a folder under profiler/perf/<hardware>/<model>/<variant>/ with one tp<N>/ subfolder per profiled tensor-parallel degree:

perf/<hardware>/<model>/<variant>/
├── meta.yaml           # engine flags, sweep specs, skew_fit summary
└── tp<N>/
    ├── dense.csv       # token-count → latency
    ├── per_sequence.csv # seq-count → latency
    ├── attention.csv   # 4D: (pc, kv_pre, n_dec, kv_dec) → latency
    ├── moe.csv         # MoE only: (tokens, experts) → latency
    ├── skew.csv        # raw heterogeneous-decode shots
    └── skew_fit.csv    # fitted per-bucket alpha table

Times are stored in microseconds (time_us column); the simulator multiplies by 1000 and rounds to ns at load time.
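As a minimal sketch of that load-time conversion: the snippet below parses a hypothetical `dense.csv` excerpt and converts microseconds to nanoseconds the way the text describes. Only the `time_us` column name comes from the source; the `tokens` column and the sample values are illustrative assumptions.

```python
import csv
import io

# Hypothetical excerpt of a dense.csv; the real file lives at
# profiler/perf/<hardware>/<model>/<variant>/tp<N>/dense.csv.
# Column names other than time_us are assumptions for illustration.
sample = """tokens,time_us
128,412.5
256,801.2
"""

# Mimic the simulator's load step: microseconds -> nanoseconds (x1000),
# rounded to whole nanoseconds.
latency_ns = {}
for row in csv.DictReader(io.StringIO(sample)):
    latency_ns[int(row["tokens"])] = round(float(row["time_us"]) * 1000)

print(latency_ns)  # {128: 412500, 256: 801200}
```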

Schema details are on Output bundle.

How it fits the bigger picture

The profiler runs on the vLLM Docker container (or bare metal via scripts/install-vllm.sh). The simulator runs on the simulator container (astrasim/tutorial-micro2024). They share the profiler/perf/ directory; that is the only thing they exchange.

Bundled profile data

| Hardware | Models profiled | Variants |
| --- | --- | --- |
| RTXPRO6000 | meta-llama/Llama-3.1-8B, Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B-Instruct-2507 | bf16, bf16-kvfp8 |

If your (hardware, model, variant) combo is in this table, you can skip the profiler entirely.

Prerequisites

  • vLLM Docker container running at /workspace (mounts repo root). See Installation → vLLM setup.
  • NVIDIA GPU (needed only for the profiler; the simulator runs on CPU).
  • HF_TOKEN environment variable for gated model configs (Llama 3.x, etc.). Set this in scripts/docker-vllm.sh before launching.
  • A few GB of GPU memory for the model variant you're profiling (TP=1 needs the full model; TP=N needs model_size / N).
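The model_size / N sizing rule above can be checked with quick arithmetic. This sketch assumes bf16 weights (2 bytes per parameter), even sharding across TP ranks, and ignores activations and KV cache, so treat it as a lower bound, not a definitive requirement.

```python
def weight_gib(n_params: float, bytes_per_param: int = 2, tp: int = 1) -> float:
    """Rough per-GPU weight footprint in GiB under tensor parallelism.

    Assumptions (illustrative): weights shard evenly across TP ranks;
    activations and KV cache are not counted.
    """
    return n_params * bytes_per_param / tp / 2**30

# Llama-3.1-8B in bf16: ~15 GiB at TP=1, half that per GPU at TP=2.
print(round(weight_gib(8e9), 1))        # 14.9
print(round(weight_gib(8e9, tp=2), 1))  # 7.5
```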

Where to go next