Profiler
The profiler is a layerwise profiler built on vLLM. It drives a real
vLLM engine with synthetic batches and records per-layer CUDA kernel
latencies into per-category CSV files. Those CSVs are exactly what
the simulator's `trace_generator` reads at run time: the profiler's
output *is* the simulator's input.
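The sketch below illustrates the core measurement idea in miniature, using the public vLLM Python API (`LLM`, `SamplingParams`): push a synthetic batch through a real engine and time it with CUDA events. The actual profiler goes further, hooking individual layers and bucketing kernel time by category; treat this as an illustration, not the implementation.

```python
# Illustrative sketch, not the actual profiler: time one synthetic batch
# end-to-end through a real vLLM engine with CUDA events. The real profiler
# hooks individual layers and splits kernel time by category
# (dense, attention, ...).
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B", dtype="bfloat16")
params = SamplingParams(max_tokens=1, temperature=0.0)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
llm.generate(["hello"] * 8, params)  # synthetic batch of 8 sequences
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) * 1000:.0f} us")  # elapsed_time is ms; x1000 -> us
```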
When you need to run it
You don't need to run the profiler if your hardware × model combination is already in the bundled profile data. Otherwise:
| Scenario | Profile? |
|---|---|
| Running a bundled (hardware, model) combo (e.g., RTXPRO6000 + Llama-3.1-8B) | No, just simulate |
| New GPU (e.g., H100, A100) with a bundled model | Yes, see Adding new hardware |
| Bundled GPU with a new model (Mistral-7B, Phi-3.5-MoE, …) | Maybe, see Adding model architecture |
| Non-GPU accelerator (TPU, custom NPU) | Yes, but a different workflow, see Adding non-GPU hardware |
What it produces
For each (hardware, model, variant) profiled, the profiler writes a
folder under `profiler/perf/<hardware>/<model>/<variant>/` with one
`tp<N>/` subfolder per profiled tensor-parallel degree:
```
perf/<hardware>/<model>/<variant>/
├── meta.yaml            # engine flags, sweep specs, skew_fit summary
└── tp<N>/
    ├── dense.csv        # token-count → latency
    ├── per_sequence.csv # seq-count → latency
    ├── attention.csv    # 4D: (pc, kv_pre, n_dec, kv_dec) → latency
    ├── moe.csv          # MoE only: (tokens, experts) → latency
    ├── skew.csv         # raw heterogeneous-decode shots
    └── skew_fit.csv     # fitted per-bucket alpha table
```
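As a quick sanity check before simulating, you can verify that a bundle is complete. The helper below is hypothetical (the simulator does its own validation); it treats `moe.csv` as optional since that file only exists for MoE models, and the exact on-disk model directory name is an assumption.

```python
from pathlib import Path

# CSVs every tp<N>/ folder must contain; moe.csv is MoE-only, so optional.
REQUIRED = ["dense.csv", "per_sequence.csv", "attention.csv", "skew.csv", "skew_fit.csv"]

def bundle_complete(perf_root: str, hardware: str, model: str,
                    variant: str, tp: int) -> bool:
    """Hypothetical helper: True if a (hardware, model, variant, tp) bundle
    has meta.yaml plus every required per-TP CSV."""
    variant_dir = Path(perf_root, hardware, model, variant)
    tp_dir = variant_dir / f"tp{tp}"
    return (variant_dir / "meta.yaml").is_file() and all(
        (tp_dir / name).is_file() for name in REQUIRED
    )

# e.g. bundle_complete("profiler/perf", "RTXPRO6000", "Llama-3.1-8B", "bf16", 1)
```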
Times are stored in microseconds (`time_us` column); the
simulator multiplies by 1000 and rounds to nanoseconds at load time.
Full schema details are on Output bundle.
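A minimal sketch of that load-time conversion, assuming only that each CSV has integer key columns plus a `time_us` column (the real loader lives in the simulator's `trace_generator`):

```python
import csv

def load_latency_table(path: str) -> dict:
    """Read a profiler CSV into {key-tuple: latency_ns}.
    Mirrors the simulator's unit handling: time_us * 1000, rounded to ns.
    Key columns vary by file (e.g. token count for dense.csv, the 4-tuple
    (pc, kv_pre, n_dec, kv_dec) for attention.csv)."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = tuple(int(v) for k, v in row.items() if k != "time_us")
            table[key] = round(float(row["time_us"]) * 1000)  # us -> ns
    return table
```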
How it fits the bigger picture
The profiler runs in the vLLM Docker container (or on bare metal via
`scripts/install-vllm.sh`). The simulator runs in the simulator
container (`astrasim/tutorial-micro2024`). They share the
`profiler/perf/` directory; that directory is the only thing they exchange.
Bundled profile data
| Hardware | Models profiled | Variants |
|---|---|---|
| RTXPRO6000 | meta-llama/Llama-3.1-8B, Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B-Instruct-2507 | bf16, bf16-kvfp8 |
If your (hardware, model, variant) combo is in this table, you can
skip the profiler entirely.
Prerequisites
- vLLM Docker container running at `/workspace` (mounts the repo root). See Installation → vLLM setup.
- NVIDIA GPU (only for the profiler; the simulator runs on CPU).
- `HF_TOKEN` environment variable for gated model configs (Llama 3.x, etc.). Set this in `scripts/docker-vllm.sh` before launching.
- A few GB of GPU memory for the model variant you're profiling (TP=1 needs the full model; TP=N needs `model_size / N`), as estimated in the sketch after this list.
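A back-of-the-envelope sketch of that per-GPU weight footprint, assuming bf16 weights (2 bytes/parameter) and even sharding across TP ranks; it ignores KV cache, activations, and CUDA context overhead:

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: int = 2,
                      tp: int = 1) -> float:
    """Rough per-GPU weight footprint under tensor parallelism:
    TP=1 holds the full model, TP=N holds ~model_size / N."""
    return params_billion * bytes_per_param / tp

# Llama-3.1-8B in bf16: ~16 GB at TP=1, ~8 GB per GPU at TP=2
print(per_gpu_weight_gb(8.0, tp=1), per_gpu_weight_gb(8.0, tp=2))
```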
Where to go next
- **Running**: Edit `profile.sh`, pick options for your sweep, and hit go.
- **Output bundle**: Schema reference for every CSV the profiler emits.
- **Skew & alpha fit**: How the heterogeneous-decode correction is profiled and fitted.
- **Adding new hardware**: GPU (vLLM-supported) or non-GPU (TPU, custom accelerator).
- **Adding a model architecture**: When to write a new architecture YAML, and what to put in it.