vLLM setup
This step is optional. You only need the vLLM environment if you plan to:
- Profile a new GPU or model (python -m profiler)
- Benchmark real vLLM end-to-end (python -m bench run)
- Validate the simulator against vLLM (python -m bench validate)
- Generate ShareGPT-style workload datasets (python -m workloads.generators)
If you only want to run pre-profiled simulations on the bundled hardware (RTXPRO6000), skip this page.
Choose an install method
- Docker (recommended)
- Bare metal (uv venv)
The Docker path uses the official vllm/vllm-openai:v0.19.0 image,
which already contains vLLM, PyTorch, and all CUDA dependencies.
1. (Optional) Set HF_TOKEN
Some model configs (Llama 3.x, gated Qwen variants) need a Hugging Face token to auto-fetch:
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"
Skip this if you only profile open models or use locally-stored configs.
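To confirm a token is actually picked up, one quick check (a sketch; it assumes huggingface_hub is available, which it is wherever vLLM is installed) is:
python -c "from huggingface_hub import whoami; print(whoami()['name'])"
This reads HF_TOKEN from the environment and prints the account the token belongs to.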
2. Launch the vLLM container
From the repo root:
./scripts/docker-vllm.sh
This:
- Requests all GPUs via --gpus all
- Forwards HF_TOKEN from your shell
- Mounts the repo root to /workspace (so python -m profiler, python -m bench, and python -m workloads.generators all work)
- Mounts ~/.cache/huggingface to share the model cache with the host
- Sets --shm-size=16g (vLLM needs this for inter-process tensors)
- Pre-installs datasets and matplotlib (extra deps the profiler and bench plots need)
- Drops you into bash at /workspace
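For reference, the script is roughly equivalent to the docker run below. This is a sketch assembled from the list above, not the script's exact contents (the datasets/matplotlib pre-install and any other details are omitted; check the script for the real flags):
docker run -it --name vllm_docker \
  --gpus all \
  --shm-size=16g \
  -e HF_TOKEN \
  -v "$(pwd)":/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /workspace \
  --entrypoint bash \
  vllm/vllm-openai:v0.19.0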
The container is named vllm_docker. Re-attach later with:
docker start -ai vllm_docker
3. Verify the install
Inside the container:
python -c "import vllm; print(vllm.__version__)"
nvidia-smi
You should see vLLM 0.19.0 and your GPU listed.
For environments without Docker, install vLLM into a local uv venv.
1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
2. Run the installer
From the repo root:
./scripts/install-vllm.sh
This:
- Creates a local uv venv with Python 3.12
- Installs vllm==0.19.0 (with VLLM_USE_PRECOMPILED=1 for the prebuilt CUDA wheels)
- Adds datasets and matplotlib for the workload generator and bench plots
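The script boils down to something like the following (a sketch under the assumptions on this page; the real script may differ):
# create the venv and install vLLM with the prebuilt CUDA kernels
uv venv --python 3.12
source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install vllm==0.19.0
uv pip install datasets matplotlib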
Activate the venv before running profiler / bench:
source .venv/bin/activate
3. Verify the install
python -c "import vllm; print(vllm.__version__)"
nvidia-smi
You should see vLLM 0.19.0 and your GPU.
Bare-metal caveats
CUDA driver mismatch is the most common failure mode. vLLM 0.19.0
needs a CUDA 13.x-compatible driver. Verify with nvidia-smi.
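To print just the driver version and GPU name (the CUDA version the driver supports appears in the header of plain nvidia-smi):
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader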
The VLLM_USE_PRECOMPILED=1 flag tells the vLLM build to use the prebuilt
CUDA kernels instead of compiling them from source. If your CUDA version
doesn't match the prebuilt wheel, drop the flag and accept the longer build.
Outside Docker you're responsible for setting HF_TOKEN, the
~/.cache/huggingface model cache, and the shared-memory size yourself.
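A minimal sketch of the equivalents you'd handle by hand (HF_HOME is shown only for completeness; it already defaults to ~/.cache/huggingface):
# gated models need the token in the environment
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"
# model cache location; the default matches what the Docker script mounts
export HF_HOME=~/.cache/huggingface
# check that shared memory is large enough (the Docker script uses --shm-size=16g)
df -h /dev/shm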
What's next
- Profiler guide: capture per-layer CUDA kernel timings into the per-category CSV bundle the simulator consumes.
- Bench guide (bench/ on GitHub): run vLLM end-to-end and validate the simulator against ground truth (TTFT / TPOT / throughput / running-waiting plots).