vLLM setup

This step is optional. You only need the vLLM environment if you plan to:

  • Profile a new GPU or model (python -m profiler)
  • Benchmark real vLLM end-to-end (python -m bench run)
  • Validate the simulator against vLLM (python -m bench validate)
  • Generate ShareGPT-style workload datasets (python -m workloads.generators)

If you only want to run pre-profiled simulations on the bundled hardware (RTXPRO6000), skip this page.

Choose an install method

The Docker path uses the official vllm/vllm-openai:v0.19.0 image, which already contains vLLM, PyTorch, and all CUDA dependencies.

1. (Optional) Set HF_TOKEN

Some model configs (Llama 3.x, gated Qwen variants) need a Hugging Face token before they can be fetched automatically:

export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"

Skip this if you only profile open models or use locally-stored configs.
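
To confirm the token is picked up, you can query Hugging Face with the huggingface-cli tool. This assumes the huggingface_hub package is installed where you run it (it ships in the vLLM image as a dependency, so it also works inside the container):

# Should print your Hugging Face username if HF_TOKEN is valid.
huggingface-cli whoami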

2. Launch the vLLM container

From the repo root:

./scripts/docker-vllm.sh

This script (see the docker run sketch after this list):

  • Requests all GPUs via --gpus all
  • Forwards HF_TOKEN from your shell
  • Mounts the repo root to /workspace (so python -m profiler, python -m bench, python -m workloads.generators all work)
  • Mounts ~/.cache/huggingface to share the model cache with the host
  • Sets --shm-size=16g (vLLM needs the shared memory for inter-process tensor transfers)
  • Pre-installs datasets and matplotlib (extra deps the profiler and bench plots need)
  • Drops you into bash at /workspace
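
For reference, here is a rough shell equivalent of what the script does, reconstructed from the list above. Treat scripts/docker-vllm.sh as authoritative; details such as the in-container cache path (/root/.cache/huggingface) are assumptions:

# Approximate reconstruction of scripts/docker-vllm.sh, not a verbatim copy.
docker run -it --name vllm_docker \
  --gpus all \
  --shm-size=16g \
  -e HF_TOKEN \
  -v "$(pwd)":/workspace \
  -v "$HOME/.cache/huggingface":/root/.cache/huggingface \
  -w /workspace \
  --entrypoint bash \
  vllm/vllm-openai:v0.19.0 \
  -c 'pip install datasets matplotlib && exec bash'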

The container is named vllm_docker. Re-attach later with:

docker start -ai vllm_docker
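
Standard Docker lifecycle commands apply when you are done with it:

docker stop vllm_docker    # stop the container but keep it for later re-attach
docker rm -f vllm_docker   # remove it entirely so the launch script starts fresh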

3. Verify the install

Inside the container:

python -c "import vllm; print(vllm.__version__)"
nvidia-smi

You should see vLLM 0.19.0 and your GPU listed.
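
As an extra sanity check, you can confirm CUDA is visible to PyTorch (which ships in the image as a vLLM dependency):

# Should print True and the number of GPUs granted to the container.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"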

What's next

  • Profiler guide: capture per-layer CUDA kernel timings into the per-category CSV bundle the simulator consumes.
  • bench/ on GitHub: run vLLM end-to-end and validate the simulator against ground truth (TTFT / TPOT / throughput / running-waiting plots).