Skip to main content

Troubleshooting

Common errors during install and first run, with the quickest fix.

If your issue isn't here, please file a bug at github.com/casys-kaist/LLMServingSim/issues with the full command, the error output, and your OS / Docker / GPU versions.

Submodules are missing

Symptom: Build fails with errors about missing files under astra-sim/extern/graph_frontend/chakra/ or astra-sim/build/.

Cause: You cloned without --recurse-submodules.

Fix:

git submodule update --init --recursive

Then re-run ./scripts/compile.sh.

docker: permission denied

Symptom:

docker: Got permission denied while trying to connect to the
Docker daemon socket

Cause: Your user isn't in the docker group.

Fix:

sudo usermod -aG docker $USER
newgrp docker
# or log out and back in

GPU not detected

Symptom: Inside the vLLM container, nvidia-smi says command not found or no devices found.

Cause: NVIDIA Container Toolkit isn't installed or Docker isn't configured to use it.

Fix: install / re-configure the toolkit (see Prerequisites) and restart Docker:

sudo systemctl restart docker

Then verify:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If the host's nvidia-smi works but the container's doesn't, the toolkit is the problem. If the host's nvidia-smi fails too, install the NVIDIA driver first.

Hugging Face: gated model / 401 / 403

Symptom: When profiling a Llama 3.x or gated Qwen variant:

huggingface_hub.utils._errors.GatedRepoError: Access to model
meta-llama/Llama-3.1-8B is restricted...

Fix:

  1. Accept the license on the model page (one-time, on huggingface.co).

  2. Set HF_TOKEN in your shell before launching the vLLM container:

    export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"
    ./scripts/docker-vllm.sh

The token gets forwarded into the container automatically. Confirm with echo $HF_TOKEN inside the container.

ASTRA-Sim build fails

Symptom: ./scripts/compile.sh errors out partway through, often with a CMake or compiler message.

Common causes & fixes:

  • Missing build deps inside the container. The official astrasim/tutorial-micro2024 image has them by default. If you customized the image, ensure cmake, g++, protobuf-compiler, libprotobuf-dev, and libboost-dev are installed.

  • Stale build state. Wipe the build directories and retry:

    rm -rf astra-sim/build/astra_analytical/build/
    ./scripts/compile.sh
  • Outside the container. compile.sh is meant to run inside the simulator container, not on the host. Use ./scripts/docker-sim.sh first.

Container name already in use

Symptom:

docker: Error response from daemon: Conflict. The container name
"/servingsim_docker" is already in use by container "abc123..."

Cause: A previous run left the container around.

Fix: either re-attach or remove and recreate.

# re-attach to existing
docker start -ai servingsim_docker

# or wipe and recreate
docker rm -f servingsim_docker
./scripts/docker-sim.sh

Same idea for vllm_docker.

Missing profile data

Symptom: Running the simulator with a hardware / model combination that doesn't have profile data:

FileNotFoundError: ../profiler/perf/<hardware>/<model>/<variant>/tp1/dense.csv

Cause: The (hardware, model, dtype, kv_cache_dtype) tuple doesn't have a profiled CSV bundle.

Fix: either

--max-num-batched-tokens warning at startup

Symptom:

WARNING: runtime --max-num-batched-tokens (4096) exceeds profiled
sweep bound (2048). Lookups will extrapolate.

Cause: You're running the simulator with a token budget larger than the one the profiler swept. Latency lookups will linearly extrapolate past the measured range.

Fix:

  • For best accuracy, re-profile at the higher --max-num-batched-tokens (MAX_NUM_BATCHED_TOKENS=4096 ./profiler/profile.sh).
  • Or stay at the profiled bound. Extrapolation is usually fine for small overshoots; large ones can drift.

Simulator stuck / very slow on big workloads

Symptom: Simulation runs but takes much longer than expected, especially with MoE + EP or large prefix caches.

Common causes & fixes:

  • Block-copy disabled. For MoE, set --expert-routing-policy COPY (the default). RR and RAND are slower because they touch ASTRA-Sim per token rather than per block.
  • Verbose logging. --log-level DEBUG writes a lot. Drop to --log-level INFO or WARNING.
  • --log-interval too small. Setting it to 0.1 makes the logger run every 100 ms; raise to 1.0 (default) or higher.

Out of memory inside the vLLM container

Symptom: Profiler crashes with CUDA OOM partway through the attention sweep.

Fix: lower MAX_NUM_BATCHED_TOKENS in profiler/profile.sh, or skip the heavy categories with environment variables (see Profiler → Running).

Still stuck?

When you file a bug, please include:

  1. The exact command you ran
  2. The full error output
  3. Your OS, Docker version, NVIDIA driver, GPU model
  4. Whether you're inside the simulator container or the vLLM container (or bare metal)