All notable changes to this project are documented in this file. This project follows Keep a Changelog conventions.

[v1.1.0] - 2026-04-26

Added

  • New vLLM-based layerwise profiler (profiler/) replacing the old llm_profile/ module. Uses vLLM's built-in layerwise_profile() via a worker extension class to capture per-layer CUDA kernel timings from real vLLM execution paths. Architecture is dispatched by the HF config's model_type against YAML catalogs under profiler/models/, and each run emits a per-category CSV bundle (dense.csv, per_sequence.csv, attention.csv, and moe.csv for MoE) under perf/<hw>/<model>/<variant>/tp<N>/, with latencies in microseconds. The base layerwise-profile methodology — driving a real vLLM engine via a worker extension class and emulating TP=N on a single GPU by sharding hf_overrides — is adapted from @waneon.
  • Unified 4D attention profiling (attention.csv) replacing the earlier prefill/decode-separated scheme with a single table over prefill_chunk × kv_prefill × n_decode × kv_decode that matches what vLLM's chunked-prefill scheduler actually produces each step. Geometric axes controlled by ATTENTION_CHUNK_FACTOR / ATTENTION_KV_FACTOR (default 2.0 = doubling) trade grid density against profiling time
  • Skew profiling + 5-axis alpha fit for heterogeneous-decode attention (profiler/core/skew.py, fit_alpha.py). The sweep fires bimodal decode batches and measures (t_mean, t_max, t_skew) per case; fit_alpha then groups rows by a 5-axis key pc | n_label | skew_rate_label | kv_big_label | kp_label and runs weighted least-squares per cell. At query time the simulator blends two uniform-attention lookups via the fitted alpha to recover the FlashAttention tile-padding / SM-imbalance penalty the uniform grid can't see (serving/core/trace_generator.py _lookup_attention_with_skew / _skew_alpha; see the blend sketch after this list). Axis ablation on the widened ~13k-sample dataset picked the 5-axis scheme over the earlier 3-axis fit (test p50/p90 ≈ 2.7% / 14.8% vs 3.5% / 16.4% on TP=1)
  • Data-derived bucket axes for the skew fit. n and kp buckets are one per unique profiled value (+ kp=0 sentinel + overflow); kv_big uses log-4x bins adapted to the observed max; skew_rate is a fixed normalised [0, 1] scheme; pc is keyed raw. Derived axes are written to meta.yaml::skew_fit.bucket_axes and the simulator reads them from there, so widening MAX_NUM_SEQS or ATTENTION_MAX_KV lights up finer resolution without any simulator code change
  • Per-axis skew density knobs: SKEW_N_FACTOR / SKEW_PC_FACTOR / SKEW_KP_FACTOR / SKEW_KVS_FACTOR (CLI: --skew-*-factor, default 2.0 = doubling). Crank higher to coarsen a given axis and cut profile time; effective values land in meta.yaml::skew_profile.factors
  • Per-TP skew_fit.csv that spills the full per-bucket alpha table out of meta.yaml, keeping the latter readable (~100 lines vs ~3100 lines for Qwen3-32B at 2 TPs). meta.yaml::skew_fit.per_tp[tp].bucket_table points at tp<N>/skew_fit.csv; the simulator hydrates it back into alpha_by_bucket on _load_perf_db()
  • Compact attention_grid / skew_profile grid specs in meta.yaml (e.g. "0, 16-2048 x2" instead of the full value list; see the expansion sketch after this list)
  • RTXPRO6000 (NVIDIA RTX PRO 6000 Blackwell) hardware support: 96 GB, 1597 GB/s, 600W TDP
  • DP+EP (Data Parallel + Expert Parallel) support with ASTRA-Sim ALLTOALL synchronization via involved_dim dimension scoping. Instances with the same dp_group share a single ASTRA-Sim process; the 2D topology [tp_size, dp_group_size] enables per-dimension collective routing (ALLREDUCE on TP dim, ALLTOALL on DP dim)
  • Wave synchronization for DP groups: Python-side dp_pending barrier ensures all instances schedule before trace generation. ALLTOALL comm_size synchronized to max(total_len) across the group. Dummy batches keep idle instances participating in ALLTOALL sync
  • single_node_moe_dp_ep_instance.json cluster config for MoE with DP+EP (2 instances, TP=1, EP=2, same DP group)
  • Agentic session support for closed-loop workloads (e.g., SWE-bench). The new JSONL format uses sub_requests arrays with tool_duration_ns to model dependency chains where each LLM call waits for the previous one to complete plus tool execution time. The router dynamically releases sub-requests as their predecessors finish, enabling accurate simulation of multi-step agentic workflows (see the example record after this list)
  • --num-reqs CLI argument (replaces --num-req); default changed from 100 to 0 (load all entries from the dataset). For agentic datasets it counts sessions, not sub-requests
  • Example SWE-bench agentic dataset (workloads/swe-bench-qwen3-30b-a3b-50-sps0.2.jsonl)
  • Qwen3-32B and Qwen3-30B-A3B-Instruct-2507 model configs with explicit head_dim support for models where head_dim != hidden_size // num_attention_heads
  • FP8 KV cache simulation support (--kv-cache-dtype fp8): selects profile_fp8.csv for compute latency lookup and halves KV cache memory usage in the memory model (see the footprint sketch after this list)
  • FP8 KV cache profiling support (kv_cache_dtype: "fp8" in receipts, outputs profile_fp8.csv)
  • Chunked prefill support (enabled by default, matching vLLM v1) with --long-prefill-token-threshold for per-request token cap per step (chunked prefill core by @HyunsuYEE)
  • Chunked prefill compatible with prefix caching (RadixAttention)
  • Prefix cache lock tracking (_prefix_locked) to prevent incorrect eviction during multi-chunk prefill
  • Non-Docker vLLM installer (scripts/install-vllm.sh) using uv with precompiled vLLM 0.19.0 wheels (@junwha)
  • End-to-end vLLM benchmark + simulator validation suite (bench/, invoked as python -m bench {run,validate}). bench run replays a workload through a real vLLM AsyncLLM engine with output_toks pinned via SamplingParams(min_tokens=N, max_tokens=N, ignore_eos=True) (see the sampling sketch after this list) so results are bit-for-bit comparable to the simulator's view of the same dataset. A custom vllm.v1.metrics.loggers.StatLoggerBase writes per-tick scheduler / iteration stats; RequestStateStats from vllm.v1.metrics.stats lands in requests.jsonl. bench validate loads a finished run plus the simulator's sim.csv / sim.log and emits throughput, running/waiting, and TTFT/TPOT/latency-CDF plots plus a numeric diff% summary
  • Workload generators (workloads/generators/, invoked as python -m workloads.generators sharegpt …). Multi-turn ShareGPT parser with running context accumulation; default source shibing624/sharegpt_gpt4. Runs in tokenizer-only mode by default (output IDs from the assistant turn) or with --use-vllm to drive an offline batched vllm.LLM for free-generated outputs at maximum throughput. Optional --fix-len (random fixed-length tokens) and --pulse (bursty arrivals) modes
  • Per-model invocation templates under workloads/examples/ (gen-llama-3.1-8b.sh, gen-qwen3-30b-a3b.sh, gen-qwen3-32b.sh)
  • Module READMEs for bench/, scripts/ (top-level wrappers for the vLLM and simulator container launchers, the bare-metal vLLM installer, and the ASTRA-Sim build)
  • Rich-backed logger shared between simulator, profiler, and bench (serving/core/logger.py, profiler/core/logger.py, bench/core/logger.py). Keeps the original [HH:MM:SS.mmm] [Component] [node=X,inst=Y] LEVEL msg line shape via a custom _RichSimHandler (public API unchanged — configure_logger / get_logger / the ComponentLoggerAdapter still work for every existing call site) and adds:
    • .success() (green ✓ at INFO) and .summary() (verbatim, no prefix) on the adapter, plus module-level print_banner() / print_input_config() / print_markup() / print_rule() and stage(title) / progress(label, total) context managers mirroring the profiler's helpers.
    • Rich theme + soft_wrap=True so colour renders in interactive terminals, long lines stay on one logical row, and redirected files (> out.log, nohup …) get clean plain-text logs with no stray ANSI escape bytes. FORCE_COLOR=1 still forces colour when an IDE terminal doesn't self-identify as a TTY.
    • Banner / logo / input-config / simulation-results blocks in serving/__main__.py migrated to the new helpers (with bench/__main__.py using the same banner / stage / progress conventions); heartbeat status tree (├─ / └─) now builds each line as a string and emits via Rich markup for consistent colouring.
    • RadixCache.format_prefix_info(), Scheduler.print_result(), and PowerModel.print_power_summary() rewritten around the new helpers. serving/utils.py loses its ANSI colour wrappers (cyan / bold / ANSI_* / …) and the logo / input-config renderers now live in logger.py
  • READMEs for configs/model/, configs/pim/, workloads/, serving/
  • .gitignore entries for AI agent cache files (.claude/, .cursor/, .copilot/, .codex/, .aider*, .continue/)
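
The skew-aware lookup above blends two uniform-grid lookups. A minimal sketch of that blend in Python, with hypothetical helper names (only the blend itself follows the description above):

    from typing import Callable, Sequence

    def lookup_attention_with_skew(
        lookup_uniform: Callable[[float], float],  # uniform-grid attention lookup keyed by KV length
        kv_lens: Sequence[int],                    # per-request decode KV lengths in the batch
        alpha: float,                              # fitted per-bucket blend weight
    ) -> float:
        t_mean = lookup_uniform(sum(kv_lens) / len(kv_lens))  # as if every request had the mean KV length
        t_max = lookup_uniform(max(kv_lens))                  # as if every request had the max KV length
        # alpha = 0 reproduces the plain uniform lookup; alpha = 1 assumes the
        # step is fully bound by the longest sequence (tile padding / SM imbalance)
        return (1.0 - alpha) * t_mean + alpha * t_max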
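
The compact grid notation expands mechanically. A sketch of a plausible expander for the "lo-hi xfactor" form (the profiler's real parser may differ in detail; geometric ranges are assumed to start above zero):

    def expand_grid_spec(spec: str) -> list[int]:
        values: list[int] = []
        for part in spec.split(","):
            part = part.strip()
            if "-" in part:                       # geometric range, e.g. "16-2048 x2"
                rng, _, factor = part.partition("x")
                lo, hi = (int(v) for v in rng.strip().split("-"))
                step = float(factor) if factor else 2.0
                v = float(lo)
                while v <= hi:
                    values.append(int(v))
                    v *= step
            else:                                 # bare value, e.g. "0"
                values.append(int(part))
        return values

    expand_grid_spec("0, 16-2048 x2")  # [0, 16, 32, 64, 128, 256, 512, 1024, 2048]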
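
An illustrative agentic record, wrapped here for readability (JSONL stores one object per line). sub_requests and tool_duration_ns are the documented fields; the token-count field names are assumptions:

    {"sub_requests": [
      {"input_toks": 1024, "output_toks": 128, "tool_duration_ns": 0},
      {"input_toks": 1536, "output_toks": 96, "tool_duration_ns": 2500000000}
    ]}

Here the second LLM call is released only once the first finishes plus 2.5 s of tool execution.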
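
The fp8 halving of KV memory follows from element width alone. A rough per-token footprint with illustrative parameter names (the real accounting lives in memory_model.py):

    def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                           kv_cache_dtype: str = "auto") -> int:
        elem_bytes = 1 if kv_cache_dtype == "fp8" else 2  # fp8 vs fp16/bf16
        return 2 * num_layers * num_kv_heads * head_dim * elem_bytes  # K and V planes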
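
Pinning output lengths in bench run uses stock vLLM sampling parameters; a minimal sketch:

    from vllm import SamplingParams

    def pinned_params(output_toks: int) -> SamplingParams:
        # Force exactly output_toks generated tokens so a real engine run is
        # directly comparable to the simulator's view of the same request
        return SamplingParams(
            min_tokens=output_toks,  # never stop earlier
            max_tokens=output_toks,  # never run longer
            ignore_eos=True,         # an early EOS must not cut generation short
        )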

Fixed

  • Skew sweep feasibility filter rejected cases with n_reqs >= max_num_seqs, dropping every n = MSQ case (including the pure-decode corner the attention sweep already allowed). Relaxed to n_reqs > max_num_seqs to match the attention sweep and unlock pure n = MSQ shots. Mixed-regime n = MSQ (which requires MSQ+1 requests) is still filtered out; profile with MAX_NUM_SEQS set one above the runtime MSQ to cover that corner too
  • Missing prefix_match call on non-chunked prefill path: prefix cache hits were not detected for full prefill requests, preventing prefix caching benefits when chunked prefill was disabled (@junwha)
  • Typo in timer reference in legacy Mixtral profiler model (@junwha)
  • Prompt throughput now includes prefix cache hit tokens. Previously only prefill tokens that were actually computed were counted, making throughput appear lower than vLLM's reported prompt throughput when prefix caching was active
  • Prefix cache is_init was never cleared on full prefix cache hits, causing total_requested_tokens to inflate on every decode step and lock_ref to leak
  • Prefix cache lock_prefix not called for full prefix hits, causing memory leaks at simulation end
  • MoE expert latency aggregated both EP ranks onto one GPU (2x overestimate); now each GPU uses only its own rank's tokens and activated experts
  • MoE weight calculation in memory_model.py now uses ep_size (not tp_size) for expert weight sharding
  • Status print timing: prints only on the starting NPU to avoid transient "0 running" states
  • system.json collective implementations now match topology dimensions (2 entries for 2D topologies; see the fragment after this list) — previously a single entry caused ASTRA-Sim to create only 1 dimension
  • DP group termination: instances wait for all DP members to finish before marking done
  • argparse allow_abbrev=False to prevent silent prefix matching of wrong arguments
  • Added missing return parser.parse_args() in legacy profiler layers/main.py (reported and fixed by @junwha, @gleb-kun)
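
For reference on the collective-implementation fix: with a 2D [tp_size, dp_group_size] topology, every implementation list in system.json needs one entry per dimension, e.g. (implementation choices illustrative):

    "all-reduce-implementation": ["ring", "ring"],
    "all-to-all-implementation": ["ring", "ring"]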

Changed

  • --fp flag replaced with --dtype (vLLM-style: float16, bfloat16, float32, int8)
  • --gen flag replaced with --skip-prefill for clarity
  • --request-routing-policy default changed from RR to LOAD (vLLM-style weighted least-loaded). Requests are now routed in real time based on current system state instead of upfront assignment
  • --expert-routing-policy FAST renamed to COPY for clarity (enables block copy)
  • Cluster config: npu_num/npu_group replaced with tp_size/pp_size/ep_size/dp_group. Partial configs are supported (e.g., num_npus=4 with tp_size=2 infers pp_size=2; see the config fragment after this list). TP and EP share the same GPU set; DP runs via multiple instances with the same dp_group
  • MoE modeling: per-EP-rank latency lookup (key_0=local_tokens, key_1=activated_experts), even expert-to-rank partitioning, ASTRA-Sim ALLTOALL with involved_dim for cross-DP sync
  • MoE calculate_sizes: uses moe_intermediate_size (per-expert FFN dim) separate from intermediate_size (dense FFN dim)
  • calculate_sizes parameter renamed: tp → parallel (generic for TP or EP)
  • Trace comm_type now supports dimension scoping: ALLREDUCE:1,0, ALLTOALL:0,1
  • Network topology for DP groups: npus_count: [tp_size, dp_group_size] with per-dimension collective implementations in system.json
  • Removed analytical ALLTOALL workaround functions (_inflate_comm_size, _ring_alltoall_time_ns, _bw_gb_to_bpns) — replaced by native ASTRA-Sim ALLTOALL
  • link_bw/link_latency removed from TraceCtx and generate_trace (no longer needed for analytical fallback)
  • Latency lookup extrapolates beyond the profiled range instead of clamping, improving accuracy at large batch sizes (see the lookup sketch after this list)
  • Profiler rewritten from PyTorch Profiler + scikit-learn predictor to direct vLLM layerwise_profile() approach. Architecture yamls live in profiler/models/ keyed on the HF config's model_type; CLI flags match vLLM (--dtype, --kv-cache-dtype, --max-num-batched-tokens, --max-num-seqs, --tp, --variant). Docker pinned to vLLM v0.19.0 (vllm/vllm-openai:v0.19.0 or v0.19.0-cu130 for CUDA 13.x)
  • Old profiler preserved under profiler/v0/ for reference
  • Layer names unified between profiler and simulator: qkv_projection, o_projection, ffn1, ffn2, attention, layernorm (old names removed)
  • memory_model.py updated to use explicit head_dim and q_dim/kv_dim for correct tensor size computation on models like Qwen3
  • trace_generator.py rewritten with composable helpers (TraceCtx, BatchCtx, _emit_layer, _emit_pre_attn_layers, _emit_post_attn_layers) and unified profile CSV lookup with 2D bilinear interpolation
  • Sampler output location changed to REMOTE (was on lm_head) to match Chakra converter's MEM_STORE node placement
  • Removed --enable-attn-prediction flag (scikit-learn predictor replaced by direct profiled latency lookup)
  • Cluster configs updated to RTXPRO6000 hardware specs
  • AGENTS.md expanded with full repo structure, simulation flow, trace format documentation, and additional pitfalls
  • --max-batch renamed to --max-num-seqs (default: 128, matching vLLM); now limits total running requests across inflight batches
  • --enable-chunked-prefill now enabled by default (matching vLLM v1); use --no-enable-chunked-prefill to disable
  • --enable-prefix-caching now enabled by default (matching vLLM v1); use --no-enable-prefix-caching to disable
  • Scheduler rewritten to use vLLM-style token-budget-based allocation for both chunked and non-chunked prefill paths (schedule_base, schedule_with_prefix)
  • KV cache block allocation uses vLLM-style cumulative ceiling division (see the cumulative-ceiling sketch after this list)
  • Radix tree cache_unfinished_req now uses num_computed_tokens instead of req.input, enabling correct incremental caching across chunks
  • Prefix cache memory accounting changed to free-before-allocate order
  • Hash-to-length map in memory_model.py changed from {hash: tlen} to {hash: [tlen, refcount]} to handle duplicate block hashes
  • All Request attributes now properly initialized in __init__; removed getattr fallbacks throughout scheduler and radix tree
  • Directory restructuring:
    • cluster_config/ → configs/cluster/
    • model_config/ → configs/model/
    • pim_config/ → configs/pim/
    • dataset/ → workloads/ (the directory holds ShareGPT-style request workloads consumed by the simulator and bench)
    • output/ → outputs/
    • script/ → scripts/
    • llm_profile/ → profiler/legacy_profiler/ (later moved to profiler/v0/)
  • Top-level package layout finalized as Python-style sibling modules:
    • inference_serving/ → serving/ with internals under serving/core/ (every .py previously at the package root now lives one directory deeper); entrypoint main.py becomes serving/__main__.py and is invoked as python -m serving ….
    • llm_profiler/ → profiler/ (collapses the duplicated llm_profiler/profiler/ package layer) with internals under profiler/core/ and profiler/core/hooks/.
    • bench/ added with the same shape (bench/core/).
    • workloads/ ships the ShareGPT generator under workloads/generators/sharegpt.py (invoked as python -m workloads.generators sharegpt …) with per-model invocation templates under workloads/examples/. The package deliberately avoids the name datasets/ so the HuggingFace datasets library imports cleanly.
    • Module-specific shell scripts live at the module home (e.g. profiler/profile.sh, bench/bench.sh, serving/run.sh); only cross-cutting environment / build helpers stay in scripts/ (docker-vllm.sh, docker-sim.sh, install-vllm.sh, compile.sh).
  • Evaluation configs moved from config/ to configs/ subdirectories within each figure folder
  • run.sh updated with reorganized examples and commented out unavailable MoE config
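
An illustrative fragment of the new cluster-config parallelism fields (the exact schema may differ): with num_npus = 4 and tp_size = 2 below, pp_size is inferred as 4 / 2 = 2, and instances sharing a dp_group synchronize via ALLTOALL:

    {
      "num_npus": 4,
      "tp_size": 2,
      "ep_size": 2,
      "dp_group": 0
    }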
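
A minimal sketch of the unified profile lookup, assuming a dense 2D table (names are illustrative, not the actual trace_generator.py internals). Clamping the bracketing interval at the grid edge is what turns the old clamping behaviour into linear extrapolation:

    from bisect import bisect_left

    def lookup_2d(xs, ys, table, x, y):
        # xs, ys: sorted profiled axis values; table[i][j]: latency at (xs[i], ys[j])
        def bracket(axis, v):
            i = min(max(bisect_left(axis, v), 1), len(axis) - 1)
            return i - 1, i  # out-of-range v keeps the edge interval → extrapolation
        (i0, i1), (j0, j1) = bracket(xs, x), bracket(ys, y)
        tx = (x - xs[i0]) / (xs[i1] - xs[i0])  # falls outside [0, 1] off-grid
        ty = (y - ys[j0]) / (ys[j1] - ys[j0])
        lo = table[i0][j0] * (1 - tx) + table[i1][j0] * tx
        hi = table[i0][j1] * (1 - tx) + table[i1][j1] * tx
        return lo * (1 - ty) + hi * ty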
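
The cumulative ceiling division for block allocation, sketched with a hypothetical helper name: the blocks charged for a step are the difference of cumulative ceilings, so a partially filled last block is never double-counted across chunks.

    def new_blocks_needed(num_computed_tokens: int, num_new_tokens: int, block_size: int) -> int:
        before = -(-num_computed_tokens // block_size)                    # ceil(computed / block_size)
        after = -(-(num_computed_tokens + num_new_tokens) // block_size)  # ceil((computed + new) / block_size)
        return after - before

    new_blocks_needed(20, 10, 16)  # returns 0: the 10 new tokens fit in the partially filled second block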

Removed

  • internal/ directory (debug docs and scheduler tests moved or removed)
  • scripts/ batch experiment scripts (superseded by run.sh examples)
  • evaluation/ directory (preserved on ispass26-artifact branch)
  • --enable-attn-prediction flag and scikit-learn attention predictor
  • --fp flag (replaced by --dtype)
  • --gen flag (replaced by --skip-prefill)
  • --expert-routing-policy FAST (renamed to COPY)
  • serving/attn_utils.py (stale scikit-learn attention feature helper)
  • npu_num/npu_group config fields (replaced by tp_size/pp_size/ep_size/dp_group)
  • --num-req flag (replaced by --num-reqs)
  • Analytical ALLTOALL workaround functions (_inflate_comm_size, _ring_alltoall_time_ns)

[v1.0.0] - 2026-02-25

Added

  • Multi-instance simulation with configurable request routing policies (Round Robin, Random, Custom)
  • Prefill/Decode (P/D) disaggregation support across instances
  • Mixture of Experts (MoE) support with expert parallelism, expert offloading, and configurable routing policies (Round Robin, Random, Fast, Custom)
  • Prefix caching using RadixAttention (based on SGLang), with support for second-tier prefix cache pooling across CPU and CXL memory (--enable-prefix-caching, --enable-prefix-sharing)
  • Sub-batch interleaving to overlap prefill and decode phases within an iteration (--enable-sub-batch-interleaving)
  • Attention latency predictor using scikit-learn for real-time per-request estimation (--enable-attn-prediction)
  • Power and energy modeling per node covering NPU, CPU, DRAM, interconnect, NIC, and storage
  • CXL memory expansion support with configurable bandwidth and latency
  • Enhanced PIM (Processing-In-Memory) model with per-device INI configuration (configs/pim/)
  • Cluster-level configuration system (configs/cluster/*.json) that consolidates all hardware, topology, and placement parameters into a single file
  • Per-layer weight, KV cache, and expert placement rules in cluster config
  • Additional latency metrics: ITL (Inter-Token Latency) and p99 for TTFT, TPOT, ITL
  • Hardware performance profiles for TPU-v6e-1
  • Batch experiment scripts for systematic evaluation (scripts/)
  • Artifact evaluation scripts and reference results (evaluation/)
  • llm_profile integrated as a local module with support for MoE models and power profiling

Changed

  • All hardware and topology parameters are now specified via cluster_config JSON files; per-invocation hardware arguments (--model_name, --hardware, --npu_num, etc.) are removed
  • Command-line argument style changed from underscore to hyphen (e.g., --cluster-config, --num-req, --block-size)
  • Dataset format changed from .tsv to .jsonl
  • Build process consolidated into ./compile.sh and ./docker.sh
  • Performance model directory relocated from perf_model/ to llm_profile/perf_models/
  • serving/ modules renamed for clarity:
    • control.py → controller.py
    • generate_graph.py → graph_generator.py
    • generate_trace.py → trace_generator.py
    • config_generator.py → config_builder.py
    • pim.py → pim_model.py

Fixed

  • Incorrect evict_size accumulation

Removed

  • trace_test/ directory (superseded by evaluation/ scripts)
  • Direct per-invocation hardware arguments (--model_name, --hardware, --npu_num, --npu_group, --npu_mem, --remote_bw, --link_bw)

[v0.2.1] - 2025-07-18

Added

  • llm_profile module with PyTorch Profiler for GPU layer and attention latency measurement
  • Llama-3.1-8B-Instruct model support (replaces GPT-3 6.7B as the default model)
  • Hugging Face model configuration support for easy addition of new models

Changed

  • Function names standardized to snake_case (e.g., createNetworkConfig → create_network_config, calculateSizes → calculate_sizes)
  • Model configuration files updated to Llama-3.1-8B-Instruct format

Fixed

  • Collective operation stall caused by unresolved dependencies in the ASTRA-Sim workload graph
  • Network dimension calculation for full pipeline parallelism (npus_per_dim formula corrected)

[v0.2.0] - 2025-06-04

Changed

  • ASTRA-Sim submodule updated to latest version (branch v0.2.0)
  • Chakra updated to latest version
  • Network configuration format changed from JSON to YAML
  • local_bw and remote_bw parameters replaced with link_latency
  • Conda environment dependencies updated and simplified

[v0.1.0] - 2025-01-03

Added

  • GPU performance model based on TensorRT-LLM profiling (replaces NPU simulator)
  • Auto config generator for network and memory configurations
  • New parameters: --hardware, --local_bw, --remote_bw, --link_bw, --fp
  • Additional metrics: queuing_delay, TTFT, TPOT
  • Verbose logging option for detailed execution output

Changed

  • ASTRA-Sim submodule branch updated from artifact to v0.1.0
  • Output format changed from TSV to CSV

Removed

  • Polymath and codelets_src submodules (NPU simulator components replaced by performance model)

[artifact] - 2024-06-23

Added

  • Initial project release as IISWC 2024 artifact: "LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale"
  • NPU simulator-based co-simulation infrastructure (ASTRA-Sim + Polymath + codelets_src)
  • Evaluation scripts and benchmark results
  • Conda environment configuration (environment.yml)