All notable changes to this project are documented in this file. This project follows Keep a Changelog conventions.
## [v1.1.0] - 2026-04-26
### Added
- New vLLM-based layerwise profiler (`profiler/`) replacing the old `llm_profile/` module. Uses vLLM's built-in `layerwise_profile()` via a worker extension class to capture per-layer CUDA kernel timings from real vLLM execution paths. Architecture is dispatched by the HF config's `model_type` against YAML catalogs under `profiler/models/`, and each run emits a per-category CSV bundle (`dense.csv`, `per_sequence.csv`, `attention.csv`, and `moe.csv` for MoE) under `perf/<hw>/<model>/<variant>/tp<N>/`, with latencies in microseconds. The base layerwise-profile methodology — driving a real vLLM engine via a worker extension class and emulating TP=N on a single GPU by sharding `hf_overrides` — is adapted from @waneon.
- Unified 4D attention profiling (`attention.csv`) replacing the earlier prefill/decode-separated scheme with a single table over `prefill_chunk × kv_prefill × n_decode × kv_decode` that matches what vLLM's chunked-prefill scheduler actually produces each step. Geometric axes with `ATTENTION_CHUNK_FACTOR`/`ATTENTION_KV_FACTOR` (default 2.0 = doubling) trade grid density against profile time.
- Skew profiling + 5-axis alpha fit for heterogeneous-decode attention (`profiler/core/skew.py`, `fit_alpha.py`). The sweep fires bimodal decode batches and measures `(t_mean, t_max, t_skew)` per case; `fit_alpha` then groups rows by a 5-axis key `pc | n_label | skew_rate_label | kv_big_label | kp_label` and runs weighted least-squares per cell. At query time the simulator blends two uniform-attention lookups via the fitted alpha to recover the FlashAttention tile-padding / SM-imbalance penalty the uniform grid can't see (`serving/core/trace_generator.py` `_lookup_attention_with_skew` / `_skew_alpha`); see the sketch below. Axis ablation on the widened ~13k-sample dataset picked the 5-axis scheme over the earlier 3-axis fit (test p50/p90 ≈ 2.7% / 14.8% vs 3.5% / 16.4% on TP=1).
- Data-derived bucket axes for the skew fit. `n` and `kp` buckets are one per unique profiled value (plus a `kp=0` sentinel and an overflow bucket); `kv_big` uses log-4x bins adapted to the observed max; `skew_rate` is a fixed normalised [0, 1] scheme; `pc` is keyed raw. Derived axes are written to `meta.yaml::skew_fit.bucket_axes` and the simulator reads them from there, so widening `MAX_NUM_SEQS` or `ATTENTION_MAX_KV` lights up finer resolution without any simulator code change.
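A minimal sketch of how the fitted alpha could be applied at query time. The blend form, bucket derivation, and helper names below are illustrative assumptions; only `_lookup_attention_with_skew`, `_skew_alpha`, the 5-axis bucket scheme, and the per-bucket alpha table are named in this release.

```python
# Hypothetical sketch: blend two uniform-attention lookups with a fitted alpha.
# `lookup_uniform` stands in for the simulator's profiled-grid interpolation.
from bisect import bisect_right

def skew_alpha(alpha_by_bucket, bucket_axes, pc, n, skew_rate, kv_big, kp):
    """Map a decode batch onto the 5-axis bucket key and return its fitted alpha."""
    key = (
        pc,                                                 # raw-keyed axis
        bisect_right(bucket_axes["n"], n),                  # one bucket per profiled value
        bisect_right(bucket_axes["skew_rate"], skew_rate),  # fixed [0, 1] scheme
        bisect_right(bucket_axes["kv_big"], kv_big),        # log-4x bins
        bisect_right(bucket_axes["kp"], kp),                # kp=0 sentinel + overflow
    )
    return alpha_by_bucket.get(key, 0.0)  # fall back to the mean-only estimate

def lookup_attention_with_skew(lookup_uniform, alpha, kv_mean, kv_max, n_decode):
    """Blend a mean-KV and a max-KV uniform lookup to approximate a skewed batch."""
    t_mean = lookup_uniform(n_decode=n_decode, kv_decode=kv_mean)
    t_max = lookup_uniform(n_decode=n_decode, kv_decode=kv_max)
    return (1.0 - alpha) * t_mean + alpha * t_max
```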
- Per-axis skew density knobs: `SKEW_N_FACTOR`/`SKEW_PC_FACTOR`/`SKEW_KP_FACTOR`/`SKEW_KVS_FACTOR` (CLI: `--skew-*-factor`, default 2.0 = doubling). Crank higher to coarsen a given axis and cut profile time; effective values land in `meta.yaml::skew_profile.factors`.
- Per-TP `skew_fit.csv` file spills the full per-bucket alpha table out of `meta.yaml` so the latter stays readable (~100 lines vs ~3100 lines for Qwen3-32B at 2 TPs). `meta.yaml::skew_fit.per_tp[tp].bucket_table` points at `tp<N>/skew_fit.csv`; the simulator hydrates it back into `alpha_by_bucket` on `_load_perf_db()`.
- Compact `attention_grid`/`skew_profile` grid specs in `meta.yaml` (e.g. `"0, 16-2048 x2"` instead of the full value list); see the sketch below.
- RTXPRO6000 (NVIDIA RTX PRO 6000 Blackwell) hardware support: 96 GB, 1597 GB/s, 600 W TDP.
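A sketch of how a geometric profiling axis and its compact spec relate. The helpers are illustrative; only the factor knobs (default 2.0 = doubling) and the `"0, 16-2048 x2"` notation come from the entries above.

```python
def geometric_axis(start: int, stop: int, factor: float = 2.0, include_zero: bool = True):
    """Build a profiling axis like 0, 16, 32, ..., 2048 for factor 2.0 (doubling)."""
    values = [0] if include_zero else []
    v = start
    while v <= stop:
        values.append(v)
        v = int(round(v * factor))
    return values

def compact_spec(start: int, stop: int, factor: float = 2.0, include_zero: bool = True) -> str:
    """Render the axis in the compact meta.yaml form, e.g. "0, 16-2048 x2"."""
    prefix = "0, " if include_zero else ""
    return f"{prefix}{start}-{stop} x{factor:g}"

# geometric_axis(16, 2048) -> [0, 16, 32, 64, 128, 256, 512, 1024, 2048]
# compact_spec(16, 2048)   -> "0, 16-2048 x2"
```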
- DP+EP (Data Parallel + Expert Parallel) support with ASTRA-Sim ALLTOALL synchronization via `involved_dim` dimension scoping. Instances with the same `dp_group` share a single ASTRA-Sim process; the 2D topology `[tp_size, dp_group_size]` enables per-dimension collective routing (ALLREDUCE on the TP dim, ALLTOALL on the DP dim).
- Wave synchronization for DP groups: a Python-side `dp_pending` barrier ensures all instances schedule before trace generation. ALLTOALL `comm_size` is synchronized to `max(total_len)` across the group, and dummy batches keep idle instances participating in the ALLTOALL sync.
- `single_node_moe_dp_ep_instance.json` cluster config for MoE with DP+EP (2 instances, TP=1, EP=2, same DP group); see the sketch below.
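A hypothetical cluster-config fragment in the spirit of `single_node_moe_dp_ep_instance.json`, shown as Python data for illustration. Field names beyond `tp_size`/`pp_size`/`ep_size`/`dp_group` and the derived `npus_count` are assumptions, not the shipped schema.

```python
# Illustrative only: two TP=1 / EP=2 instances in the same DP group, and the
# 2D ASTRA-Sim topology the simulator would derive from them.
cluster = {
    "instances": [                       # assumed key; the real schema may differ
        {"tp_size": 1, "pp_size": 1, "ep_size": 2, "dp_group": 0},
        {"tp_size": 1, "pp_size": 1, "ep_size": 2, "dp_group": 0},
    ],
}

dp_group_size = sum(1 for inst in cluster["instances"] if inst["dp_group"] == 0)
tp_size = cluster["instances"][0]["tp_size"]

# Per the entries above: ALLREDUCE is routed on the TP dimension,
# ALLTOALL on the DP dimension of the 2D topology.
network_topology = {"npus_count": [tp_size, dp_group_size]}   # -> [1, 2]
collective_routing = {"ALLREDUCE": "TP dim", "ALLTOALL": "DP dim"}
```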
- Agentic session support for closed-loop workloads (e.g., SWE-bench). The new JSONL format uses `sub_requests` arrays with `tool_duration_ns` to model dependency chains where each LLM call waits for the previous one to complete plus tool execution time. The router dynamically releases sub-requests as their predecessors finish, enabling accurate simulation of multi-step agentic workflows (see the sketch below).
- `--num-reqs` CLI argument (replaces `--num-req`); default changed from 100 to 0 (load all entries from the dataset). For agentic datasets it counts sessions, not sub-requests.
- Example SWE-bench agentic dataset (`workloads/swe-bench-qwen3-30b-a3b-50-sps0.2.jsonl`).
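A hypothetical JSONL session entry illustrating the dependency-chain idea, shown as Python data. Only `sub_requests`, `tool_duration_ns`, and `output_toks` are named in this release; the remaining field names are assumptions.

```python
# Illustrative session entry: three chained LLM calls, each released only after
# the previous call finishes plus its tool_duration_ns of tool execution.
session = {
    "session_id": "swe-bench-0001",          # assumed field name
    "arrival_time_ns": 0,                    # assumed field name
    "sub_requests": [
        {"input_toks": 812,  "output_toks": 96,  "tool_duration_ns": 0},
        {"input_toks": 1510, "output_toks": 128, "tool_duration_ns": 4_200_000_000},
        {"input_toks": 2304, "output_toks": 64,  "tool_duration_ns": 1_100_000_000},
    ],
}
```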
- Qwen3-32B and Qwen3-30B-A3B-Instruct-2507 model configs with explicit `head_dim` support for models where `head_dim != hidden_size // num_attention_heads` (see the sketch below).
- FP8 KV cache simulation support (`--kv-cache-dtype fp8`): selects `profile_fp8.csv` for compute latency lookup and halves KV cache memory usage in the memory model.
- FP8 KV cache profiling support (`kv_cache_dtype: "fp8"` in receipts, outputs `profile_fp8.csv`).
- Chunked prefill support (enabled by default, matching vLLM v1) with `--long-prefill-token-threshold` as a per-request token cap per step (chunked prefill core by @HyunsuYEE).
- Chunked prefill compatible with prefix caching (RadixAttention).
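A short sketch of the tensor-dimension relations behind the explicit `head_dim` support (standard HF config semantics; the helper name is illustrative, and `q_dim`/`kv_dim` are the quantities named in the Changed section below):

```python
def attention_dims(cfg):
    """Derive projection widths from an HF-style config, honouring an explicit head_dim."""
    # For models like Qwen3, head_dim is set explicitly and need not equal
    # hidden_size // num_attention_heads, so it must not be re-derived.
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
    q_dim = cfg.num_attention_heads * head_dim       # query projection width
    kv_dim = cfg.num_key_value_heads * head_dim      # key/value projection width (GQA)
    return head_dim, q_dim, kv_dim
```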
- Prefix cache lock tracking (`_prefix_locked`) to prevent incorrect eviction during multi-chunk prefill.
- Non-Docker vLLM installer (`scripts/install-vllm.sh`) using `uv` with precompiled vLLM 0.19.0 wheels (@junwha).
- End-to-end vLLM benchmark + simulator validation suite (`bench/`, invoked as `python -m bench {run,validate}`). `bench run` replays a workload through a real vLLM `AsyncLLM` engine with `output_toks` pinned via `SamplingParams(min_tokens=N, max_tokens=N, ignore_eos=True)` so results are bit-for-bit comparable to the simulator's view of the same dataset (see the sketch below). A custom `vllm.v1.metrics.loggers.StatLoggerBase` writes per-tick scheduler / iteration stats; `RequestStateStats` from `vllm.v1.metrics.stats` lands in `requests.jsonl`. `bench validate` loads a finished run plus the simulator's `sim.csv`/`sim.log` and emits throughput, running/waiting, and TTFT/TPOT/latency-CDF plots plus a numeric diff% summary.
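A minimal sketch of the output-length pinning used by `bench run`. `SamplingParams` with `min_tokens`, `max_tokens`, and `ignore_eos` is the standard vLLM API named above; the deterministic-temperature choice is an added assumption.

```python
from vllm import SamplingParams

def pinned_sampling_params(output_toks: int) -> SamplingParams:
    """Force exactly `output_toks` generated tokens so a real vLLM run matches
    the simulator's view of the same dataset entry."""
    return SamplingParams(
        min_tokens=output_toks,   # never stop early
        max_tokens=output_toks,   # never run long
        ignore_eos=True,          # EOS must not terminate generation prematurely
        temperature=0.0,          # assumption: deterministic decoding for reproducibility
    )
```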
- Workload generators (`workloads/generators/`, invoked as `python -m workloads.generators sharegpt …`). Multi-turn ShareGPT parser with running context accumulation; default source `shibing624/sharegpt_gpt4`. Runs in tokenizer-only mode by default (output IDs taken from the assistant turn) or with `--use-vllm` to drive an offline batched `vllm.LLM` for free-generated outputs at maximum throughput. Optional `--fix-len` (random fixed-length tokens) and `--pulse` (bursty arrivals) modes.
- Per-model invocation templates under `workloads/examples/` (`gen-llama-3.1-8b.sh`, `gen-qwen3-30b-a3b.sh`, `gen-qwen3-32b.sh`).
- Module READMEs for `bench/` and `scripts/` (top-level wrappers for the vLLM and simulator container launchers, the bare-metal vLLM installer, and the ASTRA-Sim build).
- Rich-backed logger shared between simulator, profiler, and bench (`serving/core/logger.py`, `profiler/core/logger.py`, `bench/core/logger.py`). Keeps the original `[HH:MM:SS.mmm] [Component] [node=X,inst=Y] LEVEL msg` line shape via a custom `_RichSimHandler` (public API unchanged — `configure_logger`/`get_logger`/the `ComponentLoggerAdapter` still work for every existing call site) and adds `.success()` (green ✓ at INFO) and `.summary()` (verbatim, no prefix) on the adapter, plus module-level `print_banner()`/`print_input_config()`/`print_markup()`/`print_rule()` and `stage(title)`/`progress(label, total)` context managers mirroring the profiler's helpers (see the usage sketch below).
- Rich theme + `soft_wrap=True` so colour renders in interactive terminals, long lines stay on one logical row, and redirected files (`> out.log`, `nohup …`) get clean plain-text logs with no stray ANSI escape bytes. `FORCE_COLOR=1` still forces colour when an IDE terminal doesn't self-identify as a TTY.
- Banner / logo / input-config / simulation-results blocks in `serving/__main__.py` migrated to the new helpers (with `bench/__main__.py` using the same banner / stage / progress conventions); the heartbeat status tree (├─/└─) now builds each line as a string and emits it via Rich markup for consistent colouring. `RadixCache.format_prefix_info()`, `Scheduler.print_result()`, and `PowerModel.print_power_summary()` rewritten around the new helpers. `serving/utils.py` loses its ANSI colour wrappers (`cyan`/`bold`/`ANSI_*`/…) and the logo / input-config renderers now live in `logger.py`.
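A usage sketch of the shared logger API. The function and method names come from the logger entry above; the exact signatures, the import path style, and the `progress` handle are assumptions.

```python
# Illustrative call sites only; the real modules live at serving/core/logger.py,
# profiler/core/logger.py, and bench/core/logger.py.
from serving.core import logger

log = logger.get_logger("Scheduler")   # returns the ComponentLoggerAdapter described above
logger.print_banner()
logger.print_rule()

with logger.stage("Generating traces"):                      # stage(title) context manager
    with logger.progress("instances", total=4) as advance:   # assumed handle name
        for _ in range(4):
            advance()                                         # assumed advance call

log.info("scheduled 12 requests")
log.success("trace generation complete")   # green check mark at INFO level
log.summary("p50 TTFT: 212 ms")            # emitted verbatim, no prefix
```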
- READMEs for `configs/model/`, `configs/pim/`, `workloads/`, `serving/`.
- `.gitignore` entries for AI agent cache files (`.claude/`, `.cursor/`, `.copilot/`, `.codex/`, `.aider*`, `.continue/`).
### Fixed
- Skew sweep feasibility filter used a strict `n_reqs >= max_num_seqs` check and dropped every `n = MSQ` case (including the pure-decode corner the attention sweep was already allowing). Relaxed to `>` to match the attention sweep and unlock pure `n = MSQ` shots. Mixed-regime `n = MSQ` (which requires MSQ+1 requests) is still filtered; profile with `MAX_NUM_SEQS` one above the runtime MSQ to cover that corner too.
- Missing `prefix_match` call on the non-chunked prefill path: prefix cache hits were not detected for full prefill requests, preventing prefix caching benefits when chunked prefill was disabled (@junwha).
- Typo in a timer reference in the legacy Mixtral profiler model (@junwha).
- Prompt throughput now includes prefix cache hit tokens. Previously only the prefill tokens that were actually computed were counted, making throughput appear lower than vLLM's reported prompt throughput when prefix caching was active.
- Prefix cache `is_init` never cleared for full prefix cache hits, causing `total_requested_tokens` to inflate on every decode step and `lock_ref` leaks.
- Prefix cache `lock_prefix` not called for full prefix hits, causing memory leaks at simulation end.
- MoE expert latency aggregated both EP ranks onto one GPU (a 2x overestimate); now each GPU uses only its own rank's tokens and activated experts (see the sketch below).
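A sketch of the per-rank accounting behind the MoE latency fix. The function and variable names are illustrative; the per-EP-rank lookup keys `local_tokens` and `activated_experts` come from the Changed section below.

```python
def per_rank_moe_key(token_expert_assignment, num_experts, ep_size, rank):
    """Count only the tokens and experts that land on this EP rank.

    token_expert_assignment: expert id chosen for each (token, top-k slot).
    Experts are partitioned evenly across ranks, so expert e lives on rank
    e // (num_experts // ep_size). Summing both ranks onto one GPU, as the old
    code did, roughly doubles the per-GPU latency estimate.
    """
    experts_per_rank = num_experts // ep_size
    local = [e for e in token_expert_assignment if e // experts_per_rank == rank]
    local_tokens = len(local)             # key_0 in the latency table
    activated_experts = len(set(local))   # key_1 in the latency table
    return local_tokens, activated_experts
```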
- MoE weight calculation in `memory_model.py` now uses `ep_size` (not `tp_size`) for expert weight sharding.
- Status print timing: only printed on the start NPU to avoid transient "0 running" states.
- `system.json` collective implementations now match the topology dimensions (2 entries for 2D topologies) — previously a single entry caused ASTRA-Sim to create only 1 dimension.
- DP group termination: instances wait for all DP members to finish before marking themselves done.
- argparse `allow_abbrev=False` to prevent silent prefix matching of wrong arguments.
- Added the missing `return parser.parse_args()` in the legacy profiler's `layers/main.py` (reported and fixed by @junwha, @gleb-kun).
### Changed
- `--fp` flag replaced with `--dtype` (vLLM-style: `float16`, `bfloat16`, `float32`, `int8`).
- `--gen` flag replaced with `--skip-prefill` for clarity.
- `--request-routing-policy` default changed from `RR` to `LOAD` (vLLM-style weighted least-loaded). Requests are now routed in real time based on current system state instead of being assigned upfront.
- `--expert-routing-policy` `FAST` renamed to `COPY` for clarity (enables block copy).
- Cluster config: `npu_num`/`npu_group` replaced with `tp_size`/`pp_size`/`ep_size`/`dp_group`. Partial configs are supported (e.g., `num_npus=4, tp_size=2` infers `pp_size=2`); see the sketch below. TP and EP share the same GPU set; DP is expressed as multiple instances with the same `dp_group`.
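A sketch of the partial-config inference described above. The helper is illustrative; only the field names and the `num_npus=4, tp_size=2 → pp_size=2` example come from this entry.

```python
def infer_parallelism(num_npus: int, tp_size: int | None = None,
                      pp_size: int | None = None) -> tuple[int, int]:
    """Fill in whichever of tp_size / pp_size is missing so that tp * pp == num_npus."""
    if tp_size is None and pp_size is None:
        tp_size, pp_size = num_npus, 1         # assumption: default to pure TP
    elif pp_size is None:
        pp_size = num_npus // tp_size           # e.g. num_npus=4, tp_size=2 -> pp_size=2
    elif tp_size is None:
        tp_size = num_npus // pp_size
    assert tp_size * pp_size == num_npus, "cluster config does not cover all NPUs"
    return tp_size, pp_size
```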
- MoE modeling: per-EP-rank latency lookup (`key_0=local_tokens`, `key_1=activated_experts`), even expert-to-rank partitioning, and ASTRA-Sim ALLTOALL with `involved_dim` for cross-DP sync.
- MoE `calculate_sizes`: uses `moe_intermediate_size` (per-expert FFN dim) separate from `intermediate_size` (dense FFN dim).
- `calculate_sizes` parameter renamed: `tp` → `parallel` (generic for TP or EP).
- Trace `comm_type` now supports dimension scoping: `ALLREDUCE:1,0`, `ALLTOALL:0,1`.
- Network topology for DP groups: `npus_count: [tp_size, dp_group_size]` with per-dimension collective implementations in `system.json`.
- Removed analytical ALLTOALL workaround functions (`_inflate_comm_size`, `_ring_alltoall_time_ns`, `_bw_gb_to_bpns`) — replaced by native ASTRA-Sim ALLTOALL.
- `link_bw`/`link_latency` removed from `TraceCtx` and `generate_trace` (no longer needed for the analytical fallback).
- Latency lookup extrapolates beyond the profiled range instead of clamping, improving accuracy at large batch sizes (see the sketch below).
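A sketch of the grid-lookup behaviour: 2D bilinear interpolation inside the profiled range (as in the unified profile CSV lookup described later in this section) with linear extrapolation, rather than clamping, outside it. The helper is illustrative, not the repository's implementation.

```python
import numpy as np

def lookup_2d(xs, ys, table, x, y):
    """Bilinear interpolation on a profiled latency grid `table` (2D numpy array over
    sorted axes xs, ys), with linear extrapolation beyond the profiled range."""
    def axis_weights(axis, v):
        i = int(np.searchsorted(axis, v)) - 1
        i = min(max(i, 0), len(axis) - 2)            # pick a bracketing segment...
        t = (v - axis[i]) / (axis[i + 1] - axis[i])  # ...and let t leave [0, 1] to extrapolate
        return i, t

    ix, tx = axis_weights(xs, x)
    iy, ty = axis_weights(ys, y)
    f00, f01 = table[ix, iy], table[ix, iy + 1]
    f10, f11 = table[ix + 1, iy], table[ix + 1, iy + 1]
    return (f00 * (1 - tx) * (1 - ty) + f10 * tx * (1 - ty)
            + f01 * (1 - tx) * ty + f11 * tx * ty)
```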
- Profiler rewritten from the PyTorch Profiler + scikit-learn predictor to a direct vLLM `layerwise_profile()` approach. Architecture YAMLs live in `profiler/models/` keyed on the HF config's `model_type`; CLI flags match vLLM (`--dtype`, `--kv-cache-dtype`, `--max-num-batched-tokens`, `--max-num-seqs`, `--tp`, `--variant`). Docker pinned to vLLM v0.19.0 (`vllm/vllm-openai:v0.19.0`, or `v0.19.0-cu130` for CUDA 13.x).
- Old profiler preserved under `profiler/v0/` for reference.
- Layer names unified between profiler and simulator: `qkv_projection`, `o_projection`, `ffn1`, `ffn2`, `attention`, `layernorm` (old names removed).
- `memory_model.py` updated to use an explicit `head_dim` and `q_dim`/`kv_dim` for correct tensor size computation on models like Qwen3.
- `trace_generator.py` rewritten with composable helpers (`TraceCtx`, `BatchCtx`, `_emit_layer`, `_emit_pre_attn_layers`, `_emit_post_attn_layers`) and a unified profile CSV lookup with 2D bilinear interpolation.
- Sampler output location changed to `REMOTE` (was on `lm_head`) to match the Chakra converter's MEM_STORE node placement.
- Removed the `--enable-attn-prediction` flag (scikit-learn predictor replaced by direct profiled-latency lookup).
- Cluster configs updated to RTXPRO6000 hardware specs.
- `AGENTS.md` expanded with the full repo structure, simulation flow, trace format documentation, and additional pitfalls.
- `--max-batch` renamed to `--max-num-seqs` (default: 128, matching vLLM); now limits the total running requests across in-flight batches.
- `--enable-chunked-prefill` now enabled by default (matching vLLM v1); use `--no-enable-chunked-prefill` to disable.
- `--enable-prefix-caching` now enabled by default (matching vLLM v1); use `--no-enable-prefix-caching` to disable.
- Scheduler rewritten to use vLLM-style token-budget-based allocation for both chunked and non-chunked prefill paths (`schedule_base`, `schedule_with_prefix`).
- KV cache block allocation uses vLLM-style cumulative ceiling division (see the sketch at the end of this section).
- Radix tree `cache_unfinished_req` now uses `num_computed_tokens` instead of `req.input`, enabling correct incremental caching across chunks.
- Prefix cache memory accounting changed to a free-before-allocate order.
- Hash-to-length map in `memory_model.py` changed from `{hash: tlen}` to `{hash: [tlen, refcount]}` to handle duplicate block hashes.
- All `Request` attributes are now properly initialized in `__init__`; removed `getattr` fallbacks throughout the scheduler and radix tree.
- Directory restructuring:
  - `cluster_config/` → `configs/cluster/`
  - `model_config/` → `configs/model/`
  - `pim_config/` → `configs/pim/`
  - `dataset/` → `workloads/` (the directory holds ShareGPT-style request workloads consumed by the simulator and bench)
  - `output/` → `outputs/`
  - `script/` → `scripts/`
  - `llm_profile/` → `profiler/legacy_profiler/` (later moved to `profiler/v0/`)
- Top-level package layout finalized as Python-style sibling modules: `inference_serving/` → `serving/` with internals under `serving/core/` (every `.py` previously at the package root now lives one directory deeper); the entrypoint `main.py` becomes `serving/__main__.py` and is invoked as `python -m serving …`. `llm_profiler/` → `profiler/` (collapsing the duplicated `llm_profiler/profiler/` package layer) with internals under `profiler/core/` and `profiler/core/hooks/`. `bench/` is added with the same shape (`bench/core/`). `workloads/` ships the ShareGPT generator under `workloads/generators/sharegpt.py` (invoked as `python -m workloads.generators sharegpt …`) with per-model invocation templates under `workloads/examples/`. The package deliberately avoids the name `datasets/` so the HuggingFace `datasets` library imports cleanly.
- Module-specific shell scripts live at the module home (e.g. `profiler/profile.sh`, `bench/bench.sh`, `serving/run.sh`); only cross-cutting environment / build helpers stay in `scripts/` (`docker-vllm.sh`, `docker-sim.sh`, `install-vllm.sh`, `compile.sh`).
- Evaluation configs moved from `config/` to `configs/` subdirectories within each figure folder.
- `run.sh` updated with reorganized examples and the unavailable MoE config commented out.
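A sketch of cumulative ceiling division for KV block allocation, referenced from the scheduler changes above (the helper itself is illustrative, mirroring the vLLM-style accounting):

```python
def new_blocks_needed(num_computed_tokens: int, num_new_tokens: int,
                      num_allocated_blocks: int, block_size: int = 16) -> int:
    """Blocks to allocate this step: ceiling of the *cumulative* token count,
    minus blocks already held, so partially filled blocks are never re-counted."""
    total_tokens = num_computed_tokens + num_new_tokens
    total_blocks = -(-total_tokens // block_size)   # ceiling division
    return max(0, total_blocks - num_allocated_blocks)

# Example: 20 tokens already computed (2 blocks of 16 held), 10 new tokens arrive:
# ceil(30 / 16) = 2 blocks in total -> 0 new blocks needed this step.
```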
### Removed
- `internal/` directory (debug docs and scheduler tests moved or removed)
- `scripts/` batch experiment scripts (superseded by `run.sh` examples)
- `evaluation/` directory (preserved on the `ispass26-artifact` branch)
- `--enable-attn-prediction` flag and the scikit-learn attention predictor
- `--fp` flag (replaced by `--dtype`)
- `--gen` flag (replaced by `--skip-prefill`)
- `--expert-routing-policy FAST` (renamed to `COPY`)
- `serving/attn_utils.py` (stale scikit-learn attention feature helper)
- `npu_num`/`npu_group` config fields (replaced by `tp_size`/`pp_size`/`ep_size`)
- `--num-req` flag (replaced by `--num-reqs`)
- Analytical ALLTOALL workaround functions (`_inflate_comm_size`, `_ring_alltoall_time_ns`)
## [v1.0.0] - 2026-02-25
### Added
- Multi-instance simulation with configurable request routing policies (Round Robin, Random, Custom)
- Prefill/Decode (P/D) disaggregation support across instances
- Mixture of Experts (MoE) support with expert parallelism, expert offloading, and configurable routing policies (Round Robin, Random, Fast, Custom)
- Prefix caching using RadixAttention (based on SGLang), with support for second-tier prefix cache pooling across CPU and CXL memory (`--enable-prefix-caching`, `--enable-prefix-sharing`)
- Sub-batch interleaving to overlap prefill and decode phases within an iteration (`--enable-sub-batch-interleaving`)
- Attention latency predictor using scikit-learn for real-time per-request estimation (`--enable-attn-prediction`)
- Power and energy modeling per node covering NPU, CPU, DRAM, interconnect, NIC, and storage
- CXL memory expansion support with configurable bandwidth and latency
- Enhanced PIM (Processing-In-Memory) model with per-device INI configuration (`configs/pim/`)
- Cluster-level configuration system (`configs/cluster/*.json`) that consolidates all hardware, topology, and placement parameters into a single file
- Per-layer weight, KV cache, and expert placement rules in the cluster config
- Additional latency metrics: ITL (Inter-Token Latency) and p99 for TTFT, TPOT, ITL
- Hardware performance profiles for TPU-v6e-1
- Batch experiment scripts for systematic evaluation (`scripts/`)
- Artifact evaluation scripts and reference results (`evaluation/`)
- `llm_profile` integrated as a local module with support for MoE models and power profiling
### Changed
- All hardware and topology parameters are now specified via `cluster_config` JSON files; per-invocation hardware arguments (`--model_name`, `--hardware`, `--npu_num`, etc.) are removed
- Command-line argument style changed from underscore to hyphen (e.g., `--cluster-config`, `--num-req`, `--block-size`)
- Dataset format changed from `.tsv` to `.jsonl`
- Build process consolidated into `./compile.sh` and `./docker.sh`
- Performance model directory relocated from `perf_model/` to `llm_profile/perf_models/`
- `serving/` modules renamed for clarity:
  - `control.py` → `controller.py`
  - `generate_graph.py` → `graph_generator.py`
  - `generate_trace.py` → `trace_generator.py`
  - `config_generator.py` → `config_builder.py`
  - `pim.py` → `pim_model.py`
- Fixed incorrect `evict_size` accumulation
### Removed
- `trace_test/` directory (superseded by `evaluation/` scripts)
- Direct per-invocation hardware arguments (`--model_name`, `--hardware`, `--npu_num`, `--npu_group`, `--npu_mem`, `--remote_bw`, `--link_bw`)
## [v0.2.1] - 2025-07-18
### Added
- `llm_profile` module with PyTorch Profiler for GPU layer and attention latency measurement
- Llama-3.1-8B-Instruct model support (replaces GPT-3 6.7B as the default model)
- Hugging Face model configuration support for easy addition of new models
### Changed
- Function names standardized to snake_case (e.g., `createNetworkConfig` → `create_network_config`, `calculateSizes` → `calculate_sizes`)
- Model configuration files updated to the Llama-3.1-8B-Instruct format
### Fixed
- Collective operation stall caused by unresolved dependencies in the ASTRA-Sim workload graph
- Network dimension calculation for full pipeline parallelism (`npus_per_dim` formula corrected)
## [v0.2.0] - 2025-06-04
### Changed
- ASTRA-Sim submodule updated to the latest version (branch `v0.2.0`)
- Chakra updated to the latest version
- Network configuration format changed from JSON to YAML
- `local_bw` and `remote_bw` parameters replaced with `link_latency`
- Conda environment dependencies updated and simplified
## [v0.1.0] - 2025-01-03
### Added
- GPU performance model based on TensorRT-LLM profiling (replaces NPU simulator)
- Auto config generator for network and memory configurations
- New parameters: `--hardware`, `--local_bw`, `--remote_bw`, `--link_bw`, `--fp`
- Additional metrics: `queuing_delay`, TTFT, TPOT
- Verbose logging option for detailed execution output
### Changed
- ASTRA-Sim submodule branch updated from `artifact` to `v0.1.0`
- Output format changed from TSV to CSV
### Removed
- Polymath and codelets_src submodules (NPU simulator components replaced by performance model)
## [artifact] - 2024-06-23
### Added
- Initial project release as IISWC 2024 artifact: "LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale"
- NPU simulator-based co-simulation infrastructure (ASTRA-Sim + Polymath + codelets_src)
- Evaluation scripts and benchmark results
- Conda environment configuration (`environment.yml`)