python -m serving CLI flags
Complete reference for every command-line flag accepted by
python -m serving. For the conceptual side of each flag (what it
does internally), see Simulator.
Cluster topology
| Flag | Type | Default | Description |
|---|---|---|---|
| --cluster-config | path | configs/cluster/single_node_single_instance.json | Path to a cluster-config JSON. See Cluster config |
| --network-backend | choice | analytical | Network simulation backend. analytical (fast) or ns3 (detailed, WIP) |
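
As a usage sketch, the topology flags can be assembled into an argument list and handed to `subprocess.run`. The helper name `build_serving_command` is hypothetical and not part of the simulator; only the flag names, the backend choices, and the default config path come from the table above.

```python
import sys


def build_serving_command(cluster_config=None, network_backend=None):
    """Return an argv list for `python -m serving` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "serving"]
    if cluster_config is not None:
        cmd += ["--cluster-config", cluster_config]
    if network_backend is not None:
        # The table lists exactly these two backend choices.
        if network_backend not in ("analytical", "ns3"):
            raise ValueError(f"unknown network backend: {network_backend}")
        cmd += ["--network-backend", network_backend]
    return cmd


cmd = build_serving_command(
    cluster_config="configs/cluster/single_node_single_instance.json",
    network_backend="analytical",
)
# To launch, pass the list to subprocess.run(cmd, check=True).
```

Omitting both arguments leaves the simulator on its documented defaults.
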
Batching and scheduling
| Flag | Type | Default | Description |
|---|---|---|---|
| --max-num-seqs | int | 128 | Max sequences in a batch. 0 = unlimited |
| --max-num-batched-tokens | int | 2048 | Max tokens per iteration across all requests (token budget) |
| --long-prefill-token-threshold | int | 0 | Per-request token cap per step for chunked prefill. 0 = disabled |
| --enable-chunked-prefill | bool | True | Split long prefill across iterations. Use --no-enable-chunked-prefill to disable |
| --prioritize-prefill | flag | off | Run prefill before decode in the same iteration |
| --block-size | int | 16 | KV cache block size in tokens |
| --skip-prefill | flag | off | Skip prefill, run decode only |
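
The difference between a `bool` flag with a `--no-` form and a plain `flag` can be illustrated with Python's standard `argparse`. This is a minimal sketch that mirrors a few rows of the table. It is an assumption about how such flags might be declared, not the simulator's actual parser, and it needs Python 3.9 or later for `BooleanOptionalAction`.

```python
import argparse

# Hypothetical parser mirroring a few flags from the table above;
# not the simulator's actual argument definitions.
parser = argparse.ArgumentParser(prog="python -m serving")
parser.add_argument("--max-num-seqs", type=int, default=128)
parser.add_argument("--max-num-batched-tokens", type=int, default=2048)
parser.add_argument(
    "--enable-chunked-prefill",
    action=argparse.BooleanOptionalAction,  # also registers --no-enable-chunked-prefill
    default=True,
)
# A plain flag: off unless present on the command line.
parser.add_argument("--prioritize-prefill", action="store_true")

defaults = parser.parse_args([])
custom = parser.parse_args(
    ["--no-enable-chunked-prefill", "--prioritize-prefill", "--max-num-seqs", "0"]
)
```

Here `defaults.enable_chunked_prefill` is `True`, while `custom` has chunked prefill disabled, prefill prioritization on, and an unlimited sequence count.
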
Routing
| Flag | Choices | Default | Description |
|---|---|---|---|
| --request-routing-policy | LOAD / RR / RAND / CUSTOM | LOAD | Cross-instance request routing |
| --expert-routing-policy | BALANCED / RR / RAND / CUSTOM | BALANCED | MoE expert token routing |
| --enable-block-copy | bool | True | Replay one block's trace across layers (set False for per-layer EP variance) |
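
To make the policy names concrete, the toy sketch below implements three of the request-routing choices. The interpretations are assumptions based on the names alone: RR as round-robin, RAND as uniform random, and LOAD as "pick the instance with the fewest pending requests". The function `make_router` is hypothetical and the simulator's real policies may differ.

```python
import itertools
import random


def make_router(policy, num_instances, seed=0):
    """Return a function mapping per-instance load to a chosen instance index."""
    if policy == "RR":
        counter = itertools.cycle(range(num_instances))
        return lambda loads: next(counter)
    if policy == "RAND":
        rng = random.Random(seed)
        return lambda loads: rng.randrange(num_instances)
    if policy == "LOAD":
        # Assumed meaning: choose the instance with the fewest pending requests.
        return lambda loads: min(range(num_instances), key=lambda i: loads[i])
    raise ValueError(f"unsupported policy: {policy}")


rr = make_router("RR", num_instances=3)
rr_picks = [rr([0, 0, 0]) for _ in range(4)]  # cycles through 0, 1, 2, 0
load_pick = make_router("LOAD", num_instances=3)([5, 2, 7])  # least loaded: index 1
```
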
Precision
| Flag | Choices | Default | Description |
|---|---|---|---|
| --dtype | float16 / bfloat16 / float32 / fp8 / int8 | model's torch_dtype, fallback bfloat16 | Model weight dtype |
| --kv-cache-dtype | auto / fp8 | auto (inherits dtype) | KV cache dtype. fp8 halves KV memory and selects a *-kvfp8 profile variant |
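
The "halves KV memory" claim holds relative to a 16-bit model dtype. The arithmetic can be sketched with the usual transformer KV-cache size formula (keys plus values, per layer, per KV head). The function names and the model shape are illustrative assumptions, and the simulator's own memory accounting may include terms not shown here.

```python
# Bytes per element for the dtypes listed in the table above.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1, "int8": 1}


def resolve_kv_dtype(dtype, kv_cache_dtype="auto"):
    """`auto` inherits the model dtype, as the table states."""
    return dtype if kv_cache_dtype == "auto" else kv_cache_dtype


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype, kv_cache_dtype="auto"):
    """Bytes of keys plus values stored for one token across all layers."""
    elem = BYTES_PER_ELEMENT[resolve_kv_dtype(dtype, kv_cache_dtype)]
    return 2 * num_layers * num_kv_heads * head_dim * elem


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
bf16_bytes = kv_bytes_per_token(32, 8, 128, "bfloat16")        # 131072 bytes
fp8_bytes = kv_bytes_per_token(32, 8, 128, "bfloat16", "fp8")  # 65536 bytes
```
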
Prefix caching and offloading
| Flag | Default | Description |
|---|---|---|
| --enable-prefix-caching | True | RadixAttention prefix caching. Use --no-enable-prefix-caching to disable |
| --enable-prefix-sharing | off | Second-tier prefix pool shared across instances within a node |
| --prefix-storage | None | Where the second-tier pool lives. None / CPU / CXL |
| --enable-local-offloading | off | Weight offloading to NPU (counts weight reads in profiling) |
| --enable-attn-offloading | off | Attention computation offloading to PIM |
| --enable-sub-batch-interleaving | off | Overlap GPU compute with PIM attention. Requires --enable-attn-offloading |
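
The dependency between `--enable-sub-batch-interleaving` and `--enable-attn-offloading` is the kind of constraint a wrapper script may want to check before launching a run. The sketch below shows one way to enforce it with `argparse`. `parse_offload_flags` is a hypothetical helper, and whether the simulator itself rejects the invalid combination or silently ignores it is not stated here.

```python
import argparse


def parse_offload_flags(argv):
    """Hypothetical parser enforcing the dependency noted in the table."""
    parser = argparse.ArgumentParser(prog="python -m serving")
    parser.add_argument("--enable-attn-offloading", action="store_true")
    parser.add_argument("--enable-sub-batch-interleaving", action="store_true")
    args = parser.parse_args(argv)
    if args.enable_sub_batch_interleaving and not args.enable_attn_offloading:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--enable-sub-batch-interleaving requires --enable-attn-offloading")
    return args


ok = parse_offload_flags(["--enable-attn-offloading", "--enable-sub-batch-interleaving"])
```

Passing only `--enable-sub-batch-interleaving` makes this sketch exit with a usage error instead of starting a misconfigured run.
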
Dataset and output
| Flag | Type | Default | Description |
|---|---|---|---|
| --dataset | path | None | JSONL workload file. See Workloads → JSONL format |
| --num-reqs | int | 0 | Entries to load from the dataset (0 = all). For agentic workloads, each entry is a session |
| --output | path | None | Per-request CSV output path. Stdout only if None |
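
The `--num-reqs` convention (0 means "load everything") can be sketched as a small JSONL reader. `load_entries` is a hypothetical helper and the `id` field is a placeholder, not the simulator's workload schema; see the JSONL format page for the real fields.

```python
import itertools
import json
import os
import tempfile


def load_entries(path, num_reqs=0):
    """Read up to `num_reqs` JSON objects from a JSONL file; 0 means all."""
    with open(path, encoding="utf-8") as handle:
        lines = (line for line in handle if line.strip())  # skip blank lines
        if num_reqs > 0:
            lines = itertools.islice(lines, num_reqs)
        return [json.loads(line) for line in lines]


# Demonstration on a throwaway three-entry file with placeholder fields.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "workload.jsonl")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    first_two = load_entries(demo_path, num_reqs=2)
    everything = load_entries(demo_path)
```
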
Logging
| Flag | Type | Default | Description |
|---|---|---|---|
| --log-interval | float | 1.0 | Seconds between throughput / memory log lines |
| --log-level | choice | WARNING | WARNING (default) / INFO / DEBUG |
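
The three `--log-level` choices share their names with levels in Python's standard `logging` module, so a mapping like the one below is a plausible reading. This is a sketch under that assumption; the logger name `serving` and the helper `configure_logging` are hypothetical.

```python
import logging


def configure_logging(level_name="WARNING"):
    """Map a --log-level choice onto a standard-library logger."""
    if level_name not in ("WARNING", "INFO", "DEBUG"):
        raise ValueError(f"unsupported log level: {level_name}")
    logger = logging.getLogger("serving")  # hypothetical logger name
    logger.setLevel(getattr(logging, level_name))
    return logger


logger = configure_logging("DEBUG")
logger.debug("visible only at DEBUG")
```
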
Quick reference: which flag for which feature
| Feature | Flag(s) |
|---|---|
| Multi-instance (parallelism via cluster config) | (cluster config num_instances) |
| Tensor parallel | (cluster config tp_size) |
| MoE expert parallel | (cluster config ep_size) |
| DP+EP MoE | (cluster config dp_group) |
| Prefix caching | --enable-prefix-caching (default on), --enable-prefix-sharing, --prefix-storage |
| Chunked prefill | --enable-chunked-prefill (default on), --long-prefill-token-threshold |
| PIM attention offload | --enable-attn-offloading (cluster config sets pim_config) |
| FP8 KV cache | --kv-cache-dtype fp8 |
| ns3 backend | --network-backend ns3 |
For the full conceptual treatment of each feature, browse the
Simulator section. For runnable
examples, see Examples.