python -m serving CLI flags

Complete reference for every command-line flag accepted by `python -m serving`. For the conceptual side of each flag (what it does internally), see Simulator.

Cluster topology

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--cluster-config` | path | `configs/cluster/single_node_single_instance.json` | Path to a cluster-config JSON. See Cluster config. |
| `--network-backend` | choice | `analytical` | Network simulation backend: `analytical` (fast) or `ns3` (detailed, WIP). |

Batching and scheduling

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--max-num-seqs` | int | 128 | Max sequences in a batch. `0` = unlimited. |
| `--max-num-batched-tokens` | int | 2048 | Max tokens per iteration across all requests (token budget). |
| `--long-prefill-token-threshold` | int | 0 | Per-request token cap per step for chunked prefill. `0` = disabled. |
| `--enable-chunked-prefill` | bool | True | Split long prefills across iterations. Use `--no-enable-chunked-prefill` to disable. |
| `--prioritize-prefill` | flag | off | Run prefill before decode in the same iteration. |
| `--block-size` | int | 16 | KV-cache block size in tokens. |
| `--skip-prefill` | flag | off | Skip prefill and run decode only. |
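The tables distinguish `bool` flags, which default on and accept a paired `--no-…` form, from bare `flag` switches, which default off. A minimal argparse sketch of how such a pair behaves (a hypothetical parser for illustration, not the simulator's actual code):

```python
# Sketch of the bool-vs-flag distinction, using argparse's
# BooleanOptionalAction (Python 3.9+). Hypothetical parser, not the
# simulator's own implementation.
import argparse

parser = argparse.ArgumentParser(prog="python -m serving")
# "bool" style: default True, disabled with --no-enable-chunked-prefill
parser.add_argument("--enable-chunked-prefill", default=True,
                    action=argparse.BooleanOptionalAction)
# "flag" style: default off, enabled by naming the flag
parser.add_argument("--prioritize-prefill", action="store_true")
parser.add_argument("--max-num-batched-tokens", type=int, default=2048)

args = parser.parse_args(["--no-enable-chunked-prefill", "--prioritize-prefill"])
print(args.enable_chunked_prefill, args.prioritize_prefill,
      args.max_num_batched_tokens)
# -> False True 2048
```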

Routing

| Flag | Choices | Default | Description |
| --- | --- | --- | --- |
| `--request-routing-policy` | `LOAD` / `RR` / `RAND` / `CUSTOM` | `LOAD` | Cross-instance request routing. |
| `--expert-routing-policy` | `BALANCED` / `RR` / `RAND` / `CUSTOM` | `BALANCED` | MoE expert token routing. |
| `--enable-block-copy` | bool | True | Replay one block's trace across layers (set to False for per-layer EP variance). |

Precision

| Flag | Choices | Default | Description |
| --- | --- | --- | --- |
| `--dtype` | `float16` / `bfloat16` / `float32` / `fp8` / `int8` | model's `torch_dtype` (fallback `bfloat16`) | Model weight dtype. |
| `--kv-cache-dtype` | `auto` / `fp8` | `auto` (inherits `--dtype`) | KV-cache dtype. `fp8` halves KV memory and selects a `*-kvfp8` profile variant. |
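A sketch of the `auto` rule above: the KV-cache dtype inherits `--dtype` unless overridden, and `fp8` halves the per-element KV footprint. The byte sizes and helper function are assumptions for the example, not simulator APIs:

```python
# Illustrative resolution of --kv-cache-dtype "auto" (hypothetical helper).
# Per-element byte widths are standard for these dtypes.
BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1, "int8": 1}

def resolve_kv_dtype(dtype: str, kv_cache_dtype: str) -> str:
    # "auto" means: use whatever --dtype resolved to.
    return dtype if kv_cache_dtype == "auto" else kv_cache_dtype

print(resolve_kv_dtype("bfloat16", "auto"))  # -> bfloat16
print(resolve_kv_dtype("bfloat16", "fp8"))   # -> fp8
print(BYTES["fp8"] / BYTES["bfloat16"])      # -> 0.5 (half the KV memory)
```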

Prefix caching and offloading

| Flag | Default | Description |
| --- | --- | --- |
| `--enable-prefix-caching` | True | RadixAttention prefix caching. Use `--no-enable-prefix-caching` to disable. |
| `--enable-prefix-sharing` | off | Second-tier prefix pool shared across instances within a node. |
| `--prefix-storage` | None | Where the second-tier pool lives: `None` / `CPU` / `CXL`. |
| `--enable-local-offloading` | off | Weight offloading to NPU (counts weight reads in profiling). |
| `--enable-attn-offloading` | off | Attention computation offloading to PIM. |
| `--enable-sub-batch-interleaving` | off | Overlap GPU compute with PIM attention. Requires `--enable-attn-offloading`. |
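Sub-batch interleaving only makes sense when PIM attention offloading is on, so the two flags form a dependency. A hedged sketch of the kind of validation this implies (a hypothetical helper, not the simulator's code):

```python
# Sketch of the flag dependency from the table above:
# --enable-sub-batch-interleaving requires --enable-attn-offloading.
def validate_offloading(enable_attn_offloading: bool,
                        enable_sub_batch_interleaving: bool) -> None:
    if enable_sub_batch_interleaving and not enable_attn_offloading:
        raise ValueError(
            "--enable-sub-batch-interleaving requires --enable-attn-offloading")

validate_offloading(True, True)    # OK: PIM attention + interleaving
validate_offloading(False, False)  # OK: neither enabled
```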

Dataset and output

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--dataset` | path | None | JSONL workload file. See Workloads → JSONL format. |
| `--num-reqs` | int | 0 | Entries to load from the dataset (`0` = all). For agentic workloads, each entry is a session. |
| `--output` | path | None | Per-request CSV output path; stdout only if None. |
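A sketch of the `--num-reqs` semantics: `0` loads every entry, `N` loads the first `N` lines. The entry layout here is deliberately a dummy placeholder; see Workloads → JSONL format for the real schema. The helper is illustrative, not a simulator utility:

```python
# Illustrative --num-reqs behavior over a JSONL stream (hypothetical helper;
# the {"id": ...} entries are dummy data, not the real workload schema).
import io
import json

def load_workload(fp, num_reqs: int = 0) -> list:
    entries = [json.loads(line) for line in fp if line.strip()]
    return entries if num_reqs == 0 else entries[:num_reqs]

sample = io.StringIO('{"id": 0}\n{"id": 1}\n{"id": 2}\n')
print(len(load_workload(sample, num_reqs=0)))  # -> 3 (0 = all entries)
sample.seek(0)
print(len(load_workload(sample, num_reqs=2)))  # -> 2 (first two entries)
```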

Logging

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--log-interval` | float | 1.0 | Seconds between throughput / memory log lines. |
| `--log-level` | choice | `WARNING` | `WARNING` (default), `INFO`, or `DEBUG`. |

Quick reference: which flag for which feature

| Feature | Flag(s) |
| --- | --- |
| Multi-instance (parallelism via cluster config) | cluster config `num_instances` |
| Tensor parallel | cluster config `tp_size` |
| MoE expert parallel | cluster config `ep_size` |
| DP+EP MoE | cluster config `dp_group` |
| Prefix caching | `--enable-prefix-caching` (default on), `--enable-prefix-sharing`, `--prefix-storage` |
| Chunked prefill | `--enable-chunked-prefill` (default on), `--long-prefill-token-threshold` |
| PIM attention offload | `--enable-attn-offloading` (cluster config sets `pim_config`) |
| FP8 KV cache | `--kv-cache-dtype fp8` |
| ns3 backend | `--network-backend ns3` |
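Several features from the quick reference can be combined in one invocation. A sketch that assembles such a command line (the argv-building helper is illustrative, not a simulator utility; the flag names come from the tables above):

```python
# Sketch: build an example python -m serving command line combining
# an FP8 KV cache, the ns3 backend, and a larger token budget.
# The serving_argv helper is hypothetical, for illustration only.
import shlex

def serving_argv(**flags) -> list:
    argv = ["python", "-m", "serving"]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:                 # bare on/off switches take no value
            argv.append(flag)
        else:
            argv.extend([flag, str(value)])
    return argv

argv = serving_argv(kv_cache_dtype="fp8", network_backend="ns3",
                    max_num_batched_tokens=4096)
print(shlex.join(argv))
# -> python -m serving --kv-cache-dtype fp8 --network-backend ns3 --max-num-batched-tokens 4096
```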

For the full conceptual treatment of each feature, browse the Simulator section. For runnable examples, see Examples.