Prefill/decode disaggregation (P/D split)

What this demonstrates: dedicating one instance to prefill and another to decode, with KV-cache hand-off between them.

Prefill is compute-bound and bursty; decode is memory-bandwidth-bound and steady. Mixing them on one instance forces compromises: long prefills stall decode iterations, and small decode batches under-utilize compute. Separating prefill and decode onto dedicated instances lets each be tuned for its own bottleneck, a pattern popularized by DistServe and now standard in production serving.

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for meta-llama/Llama-3.1-8B

Cluster config

configs/cluster/single_node_pd_instance.json:

{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 2,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "prefill",
          "tp_size": 1
        },
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "decode",
          "tp_size": 1
        }
      ]
    }
  ]
}
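As a back-of-envelope check on these numbers: for Llama-3.1-8B (32 layers, 8 KV heads, head dim 128), the fp16 KV cache costs a fixed number of bytes per token, which bounds how many tokens fit in the 96 GB npu_mem budget. A minimal sketch (the architecture constants come from the public Llama-3.1-8B config; the weight-memory estimate is an assumption, and real servers reserve extra headroom):

```python
# Back-of-envelope KV-cache sizing for meta-llama/Llama-3.1-8B in fp16.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

# K and V tensors, per layer, per token:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 131072 B = 128 KiB

block_size = 16                                 # tokens per KV block (--block-size 16)
block_bytes = kv_bytes_per_token * block_size   # 2 MiB per block

npu_mem_gb = 96
weight_gb = 16   # ~8B params * 2 bytes (assumption; ignores activations/overhead)
free_bytes = (npu_mem_gb - weight_gb) * 1024**3

print(f"{kv_bytes_per_token // 1024} KiB/token, "
      f"{free_bytes // block_bytes} blocks, "
      f"{free_bytes // kv_bytes_per_token} tokens")
# -> 128 KiB/token, 40960 blocks, 655360 tokens
```

Roughly 640K tokens of KV capacity per instance under these assumptions, which is why the decode instance (which accumulates KV for every active request) tends to hit memory limits first.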

The single field that matters: pd_type.

  • "prefill": the instance only runs prefill iterations. It receives the request, computes the prompt's KV cache, hands the cache off, and goes back to its prefill queue.
  • "decode": the instance only does decode iterations on already-prefilled requests.
  • null (other examples): combined prefill+decode (the default).

The router automatically sends new requests to a "prefill" instance, then transfers the request to a "decode" instance after the first token. The KV-cache transfer cost is modeled through the inter-link.
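The hand-off cost can be sketched as a toy model; everything below (the class and function names, and the assumption that link_bw is in GB/s and link_latency in ns) is illustrative, not the simulator's actual API:

```python
# Toy model of the KV-cache hand-off between prefill and decode instances.
# Unit assumptions (link_bw in GB/s, link_latency in ns) are guesses about the
# config schema, not confirmed by the simulator.
from dataclasses import dataclass

KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # fp16 KV bytes/token, Llama-3.1-8B

@dataclass
class Link:
    bw_gbps: float     # e.g. "link_bw": 16
    latency_ns: float  # e.g. "link_latency": 20000

    def transfer_s(self, nbytes: int) -> float:
        return self.latency_ns * 1e-9 + nbytes / (self.bw_gbps * 1e9)

def kv_handoff_cost_s(prompt_tokens: int, link: Link) -> float:
    """One-time cost paid between the end of prefill and the first decode step."""
    return link.transfer_s(prompt_tokens * KV_BYTES_PER_TOKEN)

link = Link(bw_gbps=16, latency_ns=20_000)
print(f"{kv_handoff_cost_s(2048, link) * 1e3:.2f} ms")   # ~16.8 ms for a 2K prompt
```

Under these assumptions a 2K-token prompt costs on the order of tens of milliseconds to ship, small relative to the prefill itself but visible as a bump in TTFT.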

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_pd_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/pd_split_run.csv' \
--log-interval 1.0

Expected output

[INFO] step=15 P=8 D=12 prompt_t=3.1k tok/s decode_t=620 tok/s
npu_mem=[55.4 GB, 71.2 GB]
[INFO] step=16 P=10 D=12 prompt_t=3.6k tok/s decode_t=640 tok/s
npu_mem=[55.4 GB, 71.4 GB]

The P= and D= counts are per-role batch sizes. The decode instance's KV cache grows over time as requests transfer in; the prefill instance's KV cache stays bounded by the requests currently being prefilled.

What's interesting

  • TTFT is governed by the prefill instance alone; no decode workload competes with it. It is easy to tune by scaling out prefill replicas independently.
  • TPOT is governed by the decode instance. The cache transfer shows up as a one-time cost between the end of prefill and the first decode step (modeled via link_bw).
  • Memory utilization differs sharply. The prefill instance's KV cache covers only requests currently being prefilled; the decode instance accumulates KV blocks for every active request. Under a long-running workload the decode instance is usually the memory bottleneck.
  • Production parallel. This is the pattern DistServe and Mooncake popularized. The prefill:decode replica ratio is the main tuning knob; try num_instances: 3 with two prefill instances and one decode instance for bursty long-context workloads.
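For that 2:1 experiment, the instances array simply grows. A sketch of the variant (same hardware fields as the config above; only num_instances and the pd_type values change):

```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 3,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "prefill",
          "tp_size": 1
        },
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "prefill",
          "tp_size": 1
        },
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "decode",
          "tp_size": 1
        }
      ]
    }
  ]
}
```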