Prefill/decode disaggregation (P/D split)
What this demonstrates: dedicating one instance to prefill and another to decode, with KV-cache hand-off between them.
Prefill is compute-bound and bursty. Decode is memory-bandwidth-bound and steady. Mixing them on one instance forces compromises: long prefills hold up decode iterations, and small decode batches under-utilize compute. Separating prefill and decode onto different instances lets each be tuned for its own bottleneck, a pattern popularized by DistServe and now standard in production serving.
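To see why the two phases want different tuning, consider arithmetic intensity for a weight-dominated forward step: FLOPs scale with the tokens in flight while weight traffic is fixed. A rough sketch, under stated assumptions (8B parameters, float16, ~2 FLOPs per parameter per token; attention FLOPs and KV-cache reads ignored):

```python
# Rough arithmetic-intensity model for a weight-dominated forward step.
# Assumptions: 8B params, float16, 2 FLOPs per parameter per token;
# ignores attention FLOPs and KV-cache reads.
PARAMS = 8e9
WEIGHT_BYTES = PARAMS * 2  # fp16 weights, ~16 GB read once per step

def flops_per_byte(tokens_in_flight: int) -> float:
    return (2 * PARAMS * tokens_in_flight) / WEIGHT_BYTES

print(flops_per_byte(2048))  # prefill, 2048-token prompt: ~2048 FLOPs/byte
print(flops_per_byte(1))     # decode, batch 1: ~1 FLOP/byte
```

Prefill saturates compute long before it saturates memory bandwidth; decode is the opposite unless decode batches are very large.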
Prerequisites
- Simulator container set up
- Bundled RTXPRO6000 profile for `meta-llama/Llama-3.1-8B`
Cluster config
`configs/cluster/single_node_pd_instance.json`:
```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 2,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "prefill",
          "tp_size": 1
        },
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": "decode",
          "tp_size": 1
        }
      ]
    }
  ]
}
```
The single field that matters: `pd_type`.

- `"prefill"`: the instance only runs prefill iterations. It receives the request, computes the prompt's KV cache, hands the cache off, and goes back to its prefill queue.
- `"decode"`: the instance only runs decode iterations on already-prefilled requests.
- `null` (other examples): combined prefill+decode (the default).
The router automatically sends new requests to a `"prefill"` instance, then transfers the request to a `"decode"` instance after the first token. The KV-cache transfer cost is modeled through the inter-instance link (`link_bw`, `link_latency`).
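For a rough sense of that hand-off cost: Llama-3.1-8B uses GQA with 32 layers, 8 KV heads, and head dimension 128, which works out to 128 KiB of KV cache per token at float16. A back-of-envelope sketch, assuming `link_bw` is in GB/s and `link_latency` in nanoseconds (the units are an assumption; check the simulator's config reference):

```python
# Back-of-envelope KV-cache hand-off cost for Llama-3.1-8B (GQA).
# Assumes link_bw in GB/s and link_latency in ns -- unverified units.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
DTYPE_BYTES = 2  # float16

def kv_bytes(prompt_tokens: int) -> int:
    # K and V: kv_heads * head_dim values each, per layer, per token
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * prompt_tokens

def transfer_ms(prompt_tokens, link_bw_gbps=16, link_latency_ns=20_000):
    return (kv_bytes(prompt_tokens) / (link_bw_gbps * 1e9) * 1e3
            + link_latency_ns * 1e-6)

print(f"{kv_bytes(2048) / 2**20:.0f} MiB, {transfer_ms(2048):.1f} ms")
# -> 256 MiB, 16.8 ms: noticeable, but paid once per request
```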
Run
```bash
python -m serving \
  --cluster-config 'configs/cluster/single_node_pd_instance.json' \
  --dtype float16 --block-size 16 \
  --dataset 'workloads/example_trace.jsonl' \
  --output 'outputs/pd_split_run.csv' \
  --log-interval 1.0
```
Expected output
```
[INFO] step=15 P=8 D=12 prompt_t=3.1k tok/s decode_t=620 tok/s npu_mem=[55.4 GB, 71.2 GB]
[INFO] step=16 P=10 D=12 prompt_t=3.6k tok/s decode_t=640 tok/s npu_mem=[55.4 GB, 71.4 GB]
```
The `P=` and `D=` counts are per-role batch sizes. The decode instance's KV cache grows over time as requests transfer in; the prefill instance's KV cache stays bounded by the requests currently being prefilled.
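To check these numbers beyond the log lines, the per-request CSV can be post-processed. A minimal sketch, assuming per-request latency columns exist (the `ttft` and `tpot` names are guesses, not the simulator's confirmed schema; inspect the header first):

```python
# Hypothetical post-processing of the run's CSV output.
# Column names "ttft"/"tpot" are assumptions, not a documented schema.
import pandas as pd

df = pd.read_csv("outputs/pd_split_run.csv")
print(df.columns.tolist())  # confirm the actual schema first
for col in ("ttft", "tpot"):
    if col in df.columns:
        print(col, df[col].quantile([0.5, 0.9, 0.99]).round(3).to_dict())
```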
What's interesting
- TTFT is dominated by the prefill instance alone; no decode work competes for its compute. It is easy to tune by scaling out prefill replicas independently.
- TPOT is dominated by the decode instance. The cache transfer shows up as a one-time cost between the end of prefill and the first decode step (modeled via `link_bw`).
- Memory utilization differs sharply. The prefill instance's KV cache is "in flight" only; the decode instance accumulates KV blocks for every active request. With a long-running workload, the decode instance is usually the memory bottleneck.
- Production parallel. This is the pattern DistServe and Mooncake popularized. Tuning the prefill:decode replica ratio is the main knob; try `num_instances: 3` with two prefills and one decode for bursty long-context workloads (see the config sketch below).
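A sketch of that 2:1 variant, reusing the exact fields from the config above; only `num_instances` and the `pd_type` assignments change:

```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 3,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {"model_name": "meta-llama/Llama-3.1-8B", "hardware": "RTXPRO6000",
         "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
         "pd_type": "prefill", "tp_size": 1},
        {"model_name": "meta-llama/Llama-3.1-8B", "hardware": "RTXPRO6000",
         "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
         "pd_type": "prefill", "tp_size": 1},
        {"model_name": "meta-llama/Llama-3.1-8B", "hardware": "RTXPRO6000",
         "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
         "pd_type": "decode", "tp_size": 1}
      ]
    }
  ]
}
```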
Related examples
- Multi-instance LOAD routing: the equal-role version. Useful baseline to compare TTFT/TPOT against.
- Cluster config explained: the field-level reference for `pd_type`.