Multi-instance with LOAD routing

What this demonstrates: running two independent serving instances on the same node with vLLM-style load-aware request routing.

This is the simplest "scale out" pattern: replicate the same model across multiple instances and let the router pick the least-loaded one for each new request. It's what real production deployments (vLLM, TGI, SGLang) do under a load balancer.

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for meta-llama/Llama-3.1-8B

Cluster config

configs/cluster/single_node_multi_instance.json:

{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 2,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1
        },
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1
        }
      ]
    }
  ]
}

The two pieces:

  • num_instances: 2 and matching instances array length.
  • Each instance is independent: same model, same hardware, and no dp_group (so they share neither experts nor wave-sync).
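The first bullet's invariant (num_instances must match the length of the instances array) is easy to check before launching a run. A minimal sketch; validate_cluster is a hypothetical helper, not part of the simulator:

```python
import json

def validate_cluster(cfg: dict) -> None:
    # Hypothetical sanity check: each node's declared num_instances
    # must equal the number of entries in its "instances" array.
    for i, node in enumerate(cfg["nodes"]):
        declared = node["num_instances"]
        actual = len(node["instances"])
        if declared != actual:
            raise ValueError(
                f"node {i}: num_instances={declared} but {actual} instances listed"
            )

# Minimal config shaped like the one above (unrelated keys omitted).
cfg = {"nodes": [{"num_instances": 2, "instances": [{}, {}]}]}
validate_cluster(cfg)  # passes silently when the counts agree
```

Running this against a config with a mismatched count raises a ValueError naming the offending node.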

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_multi_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/multi_instance_run.csv' \
--request-routing-policy LOAD \
--log-interval 1.0

--request-routing-policy LOAD is the default; it's passed explicitly here for clarity. Other options:

  • RR: pure round-robin
  • RAND: random pick
  • CUSTOM: pluggable in serving/core/router.py
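The three built-in policies can be sketched in a few lines. This is an illustrative sketch of the selection logic only, not the actual code in serving/core/router.py; the Instance fields and the kv_weight parameter are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    pending_tokens: int = 0   # tokens across running + waiting requests
    kv_bytes: int = 0         # KV-cache bytes currently allocated

def route_load(instances, kv_weight=1e-9):
    # LOAD (sketch): prefer the instance with the fewest pending tokens,
    # using the active KV footprint as a lightweight secondary weight.
    return min(range(len(instances)),
               key=lambda i: (instances[i].pending_tokens
                              + kv_weight * instances[i].kv_bytes))

class RoundRobin:
    """RR (sketch): cycle through instances regardless of load."""
    def __init__(self, n):
        self.n, self.i = n, 0
    def pick(self):
        i, self.i = self.i, (self.i + 1) % self.n
        return i
```

A CUSTOM policy would replace the scoring function with anything that maps the same instance state to an index.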

Expected output

[INFO] step=20 inst0_batch=6 inst1_batch=4 prompt_t=2.4k tok/s decode_t=820 tok/s npu_mem=[63.2 GB, 63.2 GB]
[INFO] step=21 inst0_batch=6 inst1_batch=5 prompt_t=2.5k tok/s decode_t=860 tok/s npu_mem=[63.2 GB, 63.2 GB]

The router fills the lighter-loaded instance first. With the LOAD policy, the choice is weighted by both pending tokens (running + waiting) and the active KV-cache footprint, the same algorithm vLLM uses.

The output CSV contains one row per finished request, as usual; each row carries an instance_id column so you can split results by replica.
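Splitting the output by replica is a one-liner with the standard library. A sketch, assuming only the instance_id column mentioned above; any other column names in your CSV will pass through unchanged:

```python
import csv
from collections import defaultdict

def per_instance_rows(path):
    # Group finished-request rows by their instance_id column so each
    # replica's latency/throughput can be analyzed separately.
    buckets = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            buckets[row["instance_id"]].append(row)
    return buckets
```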

What's interesting

  • Throughput is roughly 2× single-instance for typical workloads, modulo PCIe/link contention on the shared host (modeled via link_bw).
  • Weight memory doubles: the model is fully replicated. There's no free lunch on weight memory; that's the problem TP / EP / DP+EP solve.
  • Per-instance KV cache: prefix caching is per-instance by default, so a request that lands on instance 1 can't reuse a prefix computed on instance 0 unless prefix sharing is enabled (see Prefix caching).
  • Tensor parallel: the alternative way to use 2 GPUs (one bigger instance instead of two replicas).
  • Prefill/decode split: multi-instance but with specialized roles per instance.
  • DP+EP MoE: multi-instance for MoE with cross-instance expert sharing.