# DP+EP MoE

What this demonstrates: stretching expert parallelism across multiple instances using `dp_group`. Two instances form a single 2D ASTRA-Sim topology with TP on one axis and EP on the other.

This is the most interesting topology LLMServingSim can model. Two serving instances each run their own TP group, but they share experts via cross-instance ALLTOALL. The 2D ASTRA-Sim network mesh routes TP-ALLREDUCE on dim 0 and EP-ALLTOALL on dim 1.
## Prerequisites

- Simulator container set up
- Bundled RTXPRO6000 profile for `Qwen/Qwen3-30B-A3B-Instruct-2507`
- An agentic dataset (or any workload with enough requests to keep both instances busy)
## Cluster config

`configs/cluster/single_node_moe_dp_ep_instance.json`:
```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 2,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 1, "tp_size": 1, "ep_size": 2, "dp_group": "A",
          "pd_type": null
        },
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 1, "tp_size": 1, "ep_size": 2, "dp_group": "A",
          "pd_type": null
        }
      ]
    }
  ]
}
```
The two pieces that turn this from "two independent instances" into a DP+EP cluster:

- `dp_group: "A"` on both instances: instances with the same string form one DP group.
- `ep_size: 2` while `tp_size: 1`: EP spans the DP group (`ep_size > tp_size` is only allowed when a `dp_group` is set).
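That constraint can be sketched as a small config check. This is illustrative only (`validate_instance` is a hypothetical helper, not the simulator's actual validator):

```python
# Sketch of the constraint described above: ep_size may exceed tp_size
# only when the instance belongs to a dp_group. Hypothetical helper,
# not LLMServingSim's real config checker.
def validate_instance(inst: dict) -> None:
    if inst["ep_size"] > inst["tp_size"] and inst.get("dp_group") is None:
        raise ValueError("ep_size > tp_size requires a dp_group")

# Passes: EP spans the DP group "A"
validate_instance({"tp_size": 1, "ep_size": 2, "dp_group": "A"})

# Raises: ep_size=2 on a lone instance has nowhere to place the experts
try:
    validate_instance({"tp_size": 1, "ep_size": 2, "dp_group": None})
except ValueError as e:
    print(e)
```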
`config_builder.py` sees the DP group and emits a 2D ASTRA-Sim topology `[tp_size, dp_group_size] = [1, 2]`. Collectives are scoped per dimension via the `involved_dim` BoolList:

- TP-ALLREDUCE: `[True, False]`: dim 0 only (within an instance; a no-op here since `tp_size=1`)
- EP-ALLTOALL: `[False, True]`: dim 1 only (across the two instances)
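The dimension scoping above can be sketched as a tiny lookup. Names here (`involved_dim`, `TOPOLOGY_DIMS`) are illustrative, not the simulator's actual API:

```python
# Sketch: map each collective to the mesh dimensions it spans, mirroring
# the [tp_size, dp_group_size] topology described above.
TOPOLOGY_DIMS = ["tp", "dp_group"]  # [1, 2] in this example

def involved_dim(collective: str) -> list[bool]:
    if collective == "TP-ALLREDUCE":
        return [True, False]   # dim 0 only: stays within one instance
    if collective == "EP-ALLTOALL":
        return [False, True]   # dim 1 only: crosses the two instances
    raise ValueError(f"unknown collective: {collective}")

print(involved_dim("EP-ALLTOALL"))  # [False, True]
```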
## Run

```bash
python -m serving \
  --cluster-config 'configs/cluster/single_node_moe_dp_ep_instance.json' \
  --dtype bfloat16 --block-size 16 \
  --dataset 'workloads/swe-bench-qwen3-30b-a3b-50-sps0.2.jsonl' \
  --output 'outputs/dp_ep_moe_run.csv' \
  --num-req 1
```
Notes on the flags:

- The bundled SWE-bench-style agentic dataset is a good fit because each session has multiple chained sub-requests, which keeps both instances active.
- `--num-req 1` means one session (multiple sub-requests). Bump it up for a longer run.
## Expected output

```
[INFO] step=8 batch=4+4 prompt_t=1.4k tok/s decode_t=520 tok/s
       npu_mem=[81.2 GB, 81.2 GB] alltoall=512 KB
[INFO] step=9 batch=4+4 prompt_t=1.5k tok/s decode_t=540 tok/s
       npu_mem=[81.2 GB, 81.2 GB] alltoall=512 KB
```
The `batch=4+4` notation reflects the per-instance batch (instance 0 + instance 1). The `alltoall` field shows the wave-synchronized ALLTOALL message size, which equals `max(total_len_per_instance) * hidden_size * fp_size`.
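As a quick sanity check of that formula, here is a sketch of the computation. The `hidden_size` and token counts below are illustrative assumptions chosen to reproduce the 512 KB figure in the log, not values read from the profile:

```python
# Sketch: wave-synchronized ALLTOALL message size. Both DP-group members
# use the max token count so the network model sees the heavier side.
def alltoall_bytes(total_len_per_instance, hidden_size, fp_size):
    return max(total_len_per_instance) * hidden_size * fp_size

# e.g. two instances with 128 and 96 tokens in flight, bf16 (2 bytes/elem)
size = alltoall_bytes([128, 96], hidden_size=2048, fp_size=2)
print(size // 1024, "KB")  # 512 KB
```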
## What's interesting

- Expert weights split across instances, not just GPUs. With `ep_size=2` and 128 total experts, each instance holds 64. Per-GPU weight memory is roughly halved compared to single-instance `EP=1`.
- One ASTRA-Sim process for both instances. Wave-synchronized scheduling across the DP group means the simulator generates `.et` files for both instances that share stream IDs on the ALLTOALL, forcing ASTRA-Sim to block until both NPUs reach the collective.
- The idle instance gets a dummy batch. When one instance has no pending work, the scheduler synthesizes a 1-decode-token batch so it can still participate in the wave's ALLTOALL. The same applies when one instance finishes early: it keeps generating dummies until the whole group is done.
- `comm_size` is synchronized. Even if instance A has a much bigger batch than instance B, both pass the same ALLTOALL size (the max of the two) to ASTRA-Sim, so the network model sees the heavier side.
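The dummy-batch behavior can be sketched as a small wave builder. This is a hypothetical illustration of the scheduling rule described above, not the simulator's scheduler:

```python
# Sketch: build one wave across the DP group. An instance with no pending
# work gets a synthetic 1-decode-token batch so it can still join the
# group-wide ALLTOALL. Illustrative only.
def build_wave(pending_per_instance):
    wave = []
    for pending in pending_per_instance:
        if pending:
            wave.append(pending)
        else:
            # dummy batch: one decode token, just enough to participate
            wave.append([{"type": "decode", "tokens": 1, "dummy": True}])
    return wave

# Instance 0 has real work; instance 1 is idle and gets a dummy batch.
wave = build_wave([[{"type": "decode", "tokens": 4, "dummy": False}], []])
print(wave[1][0]["dummy"])  # True
```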
## Related examples

- Expert parallel: the same MoE model on a single instance (EP within one TP group).
- Multi-instance LOAD routing: the non-DP version of multi-instance, independent replicas without expert sharing.
- Cluster config explained: the field-level walkthrough of how `dp_group` lights up the 2D topology.