# DP+EP MoE

What this demonstrates: stretching expert parallelism across multiple instances using `dp_group`. Two instances form a single 2D ASTRA-Sim topology with TP on one axis and EP on the other.

This is the most interesting topology LLMServingSim can model. Two serving instances each run their own TP group, but they share experts via cross-instance ALLTOALL. The 2D ASTRA-Sim network mesh routes TP-ALLREDUCE on dim 0 and EP-ALLTOALL on dim 1.
## Prerequisites

- Simulator container set up
- Bundled RTXPRO6000 profile for `Qwen/Qwen3-30B-A3B-Instruct-2507`
- An agentic dataset (or any workload with enough requests to keep both instances busy)
## Cluster config

`configs/cluster/single_node_moe_dp_ep_instance.json`:
```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 2,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 1, "tp_size": 1, "ep_size": 2, "dp_group": "A",
          "pd_type": null
        },
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 1, "tp_size": 1, "ep_size": 2, "dp_group": "A",
          "pd_type": null
        }
      ]
    }
  ]
}
```
The two pieces that turn this from "two independent instances" into a DP+EP cluster:

- `dp_group: "A"` on both instances: instances with the same string form one DP group.
- `ep_size: 2` while `tp_size: 1`: EP spans the DP group (`ep_size > tp_size` is only allowed when a `dp_group` is set).
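That constraint can be sketched as a small config check. This is illustrative only (`validate_instance` is a hypothetical helper, not the simulator's actual validator):

```python
# Sketch of the constraint described above: ep_size may exceed tp_size
# only when the instance belongs to a dp_group. Hypothetical helper,
# not LLMServingSim's real config checker.
def validate_instance(inst: dict) -> None:
    if inst["ep_size"] > inst["tp_size"] and inst.get("dp_group") is None:
        raise ValueError("ep_size > tp_size requires a dp_group")

# Passes: EP spans the DP group "A"
validate_instance({"tp_size": 1, "ep_size": 2, "dp_group": "A"})

# Raises: ep_size=2 on a lone instance has nowhere to place the experts
try:
    validate_instance({"tp_size": 1, "ep_size": 2, "dp_group": None})
except ValueError as e:
    print(e)
```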
`config_builder.py` sees the DP group and emits a 2D ASTRA-Sim topology `[tp_size, dp_group_size] = [1, 2]`. Collectives are scoped per dimension via the `involved_dim` BoolList:

- TP-ALLREDUCE: `[True, False]`: dim 0 only (within an instance; a no-op here since `tp_size=1`)
- EP-ALLTOALL: `[False, True]`: dim 1 only (across the two instances)
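The dimension scoping above can be sketched as a tiny lookup. Names here (`involved_dim`, `TOPOLOGY_DIMS`) are illustrative, not the simulator's actual API:

```python
# Sketch: map each collective to the mesh dimensions it spans, mirroring
# the [tp_size, dp_group_size] topology described above.
TOPOLOGY_DIMS = ["tp", "dp_group"]  # [1, 2] in this example

def involved_dim(collective: str) -> list[bool]:
    if collective == "TP-ALLREDUCE":
        return [True, False]   # dim 0 only: stays within one instance
    if collective == "EP-ALLTOALL":
        return [False, True]   # dim 1 only: crosses the two instances
    raise ValueError(f"unknown collective: {collective}")

print(involved_dim("EP-ALLTOALL"))  # [False, True]
```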
## Run

```bash
python -m serving \
  --cluster-config 'configs/cluster/single_node_moe_dp_ep_instance.json' \
  --dtype bfloat16 --block-size 16 \
  --dataset 'workloads/swe-bench-qwen3-30b-a3b-50-sps0.2.jsonl' \
  --output 'outputs/dp_ep_moe_run.csv' \
  --num-req 1
```
Notes on the flags:

- The bundled SWE-bench-style agentic dataset is a good fit because each session has multiple chained sub-requests, which keeps both instances active.
- `--num-req 1` means one session (multiple sub-requests). Bump it up for a longer run.
## Expected output

```
[INFO] step=8 batch=4+4 prompt_t=1.4k tok/s decode_t=520 tok/s
       npu_mem=[81.2 GB, 81.2 GB] alltoall=512 KB
[INFO] step=9 batch=4+4 prompt_t=1.5k tok/s decode_t=540 tok/s
       npu_mem=[81.2 GB, 81.2 GB] alltoall=512 KB
```
The `batch=4+4` notation reflects the per-instance batch (instance 0 + instance 1). The `alltoall` field shows the wave-synchronized ALLTOALL message size, which equals `max(total_len_per_instance) * hidden_size * fp_size`.
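As a quick sanity check of that formula, here is a sketch of the computation. The `hidden_size` and token counts below are illustrative assumptions chosen to reproduce the 512 KB figure in the log, not values read from the profile:

```python
# Sketch: wave-synchronized ALLTOALL message size. Both DP-group members
# use the max token count so the network model sees the heavier side.
def alltoall_bytes(total_len_per_instance, hidden_size, fp_size):
    return max(total_len_per_instance) * hidden_size * fp_size

# e.g. two instances with 128 and 96 tokens in flight, bf16 (2 bytes/elem)
size = alltoall_bytes([128, 96], hidden_size=2048, fp_size=2)
print(size // 1024, "KB")  # 512 KB
```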
## What's interesting

- Expert weights split across instances, not just GPUs. With `ep_size=2` and 128 total experts, each instance holds 64. Per-GPU weight memory is roughly halved compared to single-instance `EP=1`.
- One ASTRA-Sim process for both instances. Wave-synchronized scheduling across the DP group means the simulator generates `.et` files for both instances that share stream IDs on the ALLTOALL, forcing ASTRA-Sim to block until both NPUs reach the collective.
- The idle instance gets a dummy batch. When one instance has no pending work, the scheduler synthesizes a 1-decode-token batch so it can still participate in the wave's ALLTOALL. The same applies when one instance finishes early: it keeps generating dummies until the whole group is done.
- `comm_size` is synchronized. Even if instance A has a much bigger batch than instance B, both pass the same ALLTOALL size (the max of the two) to ASTRA-Sim, so the network model sees the heavier side.
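The dummy-batch behavior can be sketched as a small wave builder. This is a hypothetical illustration of the scheduling rule described above, not the simulator's scheduler:

```python
# Sketch: build one wave across the DP group. An instance with no pending
# work gets a synthetic 1-decode-token batch so it can still join the
# group-wide ALLTOALL. Illustrative only.
def build_wave(pending_per_instance):
    wave = []
    for pending in pending_per_instance:
        if pending:
            wave.append(pending)
        else:
            # dummy batch: one decode token, just enough to participate
            wave.append([{"type": "decode", "tokens": 1, "dummy": True}])
    return wave

# Instance 0 has real work; instance 1 is idle and gets a dummy batch.
wave = build_wave([[{"type": "decode", "tokens": 4, "dummy": False}], []])
print(wave[1][0]["dummy"])  # True
```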
## Related examples

- Expert parallel: the same MoE model on a single instance (EP within one TP group).
- Multi-instance LOAD routing: the non-DP version of multi-instance, independent replicas without expert sharing.
- Cluster config explained: the field-level walkthrough of how `dp_group` lights up the 2D topology.