Expert parallel (MoE)
What this demonstrates: sharding MoE experts across GPUs within a single instance, and adding ALLTOALL collectives around the MoE block.
For mixture-of-experts models, you have two options for spreading work across GPUs: TP (shard each linear within an expert) or EP (place different experts on different GPUs). EP is the natural fit for MoE because it touches only the experts; the rest of the layer still uses TP.
ep_size and tp_size share the same GPUs by default: TP runs the dense parts (qkv, o_proj, gate_up_proj, down_proj), while EP runs the experts.
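To make the sharing concrete, here is a minimal sketch (not the simulator's code) of the per-GPU layout this setup produces; the contiguous expert-to-GPU assignment is an assumption for illustration:

def gpu_layout(num_experts=128, tp_size=2, ep_size=2):
    # Each GPU carries a TP shard of the dense linears plus an EP slice of experts.
    # Contiguous expert assignment is assumed here purely for illustration.
    per_rank = num_experts // ep_size
    for gpu in range(tp_size):  # TP and EP ranks coincide on the same GPUs
        lo, hi = gpu * per_rank, (gpu + 1) * per_rank - 1
        print(f"GPU {gpu}: 1/{tp_size} shard of qkv/o_proj/gate_up_proj/down_proj, "
              f"experts {lo}-{hi}")

gpu_layout()
# GPU 0: 1/2 shard of qkv/o_proj/gate_up_proj/down_proj, experts 0-63
# GPU 1: 1/2 shard of qkv/o_proj/gate_up_proj/down_proj, experts 64-127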
Prerequisites
- Simulator container set up
- Bundled RTXPRO6000 profile for Qwen/Qwen3-30B-A3B-Instruct-2507
Cluster config
configs/cluster/single_node_moe_single_instance.json:
{
"num_nodes": 1,
"link_bw": 16,
"link_latency": 20000,
"nodes": [
{
"num_instances": 1,
"cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
"instances": [
{
"model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
"hardware": "RTXPRO6000",
"npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
"num_npus": 2,
"tp_size": 2,
"ep_size": 2,
"pd_type": null
}
]
}
]
}
The MoE-specific knobs:
- ep_size: 2: experts are split across the 2 GPUs in this instance.
- TP=2 + EP=2 share the same 2 GPUs. Dense layers run TP-ALLREDUCE on the same pair; MoE layers run EP-ALLTOALL.

Qwen3-30B-A3B has 128 experts, so each GPU holds 64. The constraint that ep_size divides num_local_experts is checked at startup.
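A minimal sketch of that startup check; the function name and error message are illustrative, not the simulator's actual code:

def experts_per_gpu(num_local_experts, ep_size):
    # ep_size must divide the expert count so every GPU owns a whole number of experts.
    if num_local_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} must divide num_local_experts={num_local_experts}")
    return num_local_experts // ep_size

print(experts_per_gpu(128, 2))  # 64 experts on each of the 2 GPUs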
Run
python -m serving \
--cluster-config 'configs/cluster/single_node_moe_single_instance.json' \
--dtype bfloat16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/moe_ep2_run.csv' \
--log-interval 1.0
By default, expert routing uses --expert-routing-policy COPY: the
fast block-copy path. Use --expert-routing-policy RR or RAND to
mimic real per-token random routing (slower simulation).
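As a rough illustration of what those policies could imply for routing (assuming RR means round-robin and treating COPY's block-copy path as "every token reuses one fixed expert block"; none of this is the simulator's actual implementation):

import random

def route_tokens(policy, num_tokens, num_experts=128, top_k=8, seed=0):
    # Returns, per token, the list of expert indices it would be sent to.
    rng = random.Random(seed)
    if policy == "COPY":
        # Fast path: every token reuses one fixed expert block (assumption).
        return [list(range(top_k))] * num_tokens
    if policy == "RR":
        # Round-robin across experts, deterministic per token position.
        return [[(i + j) % num_experts for j in range(top_k)] for i in range(num_tokens)]
    if policy == "RAND":
        # Random per-token routing, closest to real gating behaviour.
        return [rng.sample(range(num_experts), top_k) for _ in range(num_tokens)]
    raise ValueError(f"unknown policy: {policy}")

print(route_tokens("RR", num_tokens=2))  # token 0 -> experts 0..7, token 1 -> experts 1..8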
Expected output
[INFO] step=10 batch=4 prompt_t=900 tok/s decode_t=320 tok/s npu_mem=72.1 GB
[INFO] step=11 batch=4 prompt_t=920 tok/s decode_t=330 tok/s npu_mem=72.1 GB
Compared to a hypothetical TP=2 EP=1 run on the same model: each forward pass now adds two ALLTOALL phases around the MoE block (dispatching tokens to the GPUs that own their experts, then gathering the results back) in place of a TP-ALLREDUCE over the expert linears. With only 8 of 128 experts active per token, the per-token expert compute is a small fraction of what a dense model of the same size would need.
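To see what the dispatch phase moves, here is a toy count of how many (token, expert) pairs each EP rank would receive under random routing; the token count is arbitrary and this is not the simulator's routing code:

from collections import Counter
import random

def dispatch_counts(num_tokens, top_k=8, num_experts=128, ep_size=2, seed=0):
    # Count the (token, expert) pairs each EP rank receives in the dispatch ALLTOALL.
    rng = random.Random(seed)
    per_rank = num_experts // ep_size
    counts = Counter()
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            counts[expert // per_rank] += 1
    return dict(counts)

print(dispatch_counts(256))  # ~1024 pairs per rank: 256 tokens * 8 experts, split 2 ways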
What's interesting
- Expert weight memory halves vs. EP=1: each GPU holds 64 of the 128 experts. With Qwen3-30B-A3B's expert size, that's a multi-GB saving per GPU.
- ALLTOALL replaces TP-ALLREDUCE in the MoE block. The per-iteration `comm_size` for ALLTOALL is `local_tokens * hidden_size * fp`: proportional to active tokens, not weights, so latency scales with batch size, not model size. See the arithmetic sketch after this list.
- The number of activated experts per token matters more than the total number of experts. Qwen3-30B-A3B activates 8 of 128; that is the load on the inner expert kernels, regardless of EP degree.
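A back-of-the-envelope instance of that comm_size formula; the token count is an arbitrary example and the hidden size is an assumption that should be read from the model config:

hidden_size = 2048      # assumed hidden size for Qwen3-30B-A3B; check the model config
fp = 2                  # bytes per element in bfloat16
local_tokens = 256      # illustrative number of tokens on this rank in one iteration

comm_size = local_tokens * hidden_size * fp   # bytes moved in one ALLTOALL phase
print(f"{comm_size / 2**20:.1f} MiB")         # 1.0 MiB; doubling the batch doubles this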
Related examples
- DP+EP MoE: extend EP across multiple instances. This is what you reach for when EP needs to grow past one instance's GPUs.
- Tensor parallel: the dense-model counterpart.