Expert parallel (MoE)

What this demonstrates: sharding MoE experts across GPUs within a single instance, and adding ALLTOALL collectives around the MoE block.

For mixture-of-experts models, you have two options for spreading work across GPUs: TP (shard each linear within an expert) or EP (place different experts on different GPUs). EP is the natural fit for MoE because it touches only the experts; the rest of the layer still uses TP.

By default, ep_size and tp_size share the same GPUs: TP runs the dense parts (qkv, o_proj, gate_up_proj, down_proj), while EP runs the experts.
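The split above can be sketched as a per-sublayer plan. This is an illustrative mapping only (the names and structure are assumptions, not the simulator's actual API); it records which parallelism scheme owns each part of a layer and which collective it triggers:

```python
# Illustrative sketch of one transformer layer on the 2 shared GPUs.
# TP shards each dense weight matrix; EP places whole experts per GPU.
LAYER_PLAN = {
    # Dense sublayers: TP-sharded; row-parallel outputs need an ALLREDUCE.
    "qkv":          {"parallelism": "TP", "collective": None},
    "o_proj":       {"parallelism": "TP", "collective": "ALLREDUCE"},
    "gate_up_proj": {"parallelism": "TP", "collective": None},
    "down_proj":    {"parallelism": "TP", "collective": "ALLREDUCE"},
    # MoE block: experts distributed across EP ranks; tokens are dispatched
    # to their experts' GPUs and gathered back.
    "experts":      {"parallelism": "EP", "collective": "ALLTOALL"},
}

for part, plan in LAYER_PLAN.items():
    print(part, plan["parallelism"], plan["collective"])
```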

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for Qwen/Qwen3-30B-A3B-Instruct-2507

Cluster config

configs/cluster/single_node_moe_single_instance.json:

{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 2,
          "tp_size": 2,
          "ep_size": 2,
          "pd_type": null
        }
      ]
    }
  ]
}

The MoE-specific knob:

  • ep_size: 2: experts split across the 2 GPUs in this instance.
  • TP=2 + EP=2 share the same 2 GPUs: dense layers run TP-ALLREDUCE on the pair, while MoE layers run EP-ALLTOALL.

Qwen3-30B-A3B has 128 experts, so each GPU holds 64. The constraint that ep_size evenly divides the expert count (yielding num_local_experts per GPU) is checked at startup.
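A minimal sketch of that startup check (illustrative, not the simulator's actual code):

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    """Validate the EP divisibility constraint and return experts per GPU."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} must evenly divide num_experts={num_experts}"
        )
    return num_experts // ep_size

# Qwen3-30B-A3B with ep_size=2 from the config above:
print(experts_per_gpu(128, 2))  # 64
```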

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_moe_single_instance.json' \
--dtype bfloat16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/moe_ep2_run.csv' \
--log-interval 1.0

By default, expert routing uses --expert-routing-policy COPY: the fast block-copy path. Use --expert-routing-policy RR or RAND to mimic real per-token random routing (slower simulation).
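One way to picture the difference between the RR and RAND policies is as token-to-expert assignment. This sketch is an assumption about what "round-robin" and "random" mean here, not the simulator's implementation:

```python
import random

def route_tokens(num_tokens, num_experts, top_k, policy, seed=0):
    """Assign each token a set of top_k experts under a routing policy."""
    rng = random.Random(seed)
    routes = []
    for t in range(num_tokens):
        if policy == "RR":
            # Deterministic round-robin spread over the expert pool.
            routes.append([(t * top_k + i) % num_experts for i in range(top_k)])
        elif policy == "RAND":
            # Mimic real per-token random gating: top_k distinct experts.
            routes.append(rng.sample(range(num_experts), top_k))
        else:
            raise ValueError(f"unknown policy {policy!r}")
    return routes

# 4 tokens, Qwen3-30B-A3B's 128 experts, 8 active per token:
print(route_tokens(4, 128, 8, "RR")[0])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Per-token routing like this exercises more distinct expert blocks per step than the COPY fast path, which is why the simulation runs slower.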

Expected output

[INFO] step=10 batch=4 prompt_t=900 tok/s decode_t=320 tok/s npu_mem=72.1 GB
[INFO] step=11 batch=4 prompt_t=920 tok/s decode_t=330 tok/s npu_mem=72.1 GB

Compared with a hypothetical TP=2, EP=1 run on the same model, each forward pass now adds two ALLTOALL phases around the MoE block (dispatch tokens to their expert GPUs, then gather them back) in place of a TP-ALLREDUCE there; the dense linears still run TP as before. With only 8 active experts per token, per-token expert compute shrinks substantially.
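The "8 active of 128" claim is easy to check with a back-of-envelope count of active versus total expert parameters per MoE layer. The dimensions here (hidden=2048, moe_intermediate=768, three projection matrices per expert: gate, up, down) are assumptions about Qwen3-30B-A3B's config; verify against the model card before reusing:

```python
# Back-of-envelope: fraction of expert weights touched per token.
hidden, inter = 2048, 768            # assumed Qwen3-30B-A3B dims
per_expert = 3 * hidden * inter      # gate + up + down weight matrices
total  = 128 * per_expert            # all experts in one MoE layer
active = 8 * per_expert              # only the routed experts run

print(active / total)  # 0.0625 -> 1/16 of expert compute per token
```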

What's interesting

  • Expert weight memory halves vs. EP=1: each GPU holds 64 of the 128 experts. With Qwen3-30B-A3B's expert size, that's a multi-GB saving per GPU.
  • ALLTOALL replaces TP-ALLREDUCE in the MoE block. The per-iteration comm_size for ALLTOALL is local_tokens * hidden_size * fp (bytes per element): proportional to active tokens, not weights, so latency scales with batch size, not model size.
  • The number of activated experts per token matters more than the total expert count. Qwen3-30B-A3B activates 8 of 128; that's the load on the inner expert kernels, regardless of EP degree.
  • DP+EP MoE: extend EP across multiple instances. This is what you reach for when EP needs to grow past one instance's GPUs.
  • Tensor parallel: the dense-model counterpart.
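The comm_size formula in the bullets above can be checked numerically. Here hidden_size=2048 is an assumed value for Qwen3-30B-A3B and bf16 gives 2 bytes per element, matching the --dtype bfloat16 run:

```python
def alltoall_bytes(local_tokens, hidden_size=2048, bytes_per_elt=2):
    """comm_size for one ALLTOALL phase: local_tokens * hidden_size * fp."""
    return local_tokens * hidden_size * bytes_per_elt

# 256 local tokens in bf16: about 1 MB per dispatch (and again per gather).
print(alltoall_bytes(256) / 1e6, "MB")  # 1.048576 MB
```

Doubling the batch doubles this; changing the model's expert count leaves it untouched, which is the point of the "tokens, not weights" bullet.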