Expert parallel (MoE)

What this demonstrates: sharding MoE experts across GPUs within a single instance, and adding ALLTOALL collectives around the MoE block.

For mixture-of-experts models, you have two options for spreading work across GPUs: TP (shard each linear within an expert) or EP (place different experts on different GPUs). EP is the natural fit for MoE because it touches only the experts; the rest of the layer still uses TP.

By default, ep_size and tp_size share the same GPUs: TP runs the dense parts (qkv, o_proj, gate_up_proj, down_proj), while EP runs the experts.
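The split above can be sketched as a per-sublayer plan. This is an illustrative mapping only (the names and structure are assumptions, not the simulator's actual API); it records which parallelism scheme owns each part of a layer and which collective it triggers:

```python
# Illustrative sketch of one transformer layer on the 2 shared GPUs.
# TP shards each dense weight matrix; EP places whole experts per GPU.
LAYER_PLAN = {
    # Dense sublayers: TP-sharded; row-parallel outputs need an ALLREDUCE.
    "qkv":          {"parallelism": "TP", "collective": None},
    "o_proj":       {"parallelism": "TP", "collective": "ALLREDUCE"},
    "gate_up_proj": {"parallelism": "TP", "collective": None},
    "down_proj":    {"parallelism": "TP", "collective": "ALLREDUCE"},
    # MoE block: experts distributed across EP ranks; tokens are dispatched
    # to their experts' GPUs and gathered back.
    "experts":      {"parallelism": "EP", "collective": "ALLTOALL"},
}

for part, plan in LAYER_PLAN.items():
    print(part, plan["parallelism"], plan["collective"])
```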

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for Qwen/Qwen3-30B-A3B-Instruct-2507

Cluster config

configs/cluster/single_node_moe_single_instance.json:

{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "Qwen/Qwen3-30B-A3B-Instruct-2507",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 2,
          "tp_size": 2,
          "ep_size": 2,
          "pd_type": null
        }
      ]
    }
  ]
}

The MoE-specific knob:

  • ep_size: 2: experts split across the 2 GPUs in this instance.
  • TP=2 + EP=2 share the same 2 GPUs: dense layers run TP-ALLREDUCE on the pair, while MoE layers run EP-ALLTOALL.

Qwen3-30B-A3B has 128 experts, so each GPU holds 64. The constraint that ep_size evenly divides the expert count (yielding num_local_experts per GPU) is checked at startup.
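A minimal sketch of that startup check (illustrative, not the simulator's actual code):

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    """Validate the EP divisibility constraint and return experts per GPU."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} must evenly divide num_experts={num_experts}"
        )
    return num_experts // ep_size

# Qwen3-30B-A3B with ep_size=2 from the config above:
print(experts_per_gpu(128, 2))  # 64
```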

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_moe_single_instance.json' \
--dtype bfloat16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/moe_ep2_run.csv' \
--log-interval 1.0

By default, expert routing uses --expert-routing-policy COPY: the fast block-copy path. Use --expert-routing-policy RR or RAND to mimic real per-token random routing (slower simulation).
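One way to picture the difference between the RR and RAND policies is as token-to-expert assignment. This sketch is an assumption about what "round-robin" and "random" mean here, not the simulator's implementation:

```python
import random

def route_tokens(num_tokens, num_experts, top_k, policy, seed=0):
    """Assign each token a set of top_k experts under a routing policy."""
    rng = random.Random(seed)
    routes = []
    for t in range(num_tokens):
        if policy == "RR":
            # Deterministic round-robin spread over the expert pool.
            routes.append([(t * top_k + i) % num_experts for i in range(top_k)])
        elif policy == "RAND":
            # Mimic real per-token random gating: top_k distinct experts.
            routes.append(rng.sample(range(num_experts), top_k))
        else:
            raise ValueError(f"unknown policy {policy!r}")
    return routes

# 4 tokens, Qwen3-30B-A3B's 128 experts, 8 active per token:
print(route_tokens(4, 128, 8, "RR")[0])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Per-token routing like this exercises more distinct expert blocks per step than the COPY fast path, which is why the simulation runs slower.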

Expected output

[INFO] step=10 batch=4 prompt_t=900 tok/s decode_t=320 tok/s npu_mem=72.1 GB
[INFO] step=11 batch=4 prompt_t=920 tok/s decode_t=330 tok/s npu_mem=72.1 GB

Compared with a hypothetical TP=2, EP=1 run on the same model, each forward pass now adds two ALLTOALL phases around the MoE block (dispatch tokens to their expert GPUs, then gather them back) in place of a TP-ALLREDUCE there; the dense linears still run TP as before. With only 8 active experts per token, per-token expert compute shrinks substantially.
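The "8 active of 128" claim is easy to check with a back-of-envelope count of active versus total expert parameters per MoE layer. The dimensions here (hidden=2048, moe_intermediate=768, three projection matrices per expert: gate, up, down) are assumptions about Qwen3-30B-A3B's config; verify against the model card before reusing:

```python
# Back-of-envelope: fraction of expert weights touched per token.
hidden, inter = 2048, 768            # assumed Qwen3-30B-A3B dims
per_expert = 3 * hidden * inter      # gate + up + down weight matrices
total  = 128 * per_expert            # all experts in one MoE layer
active = 8 * per_expert              # only the routed experts run

print(active / total)  # 0.0625 -> 1/16 of expert compute per token
```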

What's interesting

  • Expert weight memory halves vs. EP=1: each GPU holds 64 of the 128 experts. With Qwen3-30B-A3B's expert size, that's a multi-GB saving per GPU.
  • ALLTOALL replaces TP-ALLREDUCE in the MoE block. The per-iteration comm_size for ALLTOALL is local_tokens * hidden_size * fp (bytes per element): proportional to active tokens, not weights, so latency scales with batch size, not model size.
  • The number of activated experts per token matters more than the total expert count. Qwen3-30B-A3B activates 8 of 128; that's the load on the inner expert kernels, regardless of EP degree.
  • DP+EP MoE: extend EP across multiple instances. This is what you reach for when EP needs to grow past one instance's GPUs.
  • Tensor parallel: the dense-model counterpart.
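The comm_size formula in the bullets above can be checked numerically. Here hidden_size=2048 is an assumed value for Qwen3-30B-A3B and bf16 gives 2 bytes per element, matching the --dtype bfloat16 run:

```python
def alltoall_bytes(local_tokens, hidden_size=2048, bytes_per_elt=2):
    """comm_size for one ALLTOALL phase: local_tokens * hidden_size * fp."""
    return local_tokens * hidden_size * bytes_per_elt

# 256 local tokens in bf16: about 1 MB per dispatch (and again per gather).
print(alltoall_bytes(256) / 1e6, "MB")  # 1.048576 MB
```

Doubling the batch doubles this; changing the model's expert count leaves it untouched, which is the point of the "tokens, not weights" bullet.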