Parallelism mechanics
This page is the runtime side of parallelism: when a batch hits ASTRA-Sim, what collectives fire, where, and how multi-instance DP groups synchronize. The cluster-config angle (which fields turn each of these on) is on Examples → Cluster config explained.
What the simulator can model
| Style | What's parallelized | Collective | Where it fires |
|---|---|---|---|
| TP (tensor) | Linear weights split along head dim | ALLREDUCE | After o_proj and down_proj |
| PP (pipeline) | Decoder layers split across GPU groups | (point-to-point in inflight queue) | At stage boundaries |
| EP (expert) | MoE experts split across ranks | ALLTOALL | Around the MoE block |
| DP+EP | EP across multiple instances | ALLTOALL | Same, but across instance boundaries with wave-sync |
TP and EP can share the same GPUs. DP+EP requires a dp_group
identifier in the cluster config.
TP: ALLREDUCE on every dense layer
When tp_size > 1, the trace generator attaches an ALLREDUCE
COMM_COLL_NODE after each TP-aware dense linear:
- `o_proj` (attention output projection)
- `down_proj` (MLP output projection)
These are the two layers where each TP rank holds only a partial sum of the output (computed from its local head slice) and needs an ALLREDUCE across ranks.
The comm_size on each ALLREDUCE is the full output tensor size
(not per-rank; ASTRA-Sim divides internally based on
nodes_in_ring).
qkv_proj, gate_up_proj, etc. don't need an ALLREDUCE: their weights are
split along the output (head) dim, so each rank's output is already the
correct shard for the layer that follows. TP's collective cost is therefore
two ALLREDUCEs per decoder block, one after o_proj and one after down_proj.
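To make the placement concrete, here's a minimal sketch of the attachment logic. The helper name and node layout are hypothetical (the real trace generator has its own node API), but the placement and comm_size arithmetic follow the rule above:

```python
# Hypothetical sketch only: attach a TP ALLREDUCE after each row-parallel linear.
# comm_size is the FULL output tensor size; ASTRA-Sim divides by nodes_in_ring.
TP_ALLREDUCE_LAYERS = {"o_proj", "down_proj"}

def add_tp_collectives(layer_names, tp_size, total_len, hidden_size, fp_size=2):
    nodes = []
    for name in layer_names:
        nodes.append({"type": "COMP_NODE", "name": name})
        if tp_size > 1 and name in TP_ALLREDUCE_LAYERS:
            nodes.append({
                "type": "COMM_COLL_NODE",
                "comm_type": "ALLREDUCE:1,0",                    # TP dim (dim 0) only
                "comm_size": total_len * hidden_size * fp_size,  # full tensor, not per-rank
            })
    return nodes

# Example: one decoder block's dense linears.
nodes = add_tp_collectives(
    ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
    tp_size=4, total_len=2048, hidden_size=4096,
)
```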
PP: pipeline stages and inflight
Pipeline parallelism is modeled at the scheduling level rather than
in the trace. The scheduler keeps an inflight list per stage, capped
at pp_size entries. When the pipeline is full, the scheduler
returns None instead of producing a new batch; micro-batches drain
through the stages before more are issued.
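A minimal sketch of that cap, with hypothetical names; the real scheduler keeps per-stage bookkeeping and more state:

```python
# Hypothetical sketch of the inflight cap: at most pp_size micro-batches are in
# the pipeline; a full pipeline makes the scheduler return None until a stage drains.
class PipelineScheduler:
    def __init__(self, pp_size):
        self.pp_size = pp_size
        self.inflight = []            # micro-batches currently inside the pipeline

    def schedule(self, pending):
        if len(self.inflight) >= self.pp_size or not pending:
            return None               # pipeline full (or nothing to do): let it drain
        batch = pending.pop(0)
        self.inflight.append(batch)
        return batch

    def on_stage_exit(self, batch):
        self.inflight.remove(batch)   # frees a slot for the next micro-batch
```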
This makes overall throughput tracking match production frameworks (Megatron-style 1F1B), but the simulator currently doesn't emit detailed inter-stage P2P traffic in the trace; PP results should be treated as a lower bound on the cost.
EP: ALLTOALL around the MoE block
For MoE models, trace_generator wraps the MoE block with two
ALLTOALL collectives:
... → MoE dispatch ALLTOALL → expert compute → MoE combine ALLTOALL → ...
The dispatch ALLTOALL routes each token to its assigned expert's rank. The combine ALLTOALL gathers expert outputs back to the originating ranks. Both are scoped to the EP dimension.
Each EP rank gets a per-rank latency from
profiler/perf/<hw>/<model>/<variant>/tp1/moe.csv, keyed on its
local token count (after dispatch) and the number of activated experts
per token. Ranks execute in parallel and synchronize at the ALLTOALL
barrier; the slowest rank gates the others.
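A minimal sketch of the barrier semantics; the lookup function here is a stand-in for the profiler CSV, not the real reader:

```python
# Hypothetical sketch: per-rank MoE latency comes from the profiler table, and the
# ALLTOALL barrier makes the whole wave wait for the slowest rank.
def moe_block_latency(tokens_per_rank, activated_experts, lookup_moe_latency):
    per_rank = [
        lookup_moe_latency(local_tokens, activated_experts)  # e.g. a row of tp1/moe.csv
        for local_tokens in tokens_per_rank
    ]
    return max(per_rank)  # slowest rank gates the others

# Example with a toy lookup: rank 2 got the most tokens after dispatch, so it sets the pace.
toy_lookup = lambda tokens, experts: 0.01 * tokens * experts
print(moe_block_latency([128, 96, 210, 140], 2, toy_lookup))
```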
Token routing decisions come from gate_function.py. See
MoE expert routing for the policies.
DP+EP: wave synchronization
This is where the simulator gets clever. When two or more instances
share a dp_group, they form a single coordinated wave. Two
synchronization mechanisms work together:
1. Python-side dp_pending barrier
In __main__.py, a dp_pending dict tracks which DP-group members
have scheduled their batches for the current wave. Trace generation
is deferred until all members have scheduled. When the last
member arrives:
- The simulator computes `dp_sum_total_len = sum(total_len)` and `dp_max_total_len = max(total_len)` across the group.
- `comm_size_alltoall` is set to `dp_max_total_len * hidden_size * fp_size`: the max across the group, matching CUDA-graph padding in production MoE serving.
- All members generate their traces with the same `comm_size`, even if their per-instance `total_len` differs.
If one DP member has no pending requests, the scheduler synthesizes a dummy batch (1 decode token) so the wave still runs. When all of one member's real requests have finished but the others haven't, the dummy batches keep flowing until the whole group is done.
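Putting those rules together, here's a minimal sketch of the Python-side barrier. The function and variable names are hypothetical, not the actual __main__.py code, but the deferral, padding-to-max, and dummy-batch behavior match the description above:

```python
# Hypothetical sketch of the dp_pending barrier: trace generation is deferred
# until every member of the DP group has scheduled a batch for the current wave.
dp_pending = {}  # dp_group -> {instance_id: total_len of the scheduled batch}

def submit(dp_group, group_members, instance_id, total_len, hidden_size, fp_size=2):
    # An idle member still joins the wave with a dummy 1-token decode batch.
    dp_pending.setdefault(dp_group, {})[instance_id] = total_len if total_len else 1

    if set(dp_pending[dp_group]) != set(group_members):
        return None  # defer trace generation: not every member has scheduled yet

    lens = dp_pending.pop(dp_group)
    dp_max_total_len = max(lens.values())
    # Every member uses the same (padded) ALLTOALL size, like CUDA-graph padding.
    comm_size_alltoall = dp_max_total_len * hidden_size * fp_size
    return {inst: comm_size_alltoall for inst in lens}

# Example: a two-member group; the second submit releases the whole wave.
submit("g0", ["i0", "i1"], "i0", total_len=512, hidden_size=4096)   # -> None (deferred)
print(submit("g0", ["i0", "i1"], "i1", total_len=0, hidden_size=4096))
# -> {'i0': 4194304, 'i1': 4194304}: both members get the max-padded size
```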
2. ASTRA-Sim ALLTOALL barrier
All DP-group instances' .et files share the same workload folder
(dp_<group>_batch<bid>/llm.et) and use matching stream IDs on
the ALLTOALL collectives. ASTRA-Sim's runtime sees the matching IDs
and blocks until all participating NPUs reach the collective, naturally
implementing the wave-sync at the network layer.
So the two halves of the sync (Python deferral at submission, ASTRA-Sim blocking at the collective) together produce a deterministic wave-synchronous schedule.
2D ASTRA-Sim topology and involved_dim
config_builder generates a 2D ASTRA-Sim network when DP groups are
present. The topology is npus_count: [tp_size, dp_group_size].
Collectives are scoped per dimension via the involved_dim BoolList
on each COMM_COLL_NODE:
- TP-ALLREDUCE: `involved_dim = [True, False]`, dim 0 only.
- EP-ALLTOALL: `involved_dim = [False, True]`, dim 1 only when EP spans the DP group; `[True, True]` if EP also spans TP.
The involved_dim is encoded in the trace's comm_type field with
a :dim0,dim1 suffix:
ALLREDUCE:1,0 # TP only
ALLTOALL:0,1 # EP across DP only
The Chakra converter parses this via _parse_comm_type and writes
the BoolList into the .et file. ASTRA-Sim's Workload::issue_comm
reads it and dispatches the collective only on the involved dims.
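A minimal sketch of what that parsing amounts to (hypothetical code, not the converter's actual implementation):

```python
# Hypothetical sketch of how the ":dim0,dim1" suffix could map to a BoolList.
# The real _parse_comm_type in the Chakra converter may differ in details.
def parse_comm_type(comm_type: str):
    """'ALLREDUCE:1,0' -> ('ALLREDUCE', [True, False])"""
    if ":" not in comm_type:
        return comm_type, [True]          # 1D default: single dim involved
    name, dims = comm_type.split(":", 1)
    involved_dim = [flag == "1" for flag in dims.split(",")]
    return name, involved_dim

assert parse_comm_type("ALLREDUCE:1,0") == ("ALLREDUCE", [True, False])  # TP only
assert parse_comm_type("ALLTOALL:0,1") == ("ALLTOALL", [False, True])    # EP across DP
```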
The system.json collective implementations need one entry per
topology dim; config_builder generates this automatically:
"all-to-all-implementation": ["ring", "ring"] for 2D.
Communication sizes (ASTRA-Sim semantics)
Every comm_size in the trace is the total data size, not
per-NPU. ASTRA-Sim divides internally by the number of nodes in the
ring (msg_size = data_size / nodes_in_ring).
So:
- ALLREDUCE on `o_proj`: pass the full output tensor size (`total_len * hidden_size * fp_size`).
- ALLTOALL for MoE: pass the full activation tensor size (`total_len * hidden_size * fp_size`).
If you see surprisingly fast collectives in your trace logs, check that you're not accidentally passing per-rank sizes; that's a common mistake when extending the trace generator.
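A quick worked check, with illustrative numbers only:

```python
# Illustrative numbers only: total vs. per-NPU size for a TP ALLREDUCE.
total_len, hidden_size, fp_size = 2048, 4096, 2   # tokens, hidden dim, bytes per element
nodes_in_ring = 4                                 # e.g. tp_size = 4

comm_size = total_len * hidden_size * fp_size     # what the trace should carry (total bytes)
msg_size = comm_size // nodes_in_ring             # what ASTRA-Sim uses per ring step

print(comm_size)  # 16777216 bytes (~16 MiB total)
print(msg_size)   # 4194304 bytes (~4 MiB per NPU)
```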
When to use which
A rough decision tree (the configuration angle is on Examples → Cluster config explained):
- Single GPU fits the model: TP=1. Done.
- Need more GPUs for memory: start with TP. ALLREDUCE cost grows with `tp_size`, so going past 4-8 is rarely worth it.
- Multiple replicas for throughput: add `num_instances` (no `dp_group`). Independent instances behind a router.
- MoE model, single instance: add `ep_size = tp_size`. Same GPUs; EP-ALLTOALL replaces TP-ALLREDUCE on the MoE block.
- MoE, want to scale experts past one instance's GPUs: DP+EP with `dp_group` set. EP spans instances via wave-sync (sketched below).
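For the DP+EP case, a hedged sketch of how those fields might sit together in a spec; the field names follow this page, and the exact schema lives on Examples → Cluster config explained:

```python
# Hypothetical field layout only; the authoritative schema is on the Cluster config page.
cluster_spec = {
    "num_instances": 2,    # two instances form the DP group
    "tp_size": 4,          # TP within each instance
    "ep_size": 8,          # experts span both instances (ep_size > tp_size needs dp_group)
    "dp_group": "moe_a",   # same identifier on both instances -> one wave-synced group
}
```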
Gotchas
- `ep_size > tp_size` requires `dp_group`. Otherwise the cluster config builder rejects the spec. EP needs the 2D topology to scale beyond a single instance's GPU count.
- Dummy batches are real ASTRA-Sim work. A DP group with one idle instance still pays the ALLTOALL cost on the dummy batch. This is what production looks like; wave-sync is wave-sync.
- `comm_size` is synchronized to the max. Even if one DP member's batch is much smaller, the ALLTOALL message size matches the largest member's. This is correct (matches production padding) but worth knowing.
- PP doesn't yet model inter-stage forwarding cost in detail. Take PP latency results as a lower bound.
What's next
- MoE expert routing: how tokens get distributed across EP ranks before the dispatch ALLTOALL.
- Examples → DP+EP MoE: a worked-out config that exercises this whole machinery.