Trace file format

The simulator's trace_generator.py writes a per-batch text trace that the Chakra converter then reads to produce the .et file ASTRA-Sim consumes. This page is the field-by-field spec of that text trace.

For the internals of how this trace is produced, see Simulator → Trace generation.

File location

astra-sim/inputs/trace/<hardware>/<model>/instance_<i>_batch_<b>.txt

One file per (instance × batch). Regenerated every iteration.

File structure

```
COLOCATED model_parallel_NPU_group: {npu_group}
{num_layers}
Layername comp_time input_loc input_size weight_loc weight_size output_loc output_size comm_type comm_size misc
embedding_0 5621 REMOTE:0 40 LOCAL 1050673152 LOCAL 81920 NONE 0 NONE
layernorm_0 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
qkv_proj_0 8324 LOCAL 81920 LOCAL 25165824 LOCAL 245760 NONE 0 NONE
...
sampler_291 25933 LOCAL 2565120 LOCAL 0 REMOTE:0 40 NONE 0 NONE
```

Header (lines 1–3)

| Line | Content | Meaning |
| --- | --- | --- |
| 1 | `COLOCATED\tmodel_parallel_NPU_group: {npu_group}` | Trace mode marker. `npu_group` is the comma-separated list of NPU IDs in this instance |
| 2 | `{num_layers}` | Number of layer rows that follow |
| 3 | Column header (tab-separated) | Field names |
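As a concrete illustration, the three header lines can be pulled apart with a few lines of Python. This is a hedged sketch: the function name and return shape are assumptions for this page, not the real converter's API.

```python
def parse_header(lines):
    """Parse the first three (newline-stripped) lines of a trace file.

    Illustrative only; the authoritative parsing lives in the Chakra converter.
    """
    # Line 1: "COLOCATED<TAB>model_parallel_NPU_group: 0,1"
    mode, group_part = lines[0].split("\t", 1)
    npu_ids = [int(x) for x in group_part.split(":", 1)[1].split(",")]
    # Line 2: number of layer rows that follow
    num_layers = int(lines[1])
    # Line 3: tab-separated column names
    columns = lines[2].split("\t")
    return mode, npu_ids, num_layers, columns
```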

Layer rows

Each row has 11 tab-separated fields:

| Field | Type | Meaning |
| --- | --- | --- |
| Layername | string | Canonical layer name + index (e.g., `qkv_proj_0`, `attention_31`) |
| comp_time | int | Computation latency in nanoseconds |
| input_loc | enum | Where the input tensor lives (see memory locations) |
| input_size | int | Input tensor size in bytes |
| weight_loc | enum | Where the layer's weights live |
| weight_size | int | Weight size in bytes |
| output_loc | enum | Where the output tensor will be written |
| output_size | int | Output tensor size in bytes |
| comm_type | enum | Collective type run after this layer (see communication types) |
| comm_size | int | Collective message size in bytes (0 if comm_type is NONE) |
| misc | string | Miscellaneous tag (sub-batch interleaving, etc.; usually NONE) |
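The 11-field row format can be sketched as a small parser. `LayerRow` and `parse_layer_row` are hypothetical names for this page, not part of the real converter:

```python
from typing import NamedTuple

class LayerRow(NamedTuple):
    layername: str
    comp_time: int     # nanoseconds
    input_loc: str
    input_size: int    # bytes
    weight_loc: str
    weight_size: int   # bytes
    output_loc: str
    output_size: int   # bytes
    comm_type: str
    comm_size: int     # bytes; 0 when comm_type is NONE
    misc: str

def parse_layer_row(line: str) -> LayerRow:
    """Parse one tab-separated layer row (illustrative sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    return LayerRow(
        fields[0], int(fields[1]),
        fields[2], int(fields[3]),
        fields[4], int(fields[5]),
        fields[6], int(fields[7]),
        fields[8], int(fields[9]),
        fields[10],
    )

row = parse_layer_row(
    "qkv_proj_0\t8324\tLOCAL\t81920\tLOCAL\t25165824\tLOCAL\t245760\tNONE\t0\tNONE"
)
```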

Memory locations

The input_loc, weight_loc, and output_loc fields use one of:

| Value | Meaning | Backed by |
| --- | --- | --- |
| LOCAL | NPU memory | per-instance NPU |
| REMOTE:{node_id} | CPU memory on the named node | per-node cpu_mem |
| CXL:{device_id} | CXL device memory | top-level cxl_mem block |
| STORAGE | Storage tier (used by power model only) | (none) |

The numeric IDs match the C++ enum in astra-sim/astra-sim/system/AstraMemoryAPI.hh:

| Symbol | Value |
| --- | --- |
| LOCAL | 1 |
| REMOTE | 2 |
| CXL | 3 |
| STORAGE | 4 |

These must stay in sync between the trace and the C++ enum; mismatches cause silent miscounting.
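A minimal sketch of mapping the textual location fields onto those numeric codes. The dictionary and helper below are illustrative; the authoritative values live in AstraMemoryAPI.hh:

```python
# Numeric codes mirroring the C++ enum in AstraMemoryAPI.hh (illustrative copy).
LOC_CODES = {"LOCAL": 1, "REMOTE": 2, "CXL": 3, "STORAGE": 4}

def parse_loc(field: str):
    """Return (numeric_code, device_id) for a *_loc field.

    device_id is None for LOCAL / STORAGE, which carry no ':<id>' suffix.
    """
    base, _, dev = field.partition(":")
    if base not in LOC_CODES:
        raise ValueError(f"unknown memory location: {field!r}")
    return LOC_CODES[base], int(dev) if dev else None
```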

First and last layer must use REMOTE

The Chakra converter emits a MEM_LOAD_NODE from the first layer's input_loc and a MEM_STORE_NODE from the last layer's output_loc. Both must be REMOTE:{node_id} (CPU side): the simulator models the request entering / leaving the NPU as a host-side transfer.

This is why embedding_0 has input_loc=REMOTE:0 and sampler_* has output_loc=REMOTE:0 in the example above.
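A quick sanity check for this rule might look like the following sketch. The helper name and the tuple shape are assumptions for this page, not the converter's API:

```python
def check_remote_endpoints(rows):
    """rows: parsed layer rows as (layername, input_loc, output_loc) tuples.

    Raises ValueError unless the first layer loads from CPU memory and the
    last layer stores back to it, as the Chakra converter expects.
    """
    first_name, first_in, _ = rows[0]
    last_name, _, last_out = rows[-1]
    if not first_in.startswith("REMOTE:"):
        raise ValueError(f"{first_name}: input_loc must be REMOTE:<node_id>, got {first_in}")
    if not last_out.startswith("REMOTE:"):
        raise ValueError(f"{last_name}: output_loc must be REMOTE:<node_id>, got {last_out}")
```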

Communication types

The comm_type field selects the collective ASTRA-Sim runs after this layer:

| Value | Meaning | When emitted |
| --- | --- | --- |
| NONE | No collective | Most layers |
| ALLREDUCE | All-reduce across the involved dimensions | After o_proj and down_proj (TP > 1) |
| ALLTOALL | All-to-all dispatch / combine | Around the MoE block (EP-aware) |

Dimension scoping

For multi-dimensional ASTRA-Sim topologies (DP+EP layouts), the comm_type can include a dimension scope suffix:

| Suffix | Meaning |
| --- | --- |
| ALLREDUCE | Default: all dimensions involved |
| ALLREDUCE:1,0 | Dim 0 involved (True), dim 1 not involved (False); i.e., a TP-only all-reduce in a 2D [tp, dp] topology |
| ALLTOALL:0,1 | Dim 0 not involved, dim 1 involved; i.e., an EP-only all-to-all across the DP group |

The Chakra converter parses these via _parse_comm_type and writes the involved_dim BoolList into the .et file. ASTRA-Sim's Workload::issue_comm() reads the BoolList and routes the collective on the named dimensions.
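For illustration, the suffix parsing can be re-implemented in a few lines. This is a hedged sketch only; the authoritative version is the converter's `_parse_comm_type`:

```python
def parse_comm_type(field: str):
    """Split e.g. 'ALLREDUCE:1,0' into ('ALLREDUCE', [True, False]).

    The suffix values map positionally to dimensions: the first entry is
    dim 0. No suffix means the default (all dims involved), represented
    here as an empty list. Illustrative re-implementation, not the real code.
    """
    name, sep, dims = field.partition(":")
    involved = [d == "1" for d in dims.split(",")] if sep else []
    return name, involved
```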

Special markers

Some layers are wrapped by markers:

EXPERT {i} / EXPERT END (MoE)

Wrap the per-rank expert compute:

```
EXPERT 0
moe_expert_local_3_rank0 1842 LOCAL 524288 LOCAL 9437184 LOCAL 524288 ALLTOALL 524288 NONE
EXPERT END
EXPERT 1
moe_expert_local_3_rank1 1804 LOCAL 524288 LOCAL 9437184 LOCAL 524288 ALLTOALL 524288 NONE
EXPERT END
```

ASTRA-Sim runs each EXPERT {i} block on rank i in parallel, synchronizing at the surrounding ALLTOALLs.
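Grouping trace lines into per-rank expert blocks can be sketched as follows (illustrative helper, not part of the real converter):

```python
def split_expert_blocks(lines):
    """Group layer rows by the EXPERT {i} / EXPERT END markers around them.

    Returns {rank: [rows...]}; rows outside any block are ignored here.
    """
    blocks, current = {}, None
    for line in lines:
        if line.startswith("EXPERT END"):
            current = None                      # close the current block
        elif line.startswith("EXPERT "):
            current = int(line.split()[1])      # open block for rank i
            blocks.setdefault(current, [])
        elif current is not None:
            blocks[current].append(line)        # row inside an expert block
    return blocks
```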

PIM {channel} / PIM END (PIM offload)

Wrap PIM-side attention compute:

```
PIM 0
pim_attention_3 4126 LOCAL 245760 LOCAL 0 LOCAL 245760 NONE 0 NONE
PIM END
```

Multiple PIM {channel} blocks can appear back-to-back to model multi-channel parallel attention.

Sub-batch interleaving (misc)

When --enable-sub-batch-interleaving is on, layers carry a batch tag in misc:

```
qkv_proj_3 4128 ... NONE 0 BATCH_1
pim_attention_3 8264 ... NONE 0 BATCH_2
o_proj_3 3845 ... NONE 0 BATCH_1
```

The BATCH_1 and BATCH_2 halves run in parallel: typically, GPU compute runs on one half while PIM attention runs on the other.
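Splitting interleaved rows into their per-batch streams by the misc column can be sketched as (illustrative helper, not part of the real toolchain):

```python
def split_by_batch(rows):
    """Group tab-separated layer rows by their misc tag (the last field)."""
    streams = {}
    for row in rows:
        tag = row.rstrip("\n").split("\t")[-1]   # misc is always field 11
        streams.setdefault(tag, []).append(row)
    return streams
```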

Sample full trace (single instance, TP=1, dense model)

```
COLOCATED model_parallel_NPU_group: 0
228
Layername comp_time input_loc input_size weight_loc weight_size output_loc output_size comm_type comm_size misc
embedding_0 5621 REMOTE:0 40 LOCAL 1050673152 LOCAL 81920 NONE 0 NONE
layernorm_0 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
qkv_proj_0 8324 LOCAL 81920 LOCAL 25165824 LOCAL 245760 NONE 0 NONE
rotary_emb_0 2104 LOCAL 245760 LOCAL 0 LOCAL 245760 NONE 0 NONE
attention_0 18327 LOCAL 245760 LOCAL 0 LOCAL 81920 NONE 0 NONE
o_proj_0 7452 LOCAL 81920 LOCAL 8388608 LOCAL 81920 NONE 0 NONE
... (decoder blocks 1..31 elided) ...
final_layernorm 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
lm_head 28341 LOCAL 81920 LOCAL 1050673152 LOCAL 2565120 NONE 0 NONE
sampler_291 25933 LOCAL 2565120 LOCAL 0 REMOTE:0 40 NONE 0 NONE
```

How the Chakra converter consumes this

The Chakra converter (astra-sim/extern/graph_frontend/chakra/src/converter/llm_converter.py) walks the trace and emits Chakra protobuf nodes:

| Trace row | Chakra node |
| --- | --- |
| First layer | MEM_LOAD_NODE for the input transfer |
| Each compute row | COMP_NODE keyed by comp_time |
| Last layer | MEM_STORE_NODE for the output transfer |
| comm_type != NONE | COMM_COLL_NODE with optional involved_dim BoolList |
| EXPERT {i} block | Sub-graph run on rank i |
| PIM {channel} block | Sub-graph routed to the PIM device |

The .et file is what controller.write_flush then sends to ASTRA-Sim.

Gotchas

  1. comp_time is nanoseconds in the trace but the underlying profile CSVs use microseconds. The conversion happens in _load_perf_db() at simulator startup.
  2. Fields are tab-separated, not space-separated. Mixing tabs and spaces breaks the Chakra parser silently.
  3. Don't hand-edit production traces. They're regenerated every iteration; manual edits get clobbered. To inject custom timings, modify the profile CSVs or the trace generator.
  4. comm_size is the total payload, not per-rank. ASTRA-Sim divides by the number of nodes in the ring internally.
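A tiny lint for gotcha 2 and the 11-field row shape, as an illustrative sketch (not part of the real toolchain):

```python
def lint_row(line: str):
    """Return a list of problems with one layer row; [] means it looks clean.

    Checks only the separator/shape rules above: exactly 11 tab-separated
    fields, with no stray spaces inside a field (a likely sign of mixed
    separators). EXPERT/PIM marker lines legitimately contain spaces and
    should not be fed to this check.
    """
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        problems.append(f"expected 11 tab-separated fields, found {len(fields)}")
    for f in fields:
        if " " in f:
            problems.append(f"field contains a space (mixed separators?): {f!r}")
    return problems
```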

What's next