Trace file format
The simulator's trace_generator.py writes a per-batch text trace
that the Chakra converter then reads to produce the .et file
ASTRA-Sim consumes. This page is the field-by-field spec of that
text trace.
For the internals of how this trace is produced, see Simulator → Trace generation.
File location
astra-sim/inputs/trace/<hardware>/<model>/instance_<i>_batch_<b>.txt
One file per (instance × batch). Regenerated every iteration.
File structure
COLOCATED model_parallel_NPU_group: {npu_group}
{num_layers}
Layername comp_time input_loc input_size weight_loc weight_size output_loc output_size comm_type comm_size misc
embedding_0 5621 REMOTE:0 40 LOCAL 1050673152 LOCAL 81920 NONE 0 NONE
layernorm_0 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
qkv_proj_0 8324 LOCAL 81920 LOCAL 25165824 LOCAL 245760 NONE 0 NONE
...
sampler_291 25933 LOCAL 2565120 LOCAL 0 REMOTE:0 40 NONE 0 NONE
Header (lines 1–3)
| Line | Content | Meaning |
|---|---|---|
| 1 | COLOCATED\tmodel_parallel_NPU_group: {npu_group} | Trace mode marker. npu_group is the comma-separated list of NPU IDs in this instance |
| 2 | {num_layers} | Number of layer rows that follow |
| 3 | column header (tab-separated) | Field names |
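As a concrete illustration, a minimal header reader might look like the sketch below. This is not the shipped parser; the TraceHeader name and read_header helper are hypothetical, but the three-line layout follows the table above.

```python
# Illustrative sketch of reading the 3-line trace header.
from dataclasses import dataclass

@dataclass
class TraceHeader:
    npu_group: list[int]   # NPU IDs in this instance (line 1)
    num_layers: int        # number of layer rows that follow (line 2)
    columns: list[str]     # tab-separated field names (line 3)

def read_header(path: str) -> TraceHeader:
    with open(path) as f:
        # Line 1: "COLOCATED\tmodel_parallel_NPU_group: {npu_group}"
        mode_line = f.readline().rstrip("\n")
        group_str = mode_line.split("model_parallel_NPU_group:")[1].strip()
        npu_group = [int(x) for x in group_str.split(",")]
        num_layers = int(f.readline().strip())
        columns = f.readline().rstrip("\n").split("\t")
    return TraceHeader(npu_group, num_layers, columns)
```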
Layer rows
Each row has 11 tab-separated fields:
| Field | Type | Meaning |
|---|---|---|
| Layername | string | Canonical layer name + index (e.g., qkv_proj_0, attention_31) |
| comp_time | int | Computation latency in nanoseconds |
| input_loc | enum | Where the input tensor lives (see memory locations) |
| input_size | int | Input tensor size in bytes |
| weight_loc | enum | Where the layer's weights live |
| weight_size | int | Weight size in bytes |
| output_loc | enum | Where the output tensor will be written |
| output_size | int | Output tensor size in bytes |
| comm_type | enum | Collective type after this layer (see communication) |
| comm_size | int | Collective message size in bytes (0 if comm_type is NONE) |
| misc | string | Misc tag (sub-batch interleaving, etc.; usually NONE) |
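For illustration, the row format maps naturally onto a small dataclass. The LayerRow name and parse_row helper below are hypothetical, but the field order and types follow the table above.

```python
# Hypothetical row parser matching the 11-field spec above.
from dataclasses import dataclass

@dataclass
class LayerRow:
    layername: str
    comp_time: int      # nanoseconds
    input_loc: str
    input_size: int     # bytes
    weight_loc: str
    weight_size: int    # bytes
    output_loc: str
    output_size: int    # bytes
    comm_type: str
    comm_size: int      # bytes; 0 when comm_type is NONE
    misc: str

def parse_row(line: str) -> LayerRow:
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(f)}")
    return LayerRow(f[0], int(f[1]), f[2], int(f[3]), f[4], int(f[5]),
                    f[6], int(f[7]), f[8], int(f[9]), f[10])
```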
Memory locations
The input_loc, weight_loc, and output_loc fields use one of:
| Value | Meaning | Backed by |
|---|---|---|
| LOCAL | NPU memory | per-instance NPU |
| REMOTE:{node_id} | CPU memory on the named node | per-node cpu_mem |
| CXL:{device_id} | CXL device memory | top-level cxl_mem block |
| STORAGE | Storage tier (used by power model only) | (none) |
The numeric IDs match the C++ enum in
astra-sim/astra-sim/system/AstraMemoryAPI.hh:
| Symbol | Value |
|---|---|
| LOCAL | 1 |
| REMOTE | 2 |
| CXL | 3 |
| STORAGE | 4 |
These must stay in sync between the trace and the C++ enum; mismatches cause silent miscounting.
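One way to keep the two sides aligned on the Python side is to mirror the enum values in a single dictionary. This is an illustrative sketch; MEMORY_LOCATION_IDS and location_id are not names from the codebase.

```python
# Python-side mirror of the AstraMemoryAPI.hh enum values listed above.
# If either side changes, the other must be updated in lockstep.
MEMORY_LOCATION_IDS = {
    "LOCAL": 1,
    "REMOTE": 2,
    "CXL": 3,
    "STORAGE": 4,
}

def location_id(loc_field: str) -> int:
    """Map a trace field like 'REMOTE:0' or 'LOCAL' to the C++ enum value."""
    base = loc_field.split(":")[0]   # strip the ':{node_id}' suffix, if any
    return MEMORY_LOCATION_IDS[base]
```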
First and last layer must use REMOTE
The Chakra converter emits a MEM_LOAD_NODE from the first
layer's input_loc and a MEM_STORE_NODE from the last layer's
output_loc. Both must be REMOTE:{node_id} (CPU side): the
simulator models the request entering / leaving the NPU as a
host-side transfer.
This is why embedding_0 has input_loc=REMOTE:0 and sampler_*
has output_loc=REMOTE:0 in the example above.
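A trace-level sanity check for this rule could look like the following hypothetical helper, building on the LayerRow sketch above.

```python
# Sanity check for the REMOTE rule on the first and last layers.
def check_boundary_locs(rows: list[LayerRow]) -> None:
    if not rows[0].input_loc.startswith("REMOTE:"):
        raise ValueError(f"first layer input_loc must be REMOTE:{{node_id}}, "
                         f"got {rows[0].input_loc}")
    if not rows[-1].output_loc.startswith("REMOTE:"):
        raise ValueError(f"last layer output_loc must be REMOTE:{{node_id}}, "
                         f"got {rows[-1].output_loc}")
```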
Communication types
The comm_type field selects the collective ASTRA-Sim runs after
this layer:
| Value | Meaning | When emitted |
|---|---|---|
| NONE | No collective | Most layers |
| ALLREDUCE | All-reduce across the involved dim | After o_proj and down_proj (TP > 1) |
| ALLTOALL | All-to-all dispatch / combine | Around the MoE block (EP-aware) |
Dimension scoping
For multi-dimensional ASTRA-Sim topologies (DP+EP layouts), the
comm_type can include a dimension scope suffix:
| Suffix | Meaning |
|---|---|
| ALLREDUCE | Default; all dims involved |
| ALLREDUCE:1,0 | Dim 0 = involved (True), dim 1 = not (False); i.e., TP-only ALLREDUCE in a 2D [tp, dp] topology |
| ALLTOALL:0,1 | Dim 0 = not involved, dim 1 = involved; i.e., EP-only ALLTOALL across the DP group |
The Chakra converter parses these via _parse_comm_type and writes
the involved_dim BoolList into the .et file. ASTRA-Sim's
Workload::issue_comm() reads the BoolList and routes the collective
on the named dimensions.
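Conceptually, the suffix parsing amounts to the sketch below; the actual _parse_comm_type in llm_converter.py may differ in detail.

```python
# Conceptual sketch of dimension-scope parsing (not the real converter code).
def parse_comm_type(field: str) -> tuple[str, list[bool] | None]:
    """'ALLREDUCE:1,0' -> ('ALLREDUCE', [True, False]);
       'ALLREDUCE'     -> ('ALLREDUCE', None)  # all dims involved"""
    if ":" not in field:
        return field, None
    name, dims = field.split(":", 1)
    involved = [d == "1" for d in dims.split(",")]  # position i = dim i
    return name, involved
```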
Special markers
Some layers are wrapped by markers:
EXPERT {i} / EXPERT END (MoE)
Wrap the per-rank expert compute:
EXPERT 0
moe_expert_local_3_rank0 1842 LOCAL 524288 LOCAL 9437184 LOCAL 524288 ALLTOALL 524288 NONE
EXPERT END
EXPERT 1
moe_expert_local_3_rank1 1804 LOCAL 524288 LOCAL 9437184 LOCAL 524288 ALLTOALL 524288 NONE
EXPERT END
ASTRA-Sim runs each EXPERT {i} block on rank i in parallel,
synchronizing at the surrounding ALLTOALLs.
PIM {channel} / PIM END (PIM offload)
Wrap PIM-side attention compute:
PIM 0
pim_attention_3 4126 LOCAL 245760 LOCAL 0 LOCAL 245760 NONE 0 NONE
PIM END
Multiple PIM {channel} blocks can appear back-to-back to model
multi-channel parallel attention.
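Both marker kinds follow the same open/close shape, so they can be grouped with one scanner. The sketch below is illustrative, not the converter's actual block handling.

```python
# Illustrative scanner that groups EXPERT/PIM marker blocks.
# Plain rows outside any block get a marker of None.
def group_blocks(lines):
    """Yield (marker, rows); marker is None, ('EXPERT', i), or ('PIM', ch)."""
    current_marker, current_rows = None, []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(("EXPERT ", "PIM ")):
            kind, arg = stripped.split(maxsplit=1)
            if arg == "END":                      # close the open block
                yield (current_marker, current_rows)
                current_marker, current_rows = None, []
            else:                                 # open a new block
                current_marker = (kind, int(arg))
        elif current_marker is not None:
            current_rows.append(stripped)         # row inside a block
        else:
            yield (None, [stripped])              # plain row
```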
Sub-batch interleaving (misc)
When --enable-sub-batch-interleaving is on, layers carry a batch
tag in misc:
qkv_proj_3 4128 ... NONE 0 BATCH_1
pim_attention_3 8264 ... NONE 0 BATCH_2
o_proj_3 3845 ... NONE 0 BATCH_1
The BATCH_1 and BATCH_2 halves run in parallel, typically with GPU
compute on one half while PIM attention runs on the other.
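To see which rows form each half, one can bucket rows by their misc tag. This is a hypothetical helper, reusing the LayerRow sketch from earlier.

```python
# Split rows into the interleaved halves by their misc BATCH tag.
from collections import defaultdict

def split_sub_batches(rows):
    halves = defaultdict(list)
    for row in rows:
        if row.misc.startswith("BATCH_"):
            halves[row.misc].append(row)
    return halves   # e.g. {"BATCH_1": [...], "BATCH_2": [...]}
```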
Sample full trace (single instance, TP=1, dense model)
COLOCATED model_parallel_NPU_group: 0
228
Layername comp_time input_loc input_size weight_loc weight_size output_loc output_size comm_type comm_size misc
embedding_0 5621 REMOTE:0 40 LOCAL 1050673152 LOCAL 81920 NONE 0 NONE
layernorm_0 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
qkv_proj_0 8324 LOCAL 81920 LOCAL 25165824 LOCAL 245760 NONE 0 NONE
rotary_emb_0 2104 LOCAL 245760 LOCAL 0 LOCAL 245760 NONE 0 NONE
attention_0 18327 LOCAL 245760 LOCAL 0 LOCAL 81920 NONE 0 NONE
o_proj_0 7452 LOCAL 81920 LOCAL 8388608 LOCAL 81920 NONE 0 NONE
... (decoder blocks 1..31 elided) ...
final_layernorm 1240 LOCAL 81920 LOCAL 8192 LOCAL 81920 NONE 0 NONE
lm_head 28341 LOCAL 81920 LOCAL 1050673152 LOCAL 2565120 NONE 0 NONE
sampler_291 25933 LOCAL 2565120 LOCAL 0 REMOTE:0 40 NONE 0 NONE
How the Chakra converter consumes this
The Chakra converter (astra-sim/extern/graph_frontend/chakra/src/converter/llm_converter.py)
walks the trace and emits Chakra protobuf nodes:
| Trace row | Chakra node |
|---|---|
| First layer | MEM_LOAD_NODE for the input transfer |
| Each compute row | COMP_NODE keyed by comp_time |
| Last layer | MEM_STORE_NODE for the output transfer |
| comm_type != NONE | COMM_COLL_NODE with optional involved_dim BoolList |
| EXPERT {i} block | Sub-graph run on rank i |
| PIM {channel} block | Sub-graph routed to the PIM device |
The .et file is what controller.write_flush then sends to
ASTRA-Sim.
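The mapping in the table reduces to a per-row classification like the sketch below. It returns node-kind names as strings rather than building the actual Chakra protobuf objects, and node_kind is a hypothetical name.

```python
# Simplified classification mirroring the table above.
def node_kind(index: int, total: int, row: LayerRow) -> list[str]:
    kinds = []
    if index == 0:
        kinds.append("MEM_LOAD_NODE")     # first layer's input transfer
    kinds.append("COMP_NODE")             # every compute row
    if row.comm_type != "NONE":
        kinds.append("COMM_COLL_NODE")    # collective after this layer
    if index == total - 1:
        kinds.append("MEM_STORE_NODE")    # last layer's output transfer
    return kinds
```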
Gotchas
- comp_time is nanoseconds in the trace, but the underlying profile CSVs use microseconds. The conversion happens in _load_perf_db() at simulator startup.
- Tab-separated, not space-separated. Mixing tabs and spaces breaks the Chakra parser silently.
- Don't hand-edit production traces. They're regenerated every iteration; manual edits get clobbered. To inject custom timings, modify the profile CSVs or the trace generator.
- comm_size is the total payload, not per-rank. ASTRA-Sim divides by the number of nodes in the ring internally.
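The tab gotcha in particular is easy to lint for before handing the trace to the converter. This is an illustrative check; lint_trace is not part of the codebase.

```python
# Quick lint for the tab gotcha: every layer row must split into
# exactly 11 fields on '\t'.
def lint_trace(path: str) -> None:
    with open(path) as f:
        lines = f.read().splitlines()
    for n, line in enumerate(lines[3:], start=4):   # skip 3 header lines
        if line.split(" ", 1)[0] in ("EXPERT", "PIM"):
            continue                                 # marker lines are exempt
        nfields = len(line.split("\t"))
        if nfields != 11:
            raise ValueError(f"line {n}: {nfields} tab fields (expected 11)")
```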
What's next
- Simulator → Trace generation: how each row is produced.
- Cluster config: placement rules determine weight_loc and kv_loc.