
Cluster config explained

Every simulation in LLMServingSim is driven by one JSON file: a cluster config. It captures the entire hardware topology: how many nodes, how many instances per node, which GPU each instance runs on, how memory is laid out, and how the model is parallelized.

Once you understand this file, every example in this section is a small variation on the same shape.

The minimum viable config

This is configs/cluster/single_node_single_instance.json, the smallest config that runs:

```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {
        "mem_size": 512,
        "mem_bw": 256,
        "mem_latency": 0
      },
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {
            "mem_size": 96,
            "mem_bw": 1597,
            "mem_latency": 0
          },
          "num_npus": 1,
          "tp_size": 1,
          "pd_type": null
        }
      ]
    }
  ]
}
```

That's: one node, one instance, running Llama-3.1-8B on one RTXPRO6000 GPU with TP=1 (no parallelism).

The file has three nested levels. We'll walk through them top-down.

1. Top level, the cluster

```json
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [...]
}
```
| Field | Type | Meaning |
| --- | --- | --- |
| num_nodes | int | Number of physical nodes in the cluster |
| link_bw | float | Inter-node link bandwidth in GB/s |
| link_latency | float | Inter-node link latency in ns |
| nodes | array | One entry per node (length must match num_nodes) |

For multi-node setups (e.g., two boxes in a rack), set num_nodes: 2 and add a second node entry. Inter-node communication uses link_bw / link_latency.
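
A minimal sketch of the top level for two nodes. The link values are simply reused from the single-node example above and the node bodies are elided; the shipped dual_node_multi_instance.json is the full worked version:

```json
{
  "num_nodes": 2,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    { "num_instances": 1, "cpu_mem": { ... }, "instances": [ ... ] },
    { "num_instances": 1, "cpu_mem": { ... }, "instances": [ ... ] }
  ]
}
```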

Optional top-level fields:

| Field | Used for |
| --- | --- |
| cxl_mem | CXL memory expansion config (see CXL memory tier) |

2. Per-node level

```json
{
  "num_instances": 1,
  "cpu_mem": {
    "mem_size": 512,
    "mem_bw": 256,
    "mem_latency": 0
  },
  "instances": [...]
}
```
| Field | Type | Meaning |
| --- | --- | --- |
| num_instances | int | How many serving instances live on this node |
| cpu_mem.mem_size | float | Host CPU memory capacity (GB) |
| cpu_mem.mem_bw | float | CPU memory bandwidth (GB/s) |
| cpu_mem.mem_latency | float | CPU memory latency (ns) |
| instances | array | One entry per instance (length = num_instances) |

Optional per-node fields:

| Field | Used for |
| --- | --- |
| cpu_mem.pim_config | Name of a PIM device config in configs/pim/ (see PIM attention offload) |
| power | Power model coefficients (see Power modeling) |

3. Per-instance level

This is where the real work happens. An instance is one independent LLM serving replica: a model, a parallelism strategy, and a chunk of GPUs.

```json
{
  "model_name": "meta-llama/Llama-3.1-8B",
  "hardware": "RTXPRO6000",
  "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
  "num_npus": 1,
  "tp_size": 1,
  "pd_type": null
}
```

Required fields

| Field | Type | Meaning |
| --- | --- | --- |
| model_name | string | Hugging Face model id. Must match a config in configs/model/{model_name}.json |
| hardware | string | Hardware tag. Must match profiler/perf/{hardware}/ (e.g. RTXPRO6000, H100) |
| npu_mem | object | Per-GPU memory: mem_size (GB), mem_bw (GB/s), mem_latency (ns) |
| pd_type | string or null | "prefill", "decode", or null for combined prefill+decode |
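
For a prefill/decode split, pd_type is what distinguishes the two replicas. A minimal sketch, assuming one prefill and one decode instance on the same node (the shipped single_node_pd_instance.json is the worked version; the model, hardware, and memory values below are reused from the example above, not copied from that file):

```json
"instances": [
  {
    "model_name": "meta-llama/Llama-3.1-8B",
    "hardware": "RTXPRO6000",
    "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
    "num_npus": 1,
    "tp_size": 1,
    "pd_type": "prefill"
  },
  {
    "model_name": "meta-llama/Llama-3.1-8B",
    "hardware": "RTXPRO6000",
    "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
    "num_npus": 1,
    "tp_size": 1,
    "pd_type": "decode"
  }
]
```

The enclosing node would set num_instances: 2 so the array length matches.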

Parallelism fields (at least one required)

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| num_npus | int | inferred | Total GPUs for this instance, equals tp_size * pp_size |
| tp_size | int | inferred | Tensor-parallel degree |
| pp_size | int | 1 | Pipeline-parallel degree |
| ep_size | int | tp_size (MoE) / 1 (dense) | Expert-parallel degree |
| dp_group | string or null | null | Instances with the same string form a DP group and share experts via cross-instance ALLTOALL |

You only need to provide one of num_npus or tp_size; the other is inferred, as sketched below. So:

  • tp_size: 4 → num_npus = 4 * pp_size (PP defaults to 1, so 4)
  • num_npus: 4, pp_size: 2 → tp_size = 2
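
As JSON fragments (all other required fields elided), the two bullets above correspond to:

```json
{ "tp_size": 4 }
{ "num_npus": 4, "pp_size": 2 }
```

The first resolves to num_npus = 4; the second resolves to tp_size = 2.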

Parallelism rules to remember:

  • num_npus == tp_size * pp_size (always)
  • TP and EP share the same GPUs: dense layers do TP-ALLREDUCE, MoE layers do EP-ALLTOALL
  • Without dp_group: ep_size <= tp_size
  • With dp_group: EP can scale beyond a single instance's GPUs (see DP+EP example)
  • For MoE models: ep_size must divide num_local_experts (a sketch follows this list)
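
To ground these rules, here is a hedged sketch of a single MoE instance where EP stays inside the instance (ep_size <= tp_size). It assumes Qwen/Qwen3-30B-A3B as the Hugging Face id for the Qwen3-30B-A3B model discussed below (128 experts) and reuses illustrative hardware and memory numbers; the shipped single_node_moe_single_instance.json is the reference version:

```json
{
  "model_name": "Qwen/Qwen3-30B-A3B",
  "hardware": "RTXPRO6000",
  "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
  "num_npus": 2,
  "tp_size": 2,
  "ep_size": 2,
  "pd_type": null
}
```

With ep_size = 2 the 128 experts split 64 per EP rank, while dense layers still run TP-ALLREDUCE across the same two GPUs.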

Optional advanced fields

| Field | Used for |
| --- | --- |
| placement | Per-layer / per-block weight + KV-cache placement (see CXL memory) |

DP+EP, the topology that needs more explanation

When multiple instances share the same dp_group, they form a 2D ASTRA-Sim topology sized [tp_size, dp_group_size]. Collectives are scoped per dimension:

  • TP ALLREDUCE runs on dim 0 only (within an instance)
  • EP ALLTOALL runs on dim 1 only (across the DP group)

All instances in a DP group share one ASTRA-Sim process with wave-synchronized scheduling. MoE expert weights are sharded by ep_size: each instance holds num_local_experts / ep_size experts.

Concrete example: Qwen3-30B-A3B has 128 experts. With tp_size=1, ep_size=2, dp_group="A" and two instances, each holds 64 experts. Per-token activation crosses the DP group via ALLTOALL.

This is the DP+EP MoE example.
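
As a sketch of the instances array for that scenario (the shipped single_node_moe_dp_ep_instance.json is the worked version; Qwen/Qwen3-30B-A3B is assumed as the Hugging Face id, and the hardware and memory values are illustrative):

```json
"instances": [
  {
    "model_name": "Qwen/Qwen3-30B-A3B",
    "hardware": "RTXPRO6000",
    "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
    "num_npus": 1,
    "tp_size": 1,
    "ep_size": 2,
    "dp_group": "A",
    "pd_type": null
  },
  {
    "model_name": "Qwen/Qwen3-30B-A3B",
    "hardware": "RTXPRO6000",
    "npu_mem": { "mem_size": 96, "mem_bw": 1597, "mem_latency": 0 },
    "num_npus": 1,
    "tp_size": 1,
    "ep_size": 2,
    "dp_group": "A",
    "pd_type": null
  }
]
```

Because both instances name dp_group "A", the resulting ASTRA-Sim topology is [tp_size, dp_group_size] = [1, 2]: TP ALLREDUCE stays inside each instance and the EP ALLTOALL crosses the group.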

What config_builder.py does with this file

When you launch the simulator, serving/core/config_builder.py reads the cluster config and generates three ASTRA-Sim input files under astra-sim/inputs/:

| Generated file | Driven by |
| --- | --- |
| network/network.yml | link_bw, link_latency, [tp_size, dp_group_size] topology |
| system/system.json | Memory bandwidths, scheduling policy, per-dim collective implementations |
| memory/memory_expansion.json | CXL devices and any extended memory tiers |

You don't write these by hand; they're regenerated from the cluster config on every run.

Provided configs

The repo ships 13 worked configs under configs/cluster/. Each example in this section uses one of them:

| Config | Used by |
| --- | --- |
| single_node_single_instance.json | Tensor parallel (with tp_size=2) |
| single_node_multi_instance.json | Multi-instance LOAD routing |
| single_node_pd_instance.json | Prefill/decode split |
| single_node_moe_single_instance.json | Expert parallel |
| single_node_moe_dp_ep_instance.json | DP+EP MoE |
| single_node_cxl_instance.json | CXL memory tier |
| single_node_pim_instance.json | PIM attention offload |
| single_node_power_instance.json | Power modeling |
| dual_node_multi_instance.json | Multi-node setups |
| ... | ... |

What's next

Now that you can read a cluster config, pick an example and see how the same shape produces very different topologies: