Tensor parallel (TP)
What this demonstrates: sharding a single instance's dense weights across multiple GPUs and adding TP-ALLREDUCE collectives.
The simplest non-trivial parallelism. One serving instance, but the
weights are sliced across N GPUs: attention along the head dimension,
MLP along the intermediate dimension. Each attention o_proj and MLP
down_proj ends with an ALLREDUCE.
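To make the slicing concrete, here is a small illustrative sketch of a
Megatron-style TP=2 split of one transformer block, showing why exactly
o_proj and down_proj end with an ALLREDUCE. The dimensions are placeholders
chosen for readability, not Qwen3-32B's actual config:

# Illustrative sketch, not simulator code. Dimensions are placeholders.
TP = 2
hidden = 4096          # assumed hidden size
num_heads = 32         # assumed attention heads
head_dim = hidden // num_heads
intermediate = 11008   # assumed MLP intermediate size

# Column-parallel layers split the *output* dimension across GPUs;
# each GPU computes a disjoint slice, so no collective is needed.
q_proj_per_gpu  = (hidden, num_heads // TP * head_dim)
up_proj_per_gpu = (hidden, intermediate // TP)

# Row-parallel layers split the *input* dimension, so each GPU produces a
# partial sum of the full output -> one ALLREDUCE per layer to combine them.
o_proj_per_gpu    = (num_heads // TP * head_dim, hidden)
down_proj_per_gpu = (intermediate // TP, hidden)

for name, shape in [("o_proj", o_proj_per_gpu), ("down_proj", down_proj_per_gpu)]:
    print(f"{name}: per-GPU shard {shape}, then ALLREDUCE of a [tokens, {hidden}] activation")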
Prerequisites
- Simulator container set up (Simulator setup)
- Bundled RTXPRO6000 profile; no profiling needed for Qwen3-32B at TP=2.
Cluster config
configs/cluster/single_node_single_instance.json (lightly edited
from the bundled config to use Qwen/Qwen3-32B at tp_size=2):
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "Qwen/Qwen3-32B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "num_npus": 2,
          "tp_size": 2,
          "pd_type": null
        }
      ]
    }
  ]
}
The two fields that make this TP=2 instead of TP=1:
- num_npus: 2 assigns two GPUs to this instance.
- tp_size: 2 makes both GPUs participate in the tensor-parallel ALLREDUCE.
(With pp_size defaulted to 1, num_npus = tp_size * pp_size = 2
holds.)
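A quick way to sanity-check that invariant against the config file is a few
lines of Python (a hypothetical helper, not part of the simulator):

import json

# Hypothetical check: every instance must satisfy num_npus == tp_size * pp_size.
# pp_size may be omitted and defaults to 1, as in this config.
with open("configs/cluster/single_node_single_instance.json") as f:
    cluster = json.load(f)

for node in cluster["nodes"]:
    for inst in node["instances"]:
        tp = inst.get("tp_size", 1)
        pp = inst.get("pp_size", 1)
        assert inst["num_npus"] == tp * pp, (
            f"{inst['model_name']}: num_npus={inst['num_npus']} != tp*pp={tp * pp}"
        )
        print(f"{inst['model_name']}: TP={tp}, PP={pp}, num_npus={inst['num_npus']} OK")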
Run
python -m serving \
--cluster-config 'configs/cluster/single_node_single_instance.json' \
--dtype bfloat16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/tp2_run.csv' \
--log-interval 1.0
--dtype bfloat16 matches the profiled variant for Qwen3-32B.
Expected output
Throughput logs land roughly once per second:
[INFO] step=42 batch=8 prompt_t=1.2k tok/s decode_t=420 tok/s npu_mem=88.4 GB
[INFO] step=43 batch=8 prompt_t=1.1k tok/s decode_t=440 tok/s npu_mem=88.4 GB
outputs/tp2_run.csv has one row per finished request. The
canonical schema (12 columns, all times in ns) is documented under
Simulator → Reading the output:
instance id,request id,model,input,output,arrival,end_time,latency,queuing_delay,TTFT,TPOT,ITL
0,0,Qwen/Qwen3-32B,1472,133,4059740,178654321,174594581,3739551,4063716,1306794,"[1306794, ...]"
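Since the column names and units are fixed, a short post-processing sketch can
summarize a run. This assumes pandas is available and uses only the columns
shown above; times are converted from ns to ms:

import pandas as pd

df = pd.read_csv("outputs/tp2_run.csv")

ns_to_ms = 1e-6
print(f"requests finished: {len(df)}")
print(f"mean TTFT        : {df['TTFT'].mean() * ns_to_ms:.1f} ms")
print(f"p99  TTFT        : {df['TTFT'].quantile(0.99) * ns_to_ms:.1f} ms")
print(f"mean TPOT        : {df['TPOT'].mean() * ns_to_ms:.2f} ms/token")
print(f"mean latency     : {df['latency'].mean() * ns_to_ms:.1f} ms")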
What's interesting
- Per-GPU weight memory drops roughly in half vs. TP=1 because each GPU only holds half the weights. Watch npu_mem in the throughput log: at TP=2 on Qwen3-32B you should see roughly ~32 GB of weight load per GPU instead of ~64 GB (the sketch after this list works through the arithmetic).
- TTFT goes up slightly vs. TP=1 (when both fit), because each attention/MLP pair pays one ALLREDUCE round-trip. link_bw=16 GB/s in the config is the relevant knob; bumping it shrinks the TP collective cost.
- TPOT can go down when the model is memory-bound at TP=1: more memory bandwidth per token outweighs the collective overhead.
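A back-of-envelope sketch of the numbers behind these bullets (a standard
ring-allreduce approximation plus an assumed hidden size; not the simulator's
actual cost model):

# Rough estimates only; the hidden size below is an assumption, not read from the model.
params = 32e9            # Qwen3-32B: roughly 32B parameters
bytes_per_param = 2      # bfloat16
tp = 2

weights_per_gpu_gb = params * bytes_per_param / tp / 1e9
print(f"weight load per GPU at TP={tp}: ~{weights_per_gpu_gb:.0f} GB (vs ~64 GB at TP=1)")

# Ring allreduce moves ~2*(tp-1)/tp of the activation over the link, so each
# o_proj/down_proj collective costs roughly msg_bytes / link_bw plus latency.
hidden = 5120            # assumed hidden size for the message-size estimate
decode_tokens = 1        # one new token per request during decode
msg_bytes = decode_tokens * hidden * bytes_per_param
link_bw_gbps = 16        # link_bw from the cluster config, in GB/s
allreduce_us = 2 * (tp - 1) / tp * msg_bytes / (link_bw_gbps * 1e9) * 1e6
print(f"per-ALLREDUCE transfer at decode: ~{allreduce_us:.2f} us (before link_latency)")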
Related examples
- Pipeline parallel: split layers across GPUs instead of weights within a layer.
- Expert parallel: TP's MoE counterpart; shards experts across GPUs.
- Multi-instance LOAD routing: scale out by replicating whole TP groups instead of growing one instance.