CXL extended memory
What this demonstrates: modeling CXL memory devices as a third tier between NPU and CPU memory, with per-layer / per-block weight placement rules.
Compute Express Link (CXL) memory expansion lets you attach extra DRAM to a host over the PCIe-derived CXL.mem protocol. Bandwidth is lower and latency is higher than HBM (or even DDR5 DIMMs), but capacity can go to TB-scale, which makes it interesting for stretching memory budgets on huge models.
LLMServingSim models CXL as a separate memory tier with explicit placement rules: you decide which layer's weights and which KV blocks live on which CXL device.
Prerequisites
- Simulator container set up
- Bundled RTXPRO6000 profile for `meta-llama/Llama-3.1-8B`
Cluster config
`configs/cluster/single_node_cxl_instance.json`: note the new top-level `cxl_mem` block plus the `placement` block on the instance:
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1,
          "placement": {
            "default": {
              "weights": "cxl:0",
              "kv_loc": "npu",
              "kv_evict_loc": "cpu"
            },
            "blocks": [
              {"blocks": "0-3",    "weights": "cxl:0"},
              {"blocks": "4-7",    "weights": "cxl:1"},
              {"blocks": "8,9,10", "weights": "cxl:2"},
              {"blocks": "11-23",  "weights": "cxl:3"},
              {"blocks": "24-31",  "weights": "cxl:0"}
            ],
            "layers": {
              "embedding": {"weights": "cxl:1"},
              "final_layernorm": {"weights": "cxl:2"},
              "lm_head": {"weights": "cxl:3"}
            }
          }
        }
      ]
    }
  ],
  "cxl_mem": {
    "mem_size": 1024,
    "mem_latency": 250,
    "mem_bw": 60,
    "num_devices": 4
  }
}
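A config like this is easy to get wrong, so it can help to sanity-check that every `cxl:<id>` a placement rule references is actually backed by a device in `cxl_mem`. A minimal standalone sketch (`referenced_cxl_ids` is a hypothetical helper, not a simulator API):

```python
# Hypothetical config check, not part of the simulator: collect every
# cxl:<id> referenced by placement rules and confirm the top-level
# cxl_mem block provides at least that many devices.

def referenced_cxl_ids(placement: dict) -> set[int]:
    rules = [placement["default"],
             *placement.get("blocks", []),
             *placement.get("layers", {}).values()]
    ids = set()
    for rule in rules:
        for value in rule.values():
            # Only location strings like "cxl:2" matter; range strings
            # such as "0-3" and locations like "npu"/"cpu" are skipped.
            if isinstance(value, str) and value.startswith("cxl:"):
                ids.add(int(value.split(":", 1)[1]))
    return ids

# Placement block copied from the config above.
placement = {
    "default": {"weights": "cxl:0", "kv_loc": "npu", "kv_evict_loc": "cpu"},
    "blocks": [
        {"blocks": "0-3",    "weights": "cxl:0"},
        {"blocks": "4-7",    "weights": "cxl:1"},
        {"blocks": "8,9,10", "weights": "cxl:2"},
        {"blocks": "11-23",  "weights": "cxl:3"},
        {"blocks": "24-31",  "weights": "cxl:0"},
    ],
    "layers": {
        "embedding": {"weights": "cxl:1"},
        "final_layernorm": {"weights": "cxl:2"},
        "lm_head": {"weights": "cxl:3"},
    },
}
num_devices = 4  # from the cxl_mem block
assert max(referenced_cxl_ids(placement)) < num_devices
```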
The two new pieces:
`cxl_mem` (top-level)
| Field | Meaning |
|---|---|
| `mem_size` | Capacity per device in GB |
| `mem_bw` | Bandwidth per device in GB/s |
| `mem_latency` | Access latency in ns |
| `num_devices` | How many CXL devices (`cxl:0` through `cxl:N-1`) |
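Note that the fields are per device; aggregates follow from multiplying by `num_devices`. Plain arithmetic on the config values above (nothing here is a simulator API):

```python
# Per-device fields from the cxl_mem block above; aggregates assume
# reads stripe evenly across all devices.
cxl_mem = {"mem_size": 1024, "mem_latency": 250, "mem_bw": 60, "num_devices": 4}

total_capacity_gb = cxl_mem["mem_size"] * cxl_mem["num_devices"]
peak_striped_bw = cxl_mem["mem_bw"] * cxl_mem["num_devices"]

print(total_capacity_gb)  # 4096 GB across cxl:0..cxl:3
print(peak_striped_bw)    # 240 GB/s when striped across all four devices
```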
`placement` (per-instance)
Tells the simulator where each weight and each KV block lives.
- `default` applies to layers / blocks not explicitly mentioned.
- `blocks` lists rules per range of decoder blocks (e.g., `"0-3"`, `"4-7"`, `"8,9,10"`, `"11-23"`: comma- and dash-separated).
- `layers` lists rules per named layer (e.g., `embedding`, `final_layernorm`, `lm_head`: the canonical layer names).
Each rule sets:
- `weights`: `npu`, `cpu`, or `cxl:<id>`; where the layer's weights live.
- `kv_loc`: where active KV blocks live.
- `kv_evict_loc`: where evicted KV blocks spill to.
The above config spreads the 32 decoder blocks' weights across the 4 CXL devices (unevenly: 12 blocks on `cxl:0`, 4 on `cxl:1`, 3 on `cxl:2`, 13 on `cxl:3`) while keeping KV cache on NPU.
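To make the fallback semantics concrete, here is a small resolution sketch (`weights_location` and `_expand` are hypothetical helpers mirroring, not reproducing, the simulator's logic): an explicit `blocks` rule wins, otherwise `default` applies.

```python
# Hypothetical placement resolution: first matching "blocks" rule wins,
# otherwise fall back to the "default" rule.

def _expand(spec: str) -> set[int]:
    """Expand a comma- and dash-separated range string ("0-3", "8,9,10")."""
    idx = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            idx.update(range(int(lo), int(hi) + 1))
        else:
            idx.add(int(part))
    return idx

def weights_location(placement: dict, block: int) -> str:
    for rule in placement.get("blocks", []):
        if block in _expand(rule["blocks"]):
            return rule["weights"]
    return placement["default"]["weights"]

# Placement block copied from the config above.
placement = {
    "default": {"weights": "cxl:0", "kv_loc": "npu", "kv_evict_loc": "cpu"},
    "blocks": [
        {"blocks": "0-3",    "weights": "cxl:0"},
        {"blocks": "4-7",    "weights": "cxl:1"},
        {"blocks": "8,9,10", "weights": "cxl:2"},
        {"blocks": "11-23",  "weights": "cxl:3"},
        {"blocks": "24-31",  "weights": "cxl:0"},
    ],
}
print(weights_location(placement, 9))                     # cxl:2, via "8,9,10"
print(weights_location({"default": {"weights": "npu"}}, 9))  # npu, via default
```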
Run
python -m serving \
--cluster-config 'configs/cluster/single_node_cxl_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/cxl_run.csv' \
--log-interval 1.0
Expected output
[INFO] step=10 batch=4 prompt_t=620 tok/s decode_t=180 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]
[INFO] step=11 batch=4 prompt_t=640 tok/s decode_t=190 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]
`npu_mem` is dramatically lower than the NPU-only baseline because weights live on CXL; `cxl_mem` is reported per-device.
What's interesting
- Weight memory becomes elastic. Llama-3.1-8B weights (~16 GB at bf16) no longer compete with KV cache for the NPU's 96 GB. The trade-off is decode TPOT: each weight load now pays the CXL round-trip (`mem_latency`: 250 ns) plus the bandwidth gap (60 GB/s vs HBM's 1597 GB/s).
- Multiple CXL devices = striped bandwidth. The 4-device example approximates 4 × 60 = 240 GB/s aggregate weight bandwidth, still far below HBM, but bigger weights now fit.
- Per-layer placement is a real knob. Embedding and `lm_head` are bandwidth-heavy on every step; offloading them to CXL hurts more than offloading mid-decoder weights. The provided config is deliberately suboptimal in this respect to show what the rules look like; rebalance for your workload.
- KV cache stays on NPU here, but you can also model `kv_loc: "cxl:0"` to put KV cache in CXL, useful for very-long-context decode workloads at the cost of TPOT.
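The bandwidth-gap point can be made concrete with back-of-envelope arithmetic. This sketch assumes each decode step streams the full ~16 GB of weights once, perfectly striped across devices, with no caching or compute overlap; real simulated numbers will differ:

```python
# Rough lower bound on weight-streaming time per decode step,
# using the bandwidth figures from the config above.
weights_gb = 16.0   # Llama-3.1-8B at bf16, roughly
hbm_bw = 1597.0     # GB/s, from npu_mem
cxl_bw = 60.0       # GB/s per CXL device
num_devices = 4

hbm_ms = weights_gb / hbm_bw * 1e3
cxl_ms = weights_gb / (num_devices * cxl_bw) * 1e3

print(f"full weight pass from HBM:         {hbm_ms:.1f} ms")  # 10.0 ms
print(f"full weight pass from striped CXL: {cxl_ms:.1f} ms")  # 66.7 ms
# The 250 ns per-access latency is negligible next to the bandwidth gap
# for large sequential weight reads.
```

So even with perfect 4-way striping, weight streaming alone puts decode steps roughly 6-7x behind the HBM baseline, which is why per-layer rebalancing matters.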
Related examples
- Prefix caching: the prefix pool can also live in CXL (`--prefix-storage CXL`).
Where to learn more
- Memory location enums (`LOCAL`, `REMOTE`, `CXL`, `STORAGE`) live in `astra-sim/astra-sim/system/AstraMemoryAPI.hh` and must match the Python side in `serving/core/memory_model.py`.
- The trace generator emits `CXL:{id}` location tags on each layer's `weight_loc` field. See Trace file format.