CXL extended memory

What this demonstrates: modeling CXL memory devices as a third tier between NPU and CPU memory, with per-layer / per-block weight placement rules.

Compute Express Link (CXL) memory expansion lets you attach extra DRAM to a host over the PCIe-derived CXL.mem protocol. Bandwidth is lower and latency is higher than HBM (or even DDR5 DIMMs), but capacity can go to TB-scale, which makes it interesting for stretching memory budgets on huge models.

LLMServingSim models CXL as a separate memory tier with explicit placement rules: you decide which layer's weights and which KV blocks live on which CXL device.

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for meta-llama/Llama-3.1-8B

Cluster config

The cluster config is configs/cluster/single_node_cxl_instance.json; note the new top-level cxl_mem block plus the placement block on the instance:

configs/cluster/single_node_cxl_instance.json (excerpt)
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1,
          "placement": {
            "default": {
              "weights": "cxl:0",
              "kv_loc": "npu",
              "kv_evict_loc": "cpu"
            },
            "blocks": [
              { "blocks": "0-3",    "weights": "cxl:0" },
              { "blocks": "4-7",    "weights": "cxl:1" },
              { "blocks": "8,9,10", "weights": "cxl:2" },
              { "blocks": "11-23",  "weights": "cxl:3" },
              { "blocks": "24-31",  "weights": "cxl:0" }
            ],
            "layers": {
              "embedding": {"weights": "cxl:1"},
              "final_layernorm": {"weights": "cxl:2"},
              "lm_head": {"weights": "cxl:3"}
            }
          }
        }
      ]
    }
  ],
  "cxl_mem": {
    "mem_size": 1024,
    "mem_latency": 250,
    "mem_bw": 60,
    "num_devices": 4
  }
}

The two new pieces:

cxl_mem (top-level)

Field         Meaning
mem_size      Capacity per device, in GB
mem_bw        Bandwidth per device, in GB/s
mem_latency   Access latency, in ns
num_devices   Number of CXL devices (cxl:0 through cxl:N-1)
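
To get a feel for what these numbers imply, here is a minimal sketch (standalone Python, not simulator code) that turns the timing and bandwidth fields into a first-order transfer-time estimate:

from dataclasses import dataclass

@dataclass
class CxlDevice:
    # Illustrative stand-in; field names mirror the cxl_mem config block above.
    mem_size: float     # GB
    mem_bw: float       # GB/s
    mem_latency: float  # ns

    def transfer_ms(self, gigabytes: float) -> float:
        # First-order model: fixed access latency plus size / bandwidth.
        return self.mem_latency / 1e6 + gigabytes / self.mem_bw * 1e3

# The device from the example config: 1024 GB, 60 GB/s, 250 ns.
dev = CxlDevice(mem_size=1024, mem_bw=60, mem_latency=250)
print(f"{dev.transfer_ms(0.5):.1f} ms")  # ~8.3 ms to stream one ~0.5 GB block's weights

At these sizes the 250 ns access latency is negligible; the 60 GB/s bandwidth term dominates.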

placement (per-instance)

Tells the simulator where each layer's weights and each KV block live.

  • default applies to layers / blocks not explicitly mentioned.
  • blocks lists rules per group of decoder blocks; specs are dash- and comma-separated (e.g., "0-3", "8,9,10").
  • layers lists rules per named layer, using the canonical layer names (embedding, final_layernorm, lm_head).

Each rule sets:

  • weights: where the layer's weights live (npu, cpu, or cxl:<id>).
  • kv_loc: where active KV blocks live.
  • kv_evict_loc: where evicted KV blocks spill to.

The above config spreads the weights of all 32 decoder blocks across the 4 CXL devices (unevenly: cxl:0 ends up with 12 blocks and cxl:3 with 13), while keeping the KV cache on NPU.
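
To make the lookup order concrete, here is a minimal standalone sketch of how a resolver could map a decoder block index to its weight location. This is illustrative Python written against the config semantics above, not the simulator's actual code:

def parse_block_spec(spec: str) -> set[int]:
    # Expand a dash- and comma-separated spec like "0-3" or "8,9,10".
    blocks: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            blocks.update(range(int(lo), int(hi) + 1))
        else:
            blocks.add(int(part))
    return blocks

def resolve_weights(placement: dict, block_idx: int) -> str:
    # A matching "blocks" rule wins; otherwise fall back to "default".
    for rule in placement.get("blocks", []):
        if block_idx in parse_block_spec(rule["blocks"]):
            return rule["weights"]
    return placement["default"]["weights"]

placement = {
    "default": {"weights": "cxl:0"},
    "blocks": [
        {"blocks": "0-3", "weights": "cxl:0"},
        {"blocks": "4-7", "weights": "cxl:1"},
        {"blocks": "8,9,10", "weights": "cxl:2"},
        {"blocks": "11-23", "weights": "cxl:3"},
        {"blocks": "24-31", "weights": "cxl:0"},
    ],
}
print(resolve_weights(placement, 9))   # cxl:2
print(resolve_weights(placement, 31))  # cxl:0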

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_cxl_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/cxl_run.csv' \
--log-interval 1.0

Expected output

[INFO] step=10 batch=4 prompt_t=620 tok/s decode_t=180 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]
[INFO] step=11 batch=4 prompt_t=640 tok/s decode_t=190 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]

npu_mem is dramatically lower than in the NPU-only baseline because the weights live on CXL. cxl_mem is reported per device.
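
If you want to track those per-device numbers over a run, one option is to scrape them straight out of the log. A minimal sketch that matches the line format shown above (the log format itself is the only assumption):

import re

def parse_mem(line: str) -> tuple[float, list[float]]:
    # Extract NPU usage and per-CXL-device usage (GB) from one log line.
    npu = float(re.search(r"npu_mem=([\d.]+) GB", line).group(1))
    inner = re.search(r"cxl_mem=\[(.*?)\]", line).group(1)
    cxl = [float(x) for x in re.findall(r"([\d.]+) GB", inner)]
    return npu, cxl

line = "npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]"
print(parse_mem(line))  # (12.4, [3.2, 3.1, 3.1, 3.2])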

What's interesting

  • Weight memory becomes elastic. Llama-3.1-8B weights (~16 GB at bf16) no longer compete with KV cache for the NPU's 96 GB. The trade-off is decode TPOT: each weight load now pays the CXL round-trip (mem_latency: 250 ns) plus the bandwidth gap (60 GB/s vs HBM's 1597 GB/s).
  • Multiple CXL devices = striped bandwidth. The 4-device example approximates 4 × 60 = 240 GB/s of aggregate weight bandwidth, still far below HBM, but bigger weights now fit; see the back-of-envelope sketch after this list.
  • Per-layer placement is a real knob. Embedding and lm_head are bandwidth-heavy on every step; offloading them to CXL hurts more than offloading mid-decoder weights. The provided config is deliberately suboptimal in this respect to show what the rules look like; rebalance for your workload.
  • KV cache stays on NPU here, but you can also set kv_loc: "cxl:0" to put the KV cache in CXL, which is useful for very-long-context decode workloads at the cost of TPOT.
  • Prefix caching: the prefix pool can also live in CXL (--prefix-storage CXL).
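
As a back-of-envelope check on the bandwidth bullet above (plain Python; assumes decode is purely weight-bandwidth-bound, with perfect striping and no overlap with compute):

WEIGHT_GB = 16.0   # Llama-3.1-8B at fp16, roughly
HBM_BW = 1597.0    # GB/s, from the npu_mem config
CXL_BW = 60.0      # GB/s per device, from the cxl_mem config
NUM_CXL = 4

def ms_per_token(bw_gbs: float) -> float:
    # Time to stream the full weight set once per decode step.
    return WEIGHT_GB / bw_gbs * 1e3

print(f"HBM:        {ms_per_token(HBM_BW):6.1f} ms/token")            # ~10.0
print(f"1x CXL:     {ms_per_token(CXL_BW):6.1f} ms/token")            # ~266.7
print(f"4x striped: {ms_per_token(CXL_BW * NUM_CXL):6.1f} ms/token")  # ~66.7

Real TPOT will land somewhere between these bounds, since hot layers can stay on faster tiers and weight loads can overlap with compute.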

Where to learn more

  • Memory location enums (LOCAL, REMOTE, CXL, STORAGE) live in astra-sim/astra-sim/system/AstraMemoryAPI.hh and must match the Python side in serving/core/memory_model.py (a hypothetical mirror is sketched after this list).
  • The trace generator emits CXL:{id} location tags on each layer's weight_loc field. See Trace file format.
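
To illustrate what "must match" means, here is a hypothetical Python-side mirror of that enum. The member names come from the list above; the numeric values are placeholders, since the authoritative values live in the C++ header:

from enum import IntEnum

class MemLocation(IntEnum):
    # Hypothetical mirror of the C++ enum in AstraMemoryAPI.hh.
    # The values below are placeholders; keep them identical to
    # whatever the C++ header actually defines.
    LOCAL = 0
    REMOTE = 1
    CXL = 2
    STORAGE = 3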