CXL extended memory

What this demonstrates: modeling CXL memory devices as a third tier between NPU and CPU memory, with per-layer / per-block weight placement rules.

Compute Express Link (CXL) memory expansion lets you attach extra DRAM to a host over the PCIe-derived CXL.mem protocol. Bandwidth is lower and latency is higher than HBM (or even DDR5 DIMMs), but capacity can go to TB-scale, which makes it interesting for stretching memory budgets on huge models.

LLMServingSim models CXL as a separate memory tier with explicit placement rules: you decide which layer's weights and which KV blocks live on which CXL device.

Prerequisites

  • Simulator container set up
  • Bundled RTXPRO6000 profile for meta-llama/Llama-3.1-8B

Cluster config

The cluster config is configs/cluster/single_node_cxl_instance.json; note the new top-level cxl_mem block plus the placement block on the instance:

configs/cluster/single_node_cxl_instance.json (excerpt)
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1,
          "placement": {
            "default": {
              "weights": "cxl:0",
              "kv_loc": "npu",
              "kv_evict_loc": "cpu"
            },
            "blocks": [
              { "blocks": "0-3",    "weights": "cxl:0" },
              { "blocks": "4-7",    "weights": "cxl:1" },
              { "blocks": "8,9,10", "weights": "cxl:2" },
              { "blocks": "11-23",  "weights": "cxl:3" },
              { "blocks": "24-31",  "weights": "cxl:0" }
            ],
            "layers": {
              "embedding": {"weights": "cxl:1"},
              "final_layernorm": {"weights": "cxl:2"},
              "lm_head": {"weights": "cxl:3"}
            }
          }
        }
      ]
    }
  ],
  "cxl_mem": {
    "mem_size": 1024,
    "mem_latency": 250,
    "mem_bw": 60,
    "num_devices": 4
  }
}

The two new pieces:

cxl_mem (top-level)

Field         Meaning
mem_size      Capacity per device, in GB
mem_bw        Bandwidth per device, in GB/s
mem_latency   Access latency, in ns
num_devices   Number of CXL devices (cxl:0 through cxl:N-1)
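
To get a feel for what these numbers imply, here is a minimal sketch (standalone Python, not simulator code) that turns the timing and bandwidth fields into a first-order transfer-time estimate:

from dataclasses import dataclass

@dataclass
class CxlDevice:
    # Illustrative stand-in; field names mirror the cxl_mem config block above.
    mem_size: float     # GB
    mem_bw: float       # GB/s
    mem_latency: float  # ns

    def transfer_ms(self, gigabytes: float) -> float:
        # First-order model: fixed access latency plus size / bandwidth.
        return self.mem_latency / 1e6 + gigabytes / self.mem_bw * 1e3

# The device from the example config: 1024 GB, 60 GB/s, 250 ns.
dev = CxlDevice(mem_size=1024, mem_bw=60, mem_latency=250)
print(f"{dev.transfer_ms(0.5):.1f} ms")  # ~8.3 ms to stream one ~0.5 GB block's weights

At these sizes the 250 ns access latency is negligible; the 60 GB/s bandwidth term dominates.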

placement (per-instance)

Tells the simulator where each layer's weights and each KV block live.

  • default applies to layers / blocks not explicitly mentioned.
  • blocks lists rules per group of decoder blocks; specs are dash- and comma-separated (e.g., "0-3", "8,9,10").
  • layers lists rules per named layer, using the canonical layer names (embedding, final_layernorm, lm_head).

Each rule sets:

  • weights: where the layer's weights live (npu, cpu, or cxl:<id>).
  • kv_loc: where active KV blocks live.
  • kv_evict_loc: where evicted KV blocks spill to.

The above config spreads the weights of all 32 decoder blocks across the 4 CXL devices (unevenly: cxl:0 ends up with 12 blocks and cxl:3 with 13), while keeping the KV cache on NPU.
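
To make the lookup order concrete, here is a minimal standalone sketch of how a resolver could map a decoder block index to its weight location. This is illustrative Python written against the config semantics above, not the simulator's actual code:

def parse_block_spec(spec: str) -> set[int]:
    # Expand a dash- and comma-separated spec like "0-3" or "8,9,10".
    blocks: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            blocks.update(range(int(lo), int(hi) + 1))
        else:
            blocks.add(int(part))
    return blocks

def resolve_weights(placement: dict, block_idx: int) -> str:
    # A matching "blocks" rule wins; otherwise fall back to "default".
    for rule in placement.get("blocks", []):
        if block_idx in parse_block_spec(rule["blocks"]):
            return rule["weights"]
    return placement["default"]["weights"]

placement = {
    "default": {"weights": "cxl:0"},
    "blocks": [
        {"blocks": "0-3", "weights": "cxl:0"},
        {"blocks": "4-7", "weights": "cxl:1"},
        {"blocks": "8,9,10", "weights": "cxl:2"},
        {"blocks": "11-23", "weights": "cxl:3"},
        {"blocks": "24-31", "weights": "cxl:0"},
    ],
}
print(resolve_weights(placement, 9))   # cxl:2
print(resolve_weights(placement, 31))  # cxl:0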

Run

python -m serving \
--cluster-config 'configs/cluster/single_node_cxl_instance.json' \
--dtype float16 --block-size 16 \
--dataset 'workloads/example_trace.jsonl' \
--output 'outputs/cxl_run.csv' \
--log-interval 1.0

Expected output

[INFO] step=10 batch=4 prompt_t=620 tok/s decode_t=180 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]
[INFO] step=11 batch=4 prompt_t=640 tok/s decode_t=190 tok/s
npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]

npu_mem is dramatically lower than in the NPU-only baseline because the weights live on CXL. cxl_mem is reported per device.
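
If you want to track those per-device numbers over a run, one option is to scrape them straight out of the log. A minimal sketch that matches the line format shown above (the log format itself is the only assumption):

import re

def parse_mem(line: str) -> tuple[float, list[float]]:
    # Extract NPU usage and per-CXL-device usage (GB) from one log line.
    npu = float(re.search(r"npu_mem=([\d.]+) GB", line).group(1))
    inner = re.search(r"cxl_mem=\[(.*?)\]", line).group(1)
    cxl = [float(x) for x in re.findall(r"([\d.]+) GB", inner)]
    return npu, cxl

line = "npu_mem=12.4 GB cxl_mem=[3.2 GB, 3.1 GB, 3.1 GB, 3.2 GB]"
print(parse_mem(line))  # (12.4, [3.2, 3.1, 3.1, 3.2])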

What's interesting

  • Weight memory becomes elastic. Llama-3.1-8B weights (~16 GB at bf16) no longer compete with KV cache for the NPU's 96 GB. The trade-off is decode TPOT: each weight load now pays the CXL round-trip (mem_latency: 250 ns) plus the bandwidth gap (60 GB/s vs HBM's 1597 GB/s).
  • Multiple CXL devices = striped bandwidth. The 4-device example approximates 4 × 60 = 240 GB/s of aggregate weight bandwidth, still far below HBM, but bigger weights now fit; see the back-of-envelope sketch after this list.
  • Per-layer placement is a real knob. Embedding and lm_head are bandwidth-heavy on every step; offloading them to CXL hurts more than offloading mid-decoder weights. The provided config is deliberately suboptimal in this respect to show what the rules look like; rebalance for your workload.
  • KV cache stays on NPU here, but you can also set kv_loc: "cxl:0" to put the KV cache in CXL, which is useful for very-long-context decode workloads at the cost of TPOT.
  • Prefix caching: the prefix pool can also live in CXL (--prefix-storage CXL).
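
As a back-of-envelope check on the bandwidth bullet above (plain Python; assumes decode is purely weight-bandwidth-bound, with perfect striping and no overlap with compute):

WEIGHT_GB = 16.0   # Llama-3.1-8B at fp16, roughly
HBM_BW = 1597.0    # GB/s, from the npu_mem config
CXL_BW = 60.0      # GB/s per device, from the cxl_mem config
NUM_CXL = 4

def ms_per_token(bw_gbs: float) -> float:
    # Time to stream the full weight set once per decode step.
    return WEIGHT_GB / bw_gbs * 1e3

print(f"HBM:        {ms_per_token(HBM_BW):6.1f} ms/token")            # ~10.0
print(f"1x CXL:     {ms_per_token(CXL_BW):6.1f} ms/token")            # ~266.7
print(f"4x striped: {ms_per_token(CXL_BW * NUM_CXL):6.1f} ms/token")  # ~66.7

Real TPOT will land somewhere between these bounds, since hot layers can stay on faster tiers and weight loads can overlap with compute.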

Where to learn more

  • Memory location enums (LOCAL, REMOTE, CXL, STORAGE) live in astra-sim/astra-sim/system/AstraMemoryAPI.hh and must match the Python side in serving/core/memory_model.py (a hypothetical mirror is sketched after this list).
  • The trace generator emits CXL:{id} location tags on each layer's weight_loc field. See Trace file format.
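
To illustrate what "must match" means, here is a hypothetical Python-side mirror of that enum. The member names come from the list above; the numeric values are placeholders, since the authoritative values live in the C++ header:

from enum import IntEnum

class MemLocation(IntEnum):
    # Hypothetical mirror of the C++ enum in AstraMemoryAPI.hh.
    # The values below are placeholders; keep them identical to
    # whatever the C++ header actually defines.
    LOCAL = 0
    REMOTE = 1
    CXL = 2
    STORAGE = 3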