Power modeling

What this demonstrates: turning on the per-node power model so the simulator emits live wattage in the throughput log and a per-component energy breakdown at the end of the run.

The power model is opt-in: a node only tracks power when its config includes a power: block. The bundled single_node_power_instance.json is a ready-to-run example.

Prerequisites

Simulator container set up
Bundled RTXPRO6000 profile for meta-llama/Llama-3.1-8B (no extra profiling needed)

Cluster config

configs/cluster/single_node_power_instance.json adds a power: block to the node alongside the usual instances:

configs/cluster/single_node_power_instance.json (excerpt)
{
  "num_nodes": 1,
  "link_bw": 16,
  "link_latency": 20000,
  "nodes": [
    {
      "num_instances": 1,
      "cpu_mem": {"mem_size": 512, "mem_bw": 256, "mem_latency": 0},
      "instances": [
        {
          "model_name": "meta-llama/Llama-3.1-8B",
          "hardware": "RTXPRO6000",
          "npu_mem": {"mem_size": 96, "mem_bw": 1597, "mem_latency": 0},
          "pd_type": null,
          "tp_size": 1
        }
      ],
      "power": {
        "base_node_power": 60,
        "npu": {
          "RTXPRO6000": {
            "idle_power": 35,
            "standby_power": 300,
            "active_power": 600,
            "standby_duration": 18
          }
        },
        "cpu":     {"idle_power": 10, "active_power": 200, "util": 0.15},
        "dram":    {"dimm_size": 32,  "idle_power": 2.0,   "energy_per_bit": 6.0},
        "link":    {"num_links": 1,   "idle_power": 5,     "energy_per_bit": 4.0},
        "nic":     {"num_nics": 1,    "idle_power": 20},
        "storage": {"num_devices": 2, "idle_power": 5}
      }
    }
  ]
}

The npu.<hardware> key looks up power coefficients by the instance's hardware field, so multi-hardware clusters list one entry per hardware type.

For the field-by-field schema (base_node_power, idle_power, standby_duration, energy_per_bit, ...), see Cluster config → power.

Run

python -m serving \
  --cluster-config 'configs/cluster/single_node_power_instance.json' \
  --dtype float16 --block-size 16 \
  --dataset 'workloads/example_trace.jsonl' \
  --output 'outputs/power_run.csv' \
  --log-interval 1.0

No new CLI flag is needed. The presence of the power: block in the cluster config is the trigger; remove the block for a baseline run that doesn't track power.

Expected output

The throughput log gains a power= field (in watts):

[INFO] step=42 batch=8 prompt_t=1.2k tok/s decode_t=420 tok/s
       npu_mem=88.4 GB power=712 W
[INFO] step=43 batch=8 prompt_t=1.1k tok/s decode_t=440 tok/s
       npu_mem=88.4 GB power=698 W

power is the instantaneous total node power summed across NPU / CPU / DRAM / link / NIC / storage / base.

When the run ends, the simulator prints a per-component energy breakdown:

─────── Power summary (node 0) ───────
   NPU active     :   12,453 J  (78%)
   NPU standby    :    1,012 J   (6%)
   NPU idle       :       89 J   (1%)
   CPU            :    1,233 J   (8%)
   DRAM           :      442 J   (3%)
   Link           :      388 J   (2%)
   Base + NIC + storage : 332 J  (2%)
   ─────────────────────────────────
   Total energy   :   15,949 J

The breakdown is the actionable output. A run dominated by NPU active is compute-bound; one with significant NPU idle is under-utilized; one with disproportionate Link energy is ALLREDUCE-bound (worth checking when tp_size > 1).

What's interesting

Throughput vs. wattage trade-offs. Bumping --max-num-seqs raises throughput and NPU active / standby time together, but the slope differs by workload — energy-per-token improves on decode-heavy loads and degrades on prefill-heavy ones.
Standby vs. idle gap. standby_duration (ns after a kernel finishes) determines how often the NPU drops back to idle_power. Bursty workloads spend more time in idle; steady-state workloads stay in standby / active. NPU idle > NPU standby usually means the workload doesn't saturate the GPU.
Base-node power is constant. The host-side draw (base_node_power) doesn't depend on what the simulator is doing; it's the always-on overhead that energy-efficiency comparisons need to factor in.

Sub-batch interleaving — pairs cleanly with the power model. Overlapping PIM attention with GPU compute changes both throughput and the energy breakdown.
CXL memory — adding a cxl_mem device and per-device placement rules adds a cxl_mem=... field to the throughput log; the energy summary then includes CXL transfer energy.

Where to learn more

Simulator → Power model: per-component math, NPU state machine, and how standby_duration factors in.
The implementation lives in serving/core/power_model.py.

Prerequisites​

Cluster config​

Run​

Expected output​

What's interesting​

Related examples​

Where to learn more​

Prerequisites

Cluster config

Run

Expected output

What's interesting

Related examples

Where to learn more