Continuous batching
The scheduler is the heart of each serving instance. Every iteration
of the main loop calls scheduler.schedule(current, sys) and gets
back a Batch (or None). The scheduler enforces the same
constraints vLLM does: token budget, sequence count cap, and
optionally chunked prefill. This page walks through the rules.
Need the configuration knobs? See Reference → CLI flags for the flag list. This page explains what each flag does internally.
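For orientation, here is a minimal sketch of that loop shape. Only the scheduler.schedule(current, sys) call and the Batch-or-None return come from this page; the sys.issue / sys.advance helpers and the stopping rule are simplified placeholders (the real exit conditions are covered below).

```python
# Minimal sketch, not the real main loop: only schedule(current, sys) -> Batch | None
# is taken from this page; sys.issue() and sys.advance() are hypothetical stand-ins.
def run(scheduler, router, sys, current=0.0):
    while True:
        batch = scheduler.schedule(current, sys)     # may return None
        if batch is not None:
            sys.issue(batch)                         # hypothetical: hand the batch to ASTRA-Sim
        elif not router.has_pending_requests():
            break                                    # simplified stop; see "Where the scheduler stops"
        current = sys.advance()                      # hypothetical: advance simulated time
```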
Two scheduling paths
Depending on whether --enable-prefix-caching is on (default), the
scheduler takes one of two code paths inside serving/core/scheduler.py:
| Flag | Method | What changes |
|---|---|---|
| --no-enable-prefix-caching | schedule_base | Pure token-budget scheduler. Each request always runs from token 0. |
| --enable-prefix-caching (default) | schedule_with_prefix | Same, plus a RadixCache lookup that returns hit_len per request. |
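A minimal sketch of that dispatch, assuming the flag lands on the scheduler as a boolean attribute; the two method names come from the table above, everything else is illustrative:

```python
# Illustrative dispatch only; the attribute name and class shape are assumptions.
class Scheduler:
    def __init__(self, enable_prefix_caching=True):
        self.enable_prefix_caching = enable_prefix_caching

    def schedule(self, current, sys):
        if self.enable_prefix_caching:
            return self.schedule_with_prefix(current, sys)  # adds the RadixCache hit_len lookup
        return self.schedule_base(current, sys)             # every request runs from token 0
```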
In both paths, the constraints are the same:
- Sequence cap: len(batch) <= --max-num-seqs. Default 128. Set to 0 for unbounded.
- Token budget: sum(tokens_to_run_this_step) <= --max-num-batched-tokens. Default 2048.
- Per-request cap (chunked prefill): tokens_for_this_request_this_step <= --long-prefill-token-threshold. Default 0 = disabled.
schedule_with_prefix additionally maintains the per-instance
MemoryModel.npu_prefix_cache (RadixCache). Details on
Prefix caching.
What the scheduler picks each step
Conceptually, the loop is:
```python
remaining_token_budget = max_num_batched_tokens
batch = []
for request in queue:  # FIFO; prefill-first if --prioritize-prefill
    if len(batch) >= max_num_seqs:
        break
    needs_to_run = how_many_tokens_this_request_needs(request)
    cap = min(remaining_token_budget,
              long_prefill_token_threshold or remaining_token_budget,
              needs_to_run)
    if cap <= 0:
        continue  # try the next request
    schedule(request, tokens=cap)
    remaining_token_budget -= cap
    batch.append(request)
return Batch(batch) if batch else None
```
The "how many tokens this request needs" function differs by request state:
- Prefill, no chunk yet: input length minus any prefix cache hit.
- Prefill, mid-chunk: remaining prompt tokens.
- Decode: always 1.
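A minimal sketch of that per-state rule; the Request fields other than num_computed_tokens are illustrative names, not necessarily the simulator's actual attributes:

```python
# Sketch of "how many tokens this request needs"; field names other than
# num_computed_tokens are assumptions made for illustration.
def tokens_needed(req) -> int:
    if req.is_decode:
        return 1                                        # decode: one new token per step
    if req.num_computed_tokens == 0:
        return req.prompt_len - req.prefix_hit_len      # fresh prefill, minus any prefix cache hit
    return req.prompt_len - req.num_computed_tokens     # mid-chunk prefill: remaining prompt tokens
```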
Chunked prefill
--long-prefill-token-threshold N (or --enable-chunked-prefill
which sets a sensible default) lets the scheduler split a long prefill
across multiple iterations. Without it, a single 32k-token request
hogs the whole token budget, and TPOT for other in-flight requests
collapses.
Concretely, a request whose remaining prefill is 8000 tokens with
--long-prefill-token-threshold 1024 runs as eight chunks (seven of
1024 tokens plus a final 832-token chunk) across eight scheduler
iterations. The
Request.num_computed_tokens field tracks progress; on each
iteration the scheduler bumps it by however many tokens were just
processed.
Decode steps continue to run concurrently in the same batch; chunked prefill just keeps long prompts from monopolizing the token budget.
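The chunk arithmetic for the example above can be checked with a few lines of plain Python; this is only a worked example, not simulator code:

```python
# 8000 remaining prefill tokens with --long-prefill-token-threshold 1024:
# seven 1024-token chunks plus a final 832-token chunk, eight iterations total.
remaining, threshold, chunks = 8000, 1024, []
while remaining > 0:
    step = min(threshold, remaining)
    chunks.append(step)          # what the scheduler would run this iteration
    remaining -= step            # mirrors bumping Request.num_computed_tokens
print(chunks)                    # [1024, 1024, 1024, 1024, 1024, 1024, 1024, 832]
```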
Prefill priority
By default, prefill and decode requests share the same FIFO queue.
With --prioritize-prefill, the scheduler reorders the batch so all
prefill requests land first, then decodes fill the remaining budget.
This trades some TPOT for lower TTFT under bursty arrivals, which is useful in deployments where users care more about "first token latency" than steady-state generation rate.
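The reordering itself amounts to a stable sort that puts prefills first; a sketch, where the is_prefill field name is an assumption:

```python
# Stable sort: prefills first, decodes after, FIFO order preserved within each group.
def prioritize_prefill(queue):
    return sorted(queue, key=lambda req: not req.is_prefill)  # False (prefill) sorts before True
```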
Pipeline depth (PP)
For pp_size > 1 instances, the scheduler also keeps an inflight
list of batches currently traversing the pipeline. Its length is
capped at pp_size: when the pipeline is full, the scheduler
returns None until ASTRA-Sim drains a stage.
This makes the simulator's PP behavior match production training frameworks (e.g., Megatron) where micro-batches stream through the pipeline.
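A minimal sketch of the depth gate, assuming the inflight list lives on the scheduler and gets appended to when a batch is issued; only the pp_size cap and the None return come from this page:

```python
# Sketch of the pipeline-depth gate; the list handling details are assumptions.
def schedule_with_pp_gate(scheduler, current, sys):
    if len(scheduler.inflight) >= scheduler.pp_size:
        return None                              # pipeline full; wait for ASTRA-Sim to drain a stage
    batch = scheduler.schedule(current, sys)
    if batch is not None:
        scheduler.inflight.append(batch)         # removed again when the batch leaves the pipeline
    return batch
```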
Where the scheduler stops
The simulator exits when, simultaneously:
- Every scheduler returns None (no eligible requests).
- Router.has_pending_requests() returns False (no future arrivals).
- Router.has_deferred_sessions() returns False (no agentic sessions waiting on tool calls).
If only the third set (deferred sessions) is non-empty, the main loop fast-forwards current
to the next pending arrival time and resumes.
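As a sketch, the exit check boils down to three predicates; the batches-per-scheduler list is an assumption about how the main loop collects results:

```python
# Sketch of the exit condition; only the three predicates come from this page.
def all_done(batches, router):
    return (all(b is None for b in batches)          # every scheduler returned None
            and not router.has_pending_requests()    # no future arrivals
            and not router.has_deferred_sessions())  # no agentic sessions waiting on tool calls
```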
What the scheduler hands back
scheduler.add_done(npu_id, sys, current) is called once per
iteration when ASTRA-Sim reports completion. It returns:
(prompt_throughput, decode_throughput, finished_requests)
- prompt_throughput counts all input tokens, including prefix cache hits, matching vLLM's reporting (which also counts cached tokens).
- decode_throughput counts only newly generated tokens.
- finished_requests is the list of requests that completed during this iteration.
For prefill instances under P/D disaggregation, the main loop hands
finished_requests to router.transfer_prefill_request so the
decode instance picks them up.
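A sketch of how the main loop might consume that return value; only the tuple shape and router.transfer_prefill_request come from this page, while the instance.is_prefill check is an assumed name:

```python
# Sketch only: the tuple shape and transfer_prefill_request are from this page;
# the is_prefill flag on the instance is an assumption.
def on_batch_done(scheduler, router, instance, npu_id, sys, current):
    prompt_tp, decode_tp, finished = scheduler.add_done(npu_id, sys, current)
    if instance.is_prefill:                       # P/D disaggregation: hand finished prefills off
        for req in finished:
            router.transfer_prefill_request(req)  # decode instance picks them up
    return prompt_tp, decode_tp
```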
Gotchas
- Prefill plus prefix caching doesn't double-count: hit_len is subtracted from the tokens the scheduler actually runs, but added to prompt_throughput. So a 1000-token request with 600 tokens of prefix hit consumes 400 tokens of budget and reports 1000 tokens of prompt throughput (worked through in the sketch after this list).
- --max-num-seqs 0 means unlimited, not zero. Useful when you want pure token-budget gating, but watch memory.
- The token budget is shared across prefill + decode. A batch with 64 in-progress decodes and a 1500-token prefill chunk runs 1564 tokens this step. Decode contributions count.
- Pipeline parallelism currently affects scheduling only. Trace generation still emits the full model; PP scheduling depth prevents over-issuing batches but doesn't yet model inter-stage forwarding cost in detail. If this matters for your research, treat PP results as a lower bound.
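The accounting in the first and third gotchas, worked through as plain arithmetic (not simulator code):

```python
# Prefix hits reduce budget use but still count toward prompt throughput;
# the token budget is shared between prefill chunks and decode steps.
prompt_len, hit_len = 1000, 600
budget_used = prompt_len - hit_len               # 400 tokens of budget consumed
prompt_throughput = prompt_len                   # 1000 tokens reported

decodes, prefill_chunk = 64, 1500
tokens_this_step = decodes * 1 + prefill_chunk   # 1564 of the 2048 default budget
print(budget_used, prompt_throughput, tokens_this_step)
```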
What's next
- Prefix caching: what hit_len means and how the RadixCache decides it.
- KV cache & memory: how the scheduler knows when memory is full.