Skip to main content

ShareGPT generator

ShareGPT is the de facto standard inference benchmark, a curated dataset of real human ↔ ChatGPT conversations spanning a wide range of prompt lengths and use cases. The bundled generator turns ShareGPT (or any compatible Hugging Face text dataset) into the JSONL format the simulator consumes, with proper tokenization for prefix caching.

Quick run

From the vLLM Docker container at /workspace:

python -m workloads.generators sharegpt \
--model meta-llama/Llama-3.1-8B \
--source shibing624/sharegpt_gpt4 \
--num-reqs 300 --sps 10 --seed 42 \
--output workloads/sharegpt-llama-3.1-8b-300-sps10.jsonl

That produces a flat-format workload with 300 requests arriving at 10 sessions/second on average, tokenized with the Llama-3.1-8B tokenizer.

workloads/examples/ contains ready-to-edit templates for the bundled models, copy and tweak:

ls workloads/examples/
# gen-llama-3.1-8b.sh
# gen-qwen3-30b-a3b.sh
# gen-qwen3-32b.sh

Why use the vLLM container

The generator imports transformers for tokenization and (optionally) vllm for free-generation mode. Both are pre-installed in the vLLM Docker image. Run from inside scripts/docker-vllm.sh and you don't need to manage Python deps yourself.

For gated models (Llama 3.x, etc.), set HF_TOKEN before launching the container, see Installation → vLLM setup.

Options, grouped

Source and model

FlagDefaultMeaning
--model(required)HuggingFace model id; used for tokenization (and optionally free-generation)
--sourceshibing624/sharegpt_gpt4HF dataset id or local path. Any dataset with conversations field works

Sampling

FlagDefaultMeaning
--num-reqs(required)How many requests / sessions to emit
--sps(required)Sessions per simulated second (Poisson arrival)
--seed42RNG seed for sampling and arrival times
--first-arrival-sec0Offset for the first request's arrival time

Length filters

Drop requests outside these ranges from the source dataset:

FlagDefaultMeaning
--min-input-toks0Minimum prompt tokens (after tokenization)
--max-input-toks16384Maximum prompt tokens
--min-output-toks0Minimum output tokens
--max-output-toks16384Maximum output tokens
--max-kv-toks16384Cap input + output tokens (KV-cache footprint)
--max-sessions5000Cap the number of source sessions sampled before filtering

A reasonable starting point: --min-input-toks 256 --min-output-toks 512 filters out very short conversations that aren't representative of real serving traffic.

Fixed-length mode

For controlled stress tests, fix the prompt and output lengths:

FlagDefaultMeaning
--fix-lenoffEnable fixed-length mode
--fix-input-length128Prompt tokens
--fix-output-length512Output tokens

In this mode, the generator still pulls real conversations from the source dataset for prefix-cache realism, but truncates / pads each to the fixed lengths.

Pulse arrival pattern

A burst-mode arrival pattern that approximates the "everyone hits the API at the top of the hour" production phenomenon:

FlagDefaultMeaning
--pulseoffEnable pulse mode
--pulse-n10Number of requests per pulse
--pulse-delay-sec60Time between pulses
--pulse-poissonoffWithin each pulse, use Poisson arrivals at the configured --sps instead of all-at-once

Without --pulse-poisson, pulse arrivals all fire at the start of each pulse window, useful for testing the simulator's burst-handling behavior.

vLLM free-generation mode (optional)

Instead of using the source dataset's response field for output tokens, regenerate outputs with vLLM. This produces outputs matching the model you'll run in the simulator:

FlagDefaultMeaning
--use-vllmoffUse vLLM to free-generate outputs
--vllm-tp1TP degree for vLLM
--vllm-dtypebfloat16vLLM weight dtype

With --use-vllm:

  1. The prompt is taken from ShareGPT.
  2. vLLM generates a fresh response with that model.
  3. Both prompt and response are tokenized; input_tok_ids and output_tok_ids are populated.

Without --use-vllm:

  • Both prompt and response come from the ShareGPT entry as text.
  • Only the prompt is re-tokenized with --model's tokenizer for input_tok_ids.

Use --use-vllm when you specifically want output token IDs to match what the model would actually produce. For most simulator runs this isn't needed (the simulator doesn't generate text, it just counts tokens), but it's useful for downstream evaluation or if you want fully self-consistent traces.

Output format

The generator writes one JSONL line per request:

{"input_toks": 1472, "output_toks": 133, "arrival_time_ns": 4059740, "input_tok_ids": [...], "output_tok_ids": [...]}

Always flat format: ShareGPT entries don't have dependency chains. For agentic workloads see Agentic sessions.

The output filename convention is sharegpt-<model-short>-<n>-sps<rate>.jsonl (matches the bundled files).

Tips

  1. Tokenize with the same model the simulator will run. Otherwise the prefix-cache hit rate in the simulator won't match what production would see. The bundled JSONL files use this convention and are paired with their respective models.
  2. --max-sessions caps the source sample, not the output. Increase it if you're applying tight length filters and not getting enough surviving requests. Default 5000 is enough for most --num-reqs values.
  3. Pulse mode is great for sanity tests. A clean burst pattern exposes scheduler behavior that smooth Poisson arrivals can hide (queue buildup, fairness, head-of-line blocking).
  4. Generation speed. Without --use-vllm, generation is tokenization-bound and finishes in seconds. With --use-vllm, you pay real vLLM inference cost, minutes to hours depending on --num-reqs. Cache the output JSONL.
  5. Reuse JSONL across simulator runs. Generate once, simulate many times. The file is small (~MB) and self-contained.

What's next

  • JSONL format: schema reference for what the generator produces.
  • Agentic sessions: for closed-loop workloads. The ShareGPT generator only produces flat workloads.