Coding conventions
A short checklist. Skim before opening a PR. None of these are arbitrary; each has bitten the project at least once.
Python style
- 4-space indentation, snake_case for functions/variables, PascalCase for classes. Match surrounding code in the file you're editing.
- No enforced formatter. Don't run black / ruff format on a whole file unless you're rewriting it. Style noise hides real diffs.
- Imports: keep minimal and consistent.
serving/modules use relative imports (from .scheduler import …). - English only in code, comments, log messages, and docstrings. Korean / other-language identifiers and comments will be flagged in review.
- Docstrings: optional. If you write one, make it a single line that explains why the function exists, not what it does. The signature already says what.
- No top-level prints. Use
serving/core/logger.py(already imported asloggerin most files):logger.info(...)logger.warning(...)logger.success(...) # Rich-styled green check
CLI flag conventions
- CLI flags use hyphens:
--cluster-config,--max-num-seqs,--enable-prefix-caching. - Internal Python uses underscores:
cluster_config,max_num_seqs,enable_prefix_caching. - Boolean flags use
BooleanOptionalActionso both--enable-Xand--no-enable-Xwork:parser.add_argument('--enable-prefix-caching',action=argparse.BooleanOptionalAction,default=True) - Match vLLM naming where applicable
(
--max-num-batched-tokens,--block-size,--kv-cache-dtype). Users coming from vLLM should not have to relearn.
File and config naming
- JSON config filenames: descriptive snake_case
(
single_node_pim_instance.json, notsingleNodePimInstance.json). - One config = one scenario. Don't reuse the same cluster JSON across unrelated examples; copy it.
- Don't commit machine-specific paths. All paths in code and configs must be relative to the repo root.
Things to never do
These each correspond to a real incident or strong project preference:
-
Don't add
getattr(request, 'attr', default)fallbacks forRequestattributes. Initialize all attributes inRequest.__init__and access directly. Fallbacks hide initialization bugs. -
Don't assume
hidden_size == num_heads * head_dim. Some models (Qwen3) violate this. Always:head_dim = config.get('head_dim', n_embd // n_head)q_dim = n_head * head_dim # NOT n_embdkv_dim = kv_head * head_dim # NOT n_embd // group -
Don't invent layer names. Every name the simulator emits must also appear in the architecture YAML's catalog. Canonical set:
qkv_proj,o_proj,gate_up_proj,act_fn,down_proj,rotary_emb,qk_norm,attention,layernorm,final_layernorm,embedding,lm_head,sampler,moe. -
Don't edit
astra-sim/unless the change targets simulator integration (Chakra converter,Workload.cc, input configs). Most contributions never touch this directory. -
Don't manually edit
astra-sim/inputs/*.json. Those files are regenerated byconfig_builder.pyon every run; your edits will be silently overwritten. -
Don't commit large generated files. Trace files,
outputs/*.csvfrom your local runs,.etprotobufs, and profiler bundle CSVs that exceed the gitignore patterns should stay local. The gitignore is set up; just don'tgit add -A. -
Don't use
--no-verifyto bypass pre-commit hooks. If a hook fails, fix the underlying issue. -
Don't add error handling for cases that can't happen. Trust internal invariants; only validate at the boundaries (CLI args, JSON config load, dataset parsing). Defensive programming inside
scheduler.pymakes the file unreadable. -
Don't add features beyond the task at hand. A bug fix doesn't need surrounding cleanup. Three similar lines is better than a premature abstraction.
-
Don't add comments explaining what the code does. The identifier names already do that. Comments are reserved for why something non-obvious is the way it is (a hidden invariant, a bug workaround, a citation to a paper).
Layer-name and unit reminders
These two trip up new contributors most often:
- Profiler CSVs store microseconds (
time_uscolumn). The simulator multiplies by 1000 and rounds to nanoseconds at load time. Don't divide twice. - Communication sizes for ASTRA-Sim are total (not per-NPU) bytes. ASTRA-Sim divides by ring size internally. If you pass per-NPU sizes, every collective will be N times too small.
Trace-format invariants
If you touch trace_generator.py or graph_generator.py:
- The first layer's
input_locand the last layer'soutput_locmust beREMOTE:{node_id}. The Chakra converter emits aMEM_LOADfrom the first and aMEM_STOREfrom the last; if either isLOCALwithout local memory configured, ASTRA-Sim crashes. - The sampler's
output_locis what feeds theMEM_STORE. Don't put it onlm_head.
Commit and PR style
The short version (full process is on PR workflow):
- Commit messages: short imperative one-liner.
- Good:
Fix incorrect evict_size accumulation,Add Qwen3 model support. - Bad:
fixes,update scheduler.py,WIP.
- Good:
- One logical change per commit. Don't bundle a refactor with a feature.
- PR description includes the validation command you ran, so the reviewer can rerun it.
What's next
- Validating your changes: how to prove the change actually works.
- PR workflow: branch model, attribution, review expectations.