# CI Performance Policy

This repo tracks GitHub Actions wall time so CI optimization work is driven by
trend data, not one-off slow runs. Use this policy when deciding whether to
split, rebalance, or otherwise optimize CI jobs.

## Current Posture

Stop active CI-splitting work when the required test jobs are already in the
same rough band. As a default, stop when:

- The top required test jobs are within about 20-30% of each other.
- The slowest required test job is around 2 minutes.
- The expected critical-path win is under about 30 seconds.
- The proposed split adds comparable maintenance cost: more matrix entries,
  ports, artifacts, sharding rules, or performance baselines.

At that point, keep the timing instrumentation and wait for a concrete trigger
instead of continuing to split jobs proactively.

## Revisit Triggers

Revisit CI wall-time optimization when at least one of these holds across
normal runs:

- A required non-deploy job is over 3 minutes.
- One required non-deploy job is more than 50% slower than comparable jobs and
  at least 30 seconds slower in absolute terms.
- Required non-deploy checks take more than 8 minutes from first start to last
  completion.
- The same job repeatedly appears as `OVER` or `CLOSE` in Performance Check,
  rather than as a one-run fluctuation.
- New tests clearly cluster in one shard or suite and make it consistently
  heavier.

Performance Check prints a non-blocking `CI Wall-Time Revisit Signals` section
for the first three triggers. Treat it as a prompt to inspect the data, not as a
failure by itself.

## How To Respond

1. Start from the latest completed `main` run and its Performance Check log.
2. Prefer timing artifacts and repeated runs over a single outlier.
3. First look for a low-maintenance rebalance, such as moving a heavy test file
   between existing shards.
4. Split a job only when the boundary is already clear and the split preserves
   local developer workflows.
5. If Performance Check asks for a `NEW_PERF_BASELINE`, make sure the metric is
   understood and note whether it is related to the CI change.

Good CI optimization PRs should reduce critical-path wall time without making
the workflow harder to reason about.

## Pulling Timing Data

The labs repository is public, so the GitHub Actions REST API returns run, job,
and per-step timings unauthenticated — no `gh` or token needed. Logs and
artifacts do need an admin token, so the per-test timings in the `test-timing-*`
artifacts are not reachable this way; measure those locally.

Jobs and steps for a run:
`GET /repos/commontoolsinc/labs/actions/runs/<run-id>/jobs?per_page=100` — each
job and step carries `started_at` and `completed_at`.

## Coverage Debt Baselines

Performance Check also tracks coverage debt as uncovered source lines. See
[COVERAGE.md](COVERAGE.md) for how that coverage is collected and which CI job
measures which code. Coverage debt uses a latest-main ratchet for source groups
changed by the PR: any increase in a changed group fails unless the PR
explicitly accepts it. Debt metrics for unchanged groups are still reported, but
they do not block the PR.

Use the narrow per-metric form when a PR intentionally increases one coverage
debt metric:

```text
NEW_PERF_BASELINE: coverage-debt: packages/runner uncovered lines = 123 lines
```

Use the broad reset marker only to bootstrap coverage data for the first time,
or when the upstream coverage baseline is known to be bogus and should be
re-seeded for one cycle:

```text
NEW_COVERAGE_BASELINE
```

When that PR merges, the main run's coverage metrics become the new ratchet
baseline for later PRs. Performance Check still requires the full expected
coverage artifact set during that reset cycle. Jobs with no reportable covered
files upload an empty LCOV report so missing artifacts mean the report upload
itself failed.