Novaprospect

Most AI features ship with an evaluation suite. The suite was carefully built, often by the team that built the feature, and it does a reasonable job of measuring whether the feature behaves the way the team intended. The suite is then run in a notebook, by hand, before launches, and almost never afterward.

That is the wrong place for it. Evals belong on the same merge gate as the unit tests. The reasons are unglamorous and mostly about who notices regressions and when.

What evals actually measure

An AI feature's evaluation suite is a set of input/output pairs, or input/expected-property pairs, that capture the behaviors the team cares about. Does the summarizer preserve the named entities? Does the classifier route the obvious cases correctly? Does the agent stop and ask when the prompt is ambiguous? Does the retrieval step return the relevant document in the top three?

These are not unit tests in the traditional sense. They are not deterministic. A given input may produce slightly different outputs across runs, and the eval scores reflect that. But across a sufficient sample, the suite produces a number, and that number moves when the system underneath changes.
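To make the idea of input/expected-property pairs concrete, here is a minimal sketch in Python. The names EvalCase, pass_rate, and fake_summarizer are hypothetical, and the summarizer is a stand-in so the example runs without a model; a real suite would call the feature itself and repeat each case because outputs vary across runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    category: str                 # e.g. "summarizer" or "retrieval"
    prompt: str                   # input sent to the system under test
    check: Callable[[str], bool]  # expected property of the output


def pass_rate(cases: list[EvalCase], system: Callable[[str], str], repeats: int = 3) -> float:
    """Fraction of (case, repeat) pairs whose output satisfies the property."""
    results = [case.check(system(case.prompt)) for case in cases for _ in range(repeats)]
    return sum(results) / len(results)


# Hypothetical stand-in for the real feature: keeps only the first sentence.
def fake_summarizer(text: str) -> str:
    return text.split(".")[0]


cases = [
    EvalCase("summarizer", "Ada Lovelace wrote the first program. She worked with Babbage.",
             lambda out: "Ada Lovelace" in out),
    EvalCase("summarizer", "The meeting moved to Tuesday. Bob will chair it.",
             lambda out: "Bob" in out),
]

# One number for the whole suite; this is the number that moves when the system changes.
score = pass_rate(cases, fake_summarizer)
print(f"pass rate: {score:.2f}")
```

The stand-in drops the second sentence, so the second case fails on every repeat. That is the kind of behavior the suite exists to surface.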

The number moves for boring reasons more often than for interesting ones. A model vendor updates the default version of an endpoint. A retrieval index gets reindexed. A prompt is edited by someone who did not realize the suite existed. A downstream library updates its tokenizer. Each of these can shift eval scores by a few percent. A few of these stacked together can shift them by ten.

If the suite is run by a data scientist in a notebook, on demand, before launches, none of these changes are caught at the moment they happen. They are caught at launch, weeks or months later, when the cumulative drift produces a result the team finds surprising. By then, the cause is mixed in with everything else that has changed, and the regression hunt is expensive.

Why CI is the right home

The eval suite belongs on the merge gate for the same reason the unit tests belong there: the moment to notice a regression is the commit that caused it. Catching a five-percent drop in retrieval recall at the PR that introduced it is a five-minute conversation. Catching the same drop after fifteen merges is an archaeology project.

This requires three things that most AI feature teams have not yet built.

First, the eval suite has to be runnable headlessly and reproducibly. No notebook cells. No "set this API key in the second cell." A command-line invocation that runs end to end and prints a structured result. This is mostly engineering work: the suite already exists and just needs to be wrapped.
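One possible shape for that wrapper, sketched in Python with only the standard library. Everything specific here is an assumption made for illustration: the EVAL_API_KEY variable name, the --tier and --min-score flags, the metric names, and the hard-coded scores inside run_suite, which a real wrapper would replace with a call into the existing suite.

```python
import argparse
import json
import os


def run_suite(tier: str) -> dict:
    """Placeholder: a real version would invoke the existing eval suite."""
    # Hypothetical scores, hard-coded so the sketch runs without network access.
    return {"retrieval_recall_at_3": 0.91, "classifier_accuracy": 0.88}


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Run the eval suite headlessly.")
    parser.add_argument("--tier", choices=["fast", "full"], default="fast")
    parser.add_argument("--min-score", type=float, default=0.85)
    args = parser.parse_args(argv)

    # Credentials come from the environment, not from a notebook cell.
    if "EVAL_API_KEY" not in os.environ:
        print(json.dumps({"error": "EVAL_API_KEY is not set"}))
        return 2

    scores = run_suite(args.tier)
    failed = sorted(name for name, value in scores.items() if value < args.min_score)
    # One JSON object on stdout, so the CI job can parse it or archive it.
    print(json.dumps({"tier": args.tier, "scores": scores, "failed": failed}, sort_keys=True))
    return 1 if failed else 0
```

Saved as a script, this would be wired up by passing sys.argv[1:] to main and exiting with its return value. The CI job then needs nothing beyond the exit code: 0 passes, 1 is an eval failure, 2 is a setup problem.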

Second, the suite has to run in a reasonable amount of time. CI eval runs that take two hours are not going to land on the merge gate, no matter how much the team agrees they should. The realistic path is a fast tier that runs on every PR and a comprehensive tier that runs nightly or on a tag. Most teams underestimate how aggressively the fast tier needs to be pared down. Twenty representative examples per category, run with deterministic seeds where possible, is plenty.
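A rough sketch of how the fast tier might be carved out of the full suite. It assumes each case is a dict with a "category" key and uses seeded random sampling as a stand-in for "representative"; a team might reasonably hand-pick the cases instead. The function name fast_tier and the case layout are hypothetical.

```python
import random
from collections import defaultdict


def fast_tier(cases: list[dict], per_category: int = 20, seed: int = 0) -> list[dict]:
    """Pick up to `per_category` cases from each category, reproducibly."""
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    rng = random.Random(seed)  # local generator: same seed and input give the same subset
    selected = []
    for category in sorted(by_category):  # visit categories in a stable order
        pool = by_category[category]
        selected.extend(rng.sample(pool, min(per_category, len(pool))))
    return selected


# Hypothetical full suite: two categories with fifty cases each.
full_suite = [{"category": c, "id": i} for c in ("summarizer", "retrieval") for i in range(50)]
subset = fast_tier(full_suite)
print(len(subset), "of", len(full_suite), "cases selected for the PR tier")
```

Fixing the seed keeps the subset identical from one PR to the next, so a score change reflects the change under review and not a different draw of examples.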

Third, the thresholds have to be set. A CI gate that fires on a one-percent score drop will be ignored within a week, because the noise floor is higher than that. A gate that fires on a fifteen-percent drop catches only the cases where the team would have caught the regression anyway. The useful threshold is the one calibrated against the suite's actual run-to-run variance, and it has to be revisited as the suite and the system evolve.
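One simple way to calibrate such a threshold, offered as a sketch and not a recommendation: run the suite several times on an unchanged system, measure the spread, and fail the gate only when a candidate score falls more than k standard deviations below the baseline mean. The baseline scores below are made up, and k = 3 is an assumption that presumes roughly bell-shaped noise.

```python
import statistics


def regression_floor(baseline_scores: list[float], k: float = 3.0) -> float:
    """Lowest score still treated as noise, given repeated runs of an unchanged system."""
    mean = statistics.mean(baseline_scores)
    noise = statistics.stdev(baseline_scores)  # sample standard deviation across runs
    return mean - k * noise


def gate_passes(candidate_score: float, baseline_scores: list[float], k: float = 3.0) -> bool:
    """True when the candidate is within run-to-run noise of the baseline."""
    return candidate_score >= regression_floor(baseline_scores, k)


# Hypothetical scores from five runs of the same commit.
baseline = [0.90, 0.92, 0.88, 0.91, 0.89]
floor = regression_floor(baseline)
print(f"gate fails below {floor:.3f}")
```

With these numbers the floor lands about five points below the mean, between the one-percent gate that fires constantly and the fifteen-percent gate that fires too late. Re-running the baseline after a suite or model change is what keeps the floor honest.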

What this changes upstream

Putting evals on the merge gate has a second-order effect that is more valuable than the first. It changes who feels accountable for eval health.

When evals live in a notebook owned by data science, eval failures are a data science problem. When they live in CI alongside the unit tests, eval failures are a problem for whoever broke the build. The engineer who updated the prompt sees the failed eval before the PR merges. The platform engineer who upgraded the model SDK sees the failed eval before the rollout. The ownership distributes naturally to the places where regressions originate, which is where the cheapest fixes live.

This is the same dynamic that moved unit tests from "QA's problem" to "the author's problem" twenty years ago, and the productivity argument is identical. The longer the feedback loop, the more expensive the fix. Evals have been outside that loop for the entire history of AI features. They do not need to stay there.

The hard part

The hard part is honesty. CI is a forcing function, and forcing functions surface uncomfortable truths. Many eval suites, on close inspection, do not actually measure what their teams claim they measure. Putting them on a merge gate makes that fact visible to everyone, not just to the data scientist who has been quietly working around it.

The right response is to fix the suite, not to abandon the gate. An eval that does not measure what it claims to measure is a problem regardless of where it runs. Surfacing it is a feature, not a bug. The teams that put their evals into CI tend to come out the other side with stronger evals, in part because the social cost of a weak eval suite is higher when everyone can see it.

Evals Belong in CI