Eval-first is not a development practice, it is a design decision
At eMerge Americas this past April, I sat in on a panel hosted by Rohit Patel from Meta. The topic was what it takes to build a meaningful AI network. He framed it around three components. Interpretation. Structure. Parameters.
Then he talked about evals.
He defined an eval as a way to score a model for performance on a given capability. He said the order matters. Build the evals first. Baseline them. Then iterate on the agents. The eval is what tells you whether the agent did the thing you said it should do, and the eval has to exist before the agent does or the agent has nothing to iterate against.
After the panel I rebuilt how I design NORD around that ordering.
NORD v2 is a six-phase build. Phase 0 is the scaffold. Phase 1 is the foundation, broken into nine sub-phases. Phase 2 is enforcement. Phase 3 is memory. Phase 4 is multi-tenancy, telemetry, and onboarding. Phase 5 is the eval layer itself. Every one of those phases has a Definition of Done document. The DoD is written before any of the implementation work begins. It names what passing looks like, what failing looks like, and what the acceptance gate is going to check. Then the build starts.
The reason that ordering matters is not procedural. It is structural. An eval that defines the work cannot be skipped. An eval that is added after the work can be.
Phase 4 was the cleanest example I have run so far. Before any of the six deliverables for Phase 4 was written, NordSecurity ran an architecture review and produced 12 pre-build conditions: specific things the build had to satisfy before work could start on the schema and security foundation that the rest of the phase sits on. Every multi-tenant table has to have row-level security enabled at creation, not as a follow-up. The custom role claim has to be issued by the auth provider and verified inside the security policies, not assumed from the default. The fail-safe default has to reject unknown roles, not accept them.
All 12 conditions were resolved before the build was unblocked. None of them was negotiated down. The conditions were the eval. They were the score the build had to clear to be considered done.
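As a rough illustration only, here is a minimal Python sketch of two of those ideas: a fail-safe default that rejects any role it does not recognize, and a gate that unblocks the build only when every condition is resolved. It is not NORD's implementation. The role names, `KNOWN_ROLES`, `is_role_allowed`, `Condition`, and `build_unblocked` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical role names; the post does not list the actual roles.
KNOWN_ROLES = frozenset({"owner", "member"})


def is_role_allowed(role_claim):
    """Fail-safe default: a claim that is not explicitly known is rejected."""
    return role_claim in KNOWN_ROLES


@dataclass(frozen=True)
class Condition:
    """One pre-build condition from a review (illustrative shape)."""
    name: str
    resolved: bool


def build_unblocked(conditions):
    """Unblock only if there is at least one condition and all are resolved.

    An empty list does not count as passing: no gate is not a cleared gate.
    """
    return len(conditions) > 0 and all(c.resolved for c in conditions)
```

Treating an empty condition list as blocked is a design choice in this sketch: it keeps the gate from passing simply because nobody wrote it.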
The build lifecycle that wraps every deliverable runs on the same principle. A builder agent produces the work. An independent review agent evaluates it against the DoD. If the review rejects, the builder revises and the cycle repeats until the reviewer accepts. Then the CEO reviews and the work is committed. The reviewer is not the builder. The criteria are not invented at review time. The same gate runs against every deliverable in the phase.
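The loop described above can be sketched in a few lines of Python. This is an assumption-laden outline, not the actual NORD lifecycle: `run_gate`, the `build` and `review` callables, and the `max_rounds` cap are hypothetical. The post describes the cycle repeating until the reviewer accepts; the cap is added here only so the sketch cannot loop forever.

```python
def run_gate(build, review, dod, max_rounds=5):
    """Run a builder/reviewer cycle against criteria fixed in advance.

    build(dod, feedback) -> work
    review(work, dod)    -> (accepted, feedback)

    The reviewer is a separate callable from the builder, and both
    receive the same dod, which exists before the first round starts.
    """
    feedback = []
    for round_number in range(1, max_rounds + 1):
        work = build(dod, feedback)
        accepted, feedback = review(work, dod)
        if accepted:
            # In the post, a final human review happens after this point.
            return work, round_number
    raise RuntimeError(f"not accepted after {max_rounds} rounds")
```

The point of the shape is that `dod` is an input to the loop, not something the reviewer produces inside it.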
The pattern applies outside the agent system too. Earlier this month I ran a capability decomposition project on the catalog of skills NORD might adopt. The first filter pass produced 230 candidates. Before any candidate became a build, I wrote a decision-grade brief on it. What the capability is. What it does. The grounded use case for adopting it. What saying yes to it actually commits to building. The honest case against. A ruling. Then a wave assignment and an effort estimate. Forty-nine candidates passed the brief stage and are now in queue. Most of the rest were deferred, excluded, or rerouted to existing tooling. No skill is in the build pipeline because someone thought it was a good idea. Every one is in the pipeline because the brief held up.
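One way to picture the brief is as a record whose fields mirror the list above, with the ruling deciding whether a candidate reaches the queue. The following Python sketch is illustrative only. The field names, the `Ruling` values, and `build_queue` are assumptions for the example, not the format actually used for the NORD briefs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Ruling(Enum):
    # Hypothetical labels based on the outcomes the post mentions.
    ADOPT = "adopt"
    DEFER = "defer"
    EXCLUDE = "exclude"
    REROUTE = "reroute"


@dataclass(frozen=True)
class Brief:
    capability: str     # what the capability is
    what_it_does: str
    use_case: str       # the grounded use case for adopting it
    commitment: str     # what saying yes commits to building
    case_against: str   # the honest case against
    ruling: Ruling
    wave: Optional[int] = None    # assigned only after the ruling
    effort: Optional[str] = None  # effort estimate, free text here

    def __post_init__(self):
        # An adopted brief is incomplete without a wave and an estimate.
        if self.ruling is Ruling.ADOPT and (self.wave is None or self.effort is None):
            raise ValueError("adopted briefs need a wave and an effort estimate")


def build_queue(briefs):
    """Only briefs ruled ADOPT enter the queue, ordered by wave."""
    adopted = [b for b in briefs if b.ruling is Ruling.ADOPT]
    return sorted(adopted, key=lambda b: b.wave)
```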
There is a version of evaluation that is a discipline you apply to the work after it ships. Instrument the agent. Log the calls. Sample the outputs. Score them. Compare. That work is valuable. It is also not what I am describing.
What I am describing is the choice to let the evaluation define the work. The DoD exists before the deliverable. The pre-build conditions exist before the build. The brief exists before the skill. The acceptance gate is the thing the work has to clear, and the gate was set before anyone started writing the work that has to clear it.
Same words. Different structure. Different commitment.
The reason I walked out of that panel thinking about this differently is that Rohit Patel made the structural argument explicit. Build the evals first. Baseline them. Iterate. That ordering is not a process preference. It is what makes evaluation load-bearing. Add the eval after and the eval is a report card. Add it first and the eval is the design.
I rebuilt NORD's planning lifecycle around that distinction. Every phase, every deliverable, every capability candidate. The thing that has to be true at the end is defined before anyone starts the work. That is the only way I have found to keep an agent system honest at the scale I am running.