We are closing the loop. Automatically.

Every team building an AI agent ends up running the same loop. You configure the agent. You run it. You inspect the output, decide whether it looks acceptable, and ship. Sometimes you formalize this with a handful of eval cases, get a score back, and try to make the number go up.

The score tells you something is off, but not how to fix it. It’s up to you to get from “73% pass” to a merged fix.

Autoloop closes that gap. Every simulation produces structured feedback, from the graders and from you, that turns into real code changes. Most of those changes land in your agent's source; a smaller set becomes edits to the graders themselves. Two things improve in parallel: the agent gets better at the cases you care about, and the graders get sharper at noticing what you would notice. After a few iterations, the manual effort required drops considerably.

How the loop works

  • Agentic graders. Each grader is an LLM agent with Read, Bash, and access to the forked database your simulation just ran against. It returns detailed observations against a grounded truth rather than a single number.
  • You annotate too. Confirm a grader's verdict, push back on it, add observations the grader missed, or comment on the grader itself. All notes are anchored to specific turns and spans.
  • Annotations become real diffs. Autoloop reads the full set of annotations and proposes code changes. Most edit your agent's source. The rest refine the grader's rubric.
  • The graders sharpen along the way. Over a handful of iterations, the grader internalizes what you care about. You write fewer annotations. The agent continues to improve.
  • Local-first. Each iteration runs against a local snapshot of your real environment. Nothing leaves your machine if you don't want to.


The rest of this post unpacks those five lines.

Graders that annotate, instead of just scoring

The default grader in most eval frameworks is a prompt: "rate this output from 1 to 5, here's the rubric." The grader returns a number, the dashboard updates, and the team moves on.

Autoloop graders can be LLM agents with tools. They can have Read, Bash, and access to the forked database your simulation used. When a grader reviews a run, it returns annotations alongside the verdict:

  • "Turn 3 failed: the agent confirmed the user's email in plaintext. Compliance rule 4."
  • "Turn 4 ran five database queries when one would have done. The first three were redundant lookups against `projects`."
  • "Turn 7 mentions 'Clippy' as if the user already knows what that is. Onboarding state indicates day 1."

Each annotation is anchored to a specific tool call or LLM call from the trace. Because graders are agentic, a “Skeptic” grader can run bash commands inside the fork to verify whether the row your agent claims to have inserted actually exists. The verdict comes from an actual ground truth, not from confidence in the agent’s transcript.

The score still appears on the dashboard, but the annotations are what you act on.

You annotate too

Your annotations follow the same format: anchored to specific turns and spans. Notes have several purposes. You can confirm a grader's finding ("yes, this is a real failure"). You might push back on a verdict ("this should have failed for X reason"). You might flag something the grader skipped, such as a tone shift on turn 6 that would annoy a real user. And, importantly, you can annotate the grader itself: "the rubric needs to catch implicit confirmations, not just literal ones."

By the end of a review session, the run has been reviewed, not just scored. By you and the grader, in the same document, with every note tied to a specific moment in the conversation.

From annotations to diffs

This is where the loop closes.

After a review session, Autoloop reads everything: the grader's annotations, your annotations, the trace, the rubric, transcripts, environment logs, metric scores and so on. It clusters the feedback by failure mode and proposes diffs. Most land in your agent's source as fixes for the issues the graders flagged. A smaller set becomes rubric edits, refining the grader itself when you have pushed back on its verdicts.

You review the diffs like any other code change. Apply the ones that look right, kick the next iteration against the same fork, and watch the dashboard. Did the email leak disappear? Did the five queries collapse into one? Did the new-user case get handled cleanly?

If yes, the loop closes on that issue. If not, you annotate the new run, propose new diffs, and run again. Every iteration shares a `loopId`, so the dashboard shows the full progression: v1, v2, v3, with what was fixed, what regressed, and what is new in each.

You can pass `--until-pass` if you want autoloop to retry until every grader returns green. Most of the time, you won't. You will review the annotations and apply diffs deliberately, the way you respond to a code review. The goal is not automation; it is that "73% pass" has been replaced by something concrete and actionable: a record of what was wrong, what you tried, and what worked.

That is the loop. It closes on real changes to your agent, not on numbers ticking upward.

The team of graders

The team of graders from the tagline doesn't appear out of nowhere. You build it in the same review sessions you're already running for the agent.

When you push back on a grader's verdict, or note that its rubric should have caught something, autoloop treats that as feedback about the grader rather than the agent, and proposes a rubric edit next to the usual fixes. You review both in the same pass. Apply the rubric edit, and the next run catches that case on its own. You're not writing that note again, because now the grader writes it for you.

This compounds in your favor. The first few runs against your domain need heavy annotation, because the grader doesn't know yet what you count as a failure. A dozen runs later you're barely annotating, because it does. Your review load keeps dropping while the agent keeps getting better.

Run autoloop for a few weeks and you end up with a small bench of graders, each reading runs through a different lens. One enforces your compliance rules. Another reads every reply the way a day-one user would. Add a senior engineer, a PM, whatever your domain needs. Each is a markdown file with a name and a rubric, committed to your repo where anyone on the team can run it. They catch what you'd catch, because you spent those sessions teaching them how.

What "closing the loop" means in practice

A loop closes when the distance between "something is broken" and "something is fixed" gets short enough that you keep doing it instead of putting it off. A bare "73%" never gets there, because it doesn't tell you what to change. An annotated run with a few diffs waiting for review does.

So try one. Run a simulation, point a grader at it, and read what it flags. Add what it missed and push back where it's wrong, then let autoloop turn the notes into diffs. Apply the ones that hold up, run it again, and watch the dashboard to see what actually moved. The first pass is the most work you'll do; after that the grader carries more of it each time, until you stop calling this evaluation and just call it the way you ship.

If you are interested in trying out our beta, reach out to us and we will make it happen.

Trending

Popular Articles This Month

Take your first step towards self-improving agents

Join forward-thinking teams using Scorecard to upgrade the way they build, test, and improve AI AGENTS.

Learn More