Closing the Loop with AI (part 3): Moving the Human to the End of the Pipeline

Closing the Loop with AI (part 3): Moving the Human to the End of the Pipeline

My last two articles have been about one idea: closing the loop with AI.

Not “AI-assisted coding.” Not “AI that helps you write functions.”
I’m talking about something else entirely.

I’m talking about building systems where the agent writes the code, tests the code, evaluates the result,
fixes the code, and repeats — without me sitting in the middle acting like a tired QA engineer.

Because honestly, that middle position is the worst place to be.

You get exhausted. You lose objectivity. And eventually you look at the project and think:
everything here is garbage.

So the goal is simple:

Remove the human from the middle of the loop.

Place the human at the end of the loop.

The human should only confirm: “Is this what I asked for?”
Not manually test every button.

The Real Question: How Do You Close the Loop?

There isn’t a single answer. It depends on the technology stack and the type of application you’re building.
So far, I’ve been experimenting with three environments:

  • Console applications
  • Web applications
  • Windows Forms applications (still a work in progress)

Each one requires a slightly different strategy.

But the core principle is always the same:

The agent must be able to observe what it did.

If the agent cannot see logs, outputs, state, or results — the loop stays open.

Console Applications: The Easiest Loop to Close

Console apps are the simplest place to start.

My setup is minimal and extremely effective:

  • Serilog writing structured logs
  • Logs written to the file system
  • Output written to the console

Why both?

Because the agent (GitHub Copilot in VS Code) can run the app, read console output, inspect log files,
decide what to fix, and repeat.

No UI. No browser. No complex state.
Just input → execution → output → evaluation.

If you want to experiment with autonomous loops, start here. Console apps are the cleanest lab environment you’ll ever get.

Web Applications: Where Things Get Interesting

Web apps are more complex, but also more powerful.

My current toolset:

  • Serilog for structured logging
  • Logs written to filesystem
  • SQLite for loop-friendly database inspection
  • Playwright for automated UI testing

Even if production uses PostgreSQL or SQL Server, I use SQLite during loop testing.
Not for production. For iteration.

The SQLite CLI makes inspection trivial.
The agent can call the API, trigger workflows, query SQLite directly, verify results, and continue fixing.

That’s a full feedback loop. No human required.

Playwright: Giving the Agent Eyes

For UI testing, Playwright is the key.

You can run it headless (fully autonomous) or with UI visible (my preferred mode).

Yes, I could remove myself completely. But I don’t.
Right now I sit outside the loop as an observer.
Not a tester. Not a debugger. Just watching.

If something goes completely off the rails, I interrupt.
Otherwise, I let the loop run.

This is an important transition:

From participant → to observer.

The Windows Forms Problem

Now comes the tricky part: Windows Forms.

Console apps are easy. Web apps have Playwright.
But desktop UI automation is messy.

Possible directions I’m exploring:

  • UI Automation APIs
  • WinAppDriver
  • Logging + state inspection hybrid approach
  • Screenshot-based verification
  • Accessibility tree inspection

The goal remains the same: the agent must be able to verify what happened without me.

Once that happens, the loop closes.

What I’ve Learned So Far

1) Logs Are Everything

If the agent cannot read what happened, it cannot improve. Structured logs > pretty logs. Always.

2) SQLite Is the Perfect Loop Database

Not for production. For iteration. The ability to query state instantly from CLI makes autonomous debugging possible.

3) Agents Need Observability, Not Prompts

Most people focus on prompt engineering. I focus on observability engineering.
Give the agent visibility into logs, state, outputs, errors, and the database. Then iteration becomes natural.

4) Humans Should Validate Outcomes — Not Steps

The human should only answer: “Is this what I asked for?” That’s what the agent is for.

My Current Loop Architecture (Simplified)

Specification → Agent writes code → Agent runs app → Agent tests → Agent reads logs/db →
Agent fixes → Repeat → Human validates outcome

If the loop works, progress becomes exponential.
If the loop is broken, everything slows down.

My Question to You

This is still evolving. I’m refining the process daily, and I’m convinced this is how development will work from now on:
agents running closed feedback loops with humans validating outcomes at the end.

So I’m curious:

  • What tooling are you using?
  • How are you creating feedback loops?
  • Are you still inside the loop — or already outside watching it run?

Because once you close the loop…
you don’t want to go back.

 

Closing the Loop (Part 2): So Far, So Good — and Yes, It’s Token Hungry

Closing the Loop (Part 2): So Far, So Good — and Yes, It’s Token Hungry

I wrote my previous article about closing the loop for agentic development earlier this week, although the ideas themselves have been evolving for several days. This new piece is simply a progress report: how the approach is working in practice, what I’ve built so far, and what I’m learning as I push deeper into this workflow.

Short version: it’s working.
Long version: it’s working really well — but it’s also incredibly token-hungry.

Let’s talk about it.

A Familiar Benchmark: The Activity Stream Problem

Whenever I want to test a new development approach, I go back to a problem I know extremely well: building an activity stream.

An activity stream is basically the engine of a social network — posts, reactions, notifications, timelines, relationships. It touches everything:

  • Backend logic
  • UI behavior
  • Realtime updates
  • State management
  • Edge cases everywhere

I’ve implemented this many times before, so I know exactly how it should behave. That makes it the perfect benchmark for agentic development. If the AI handles this correctly, I know the workflow is solid.

This time, I used it to test the closing-the-loop concept.

The Current Setup

So far, I’ve built two main pieces:

  1. An MCP-based project
  2. A Blazor application implementing the activity stream

But the real experiment isn’t the app itself — it’s the workflow.

Instead of manually testing and debugging, I fully committed to this idea:

The AI writes, tests, observes, corrects, and repeats — without me acting as the middleman.

So I told Copilot very clearly:

  • Don’t ask me to test anything
  • You run the tests
  • You fix the issues
  • You verify the results

To make that possible, I wired everything together:

  • Playwright MCP for automated UI testing
  • Serilog logging to the file system
  • Screenshot capture of the UI during tests
  • Instructions to analyze logs and fix issues automatically

So the loop becomes:

write → test → observe → fix → retest

And honestly, I love it.

My Surface Is Working. I’m Not Touching It.

Here’s the funny part.

I’m writing this article on my MacBook Air.

Why?

Because my main development machine — a Microsoft Surface laptop — is currently busy running the entire loop by itself.

I told Copilot to open the browser and actually execute the tests visually. So it’s navigating the UI, filling forms, clicking buttons, taking screenshots… all by itself.

And I don’t want to touch that machine while it’s working.

It feels like watching a robot doing your job. You don’t interrupt it mid-task. You just observe.

So I switched computers and thought: “Okay, this is a perfect moment to write about what’s happening.”

That alone says a lot about where this workflow is heading.

Watching the Loop Close

Once everything was wired together, I let it run.

The agent:

  • Writes code
  • Runs Playwright tests
  • Reads logs
  • Reviews screenshots
  • Detects issues
  • Fixes them
  • Runs again

Seeing the system self-correct without constant intervention is incredibly satisfying.

In traditional AI-assisted development, you often end up exhausted:

  • The AI gets stuck
  • You explain the issue
  • It half-fixes it
  • You explain again
  • Something else breaks

You become the translator and debugger for the model.

With a self-correcting loop, that burden drops dramatically. The system can fail, observe, and recover on its own.

That changes everything.

The Token Problem (Yes, It’s Real)

There is one downside: this workflow is extremely token hungry.

Last month I used roughly 700% more tokens than usual. This month, and we’re only around February 8–9, I’ve already used about 200% of my normal limits.

Why so expensive?

Because the loop never sleeps:

  • Test execution
  • Log analysis
  • Screenshot interpretation
  • Code rewriting
  • Retesting
  • Iteration

Every cycle consumes tokens. And when the system is autonomous, those cycles happen constantly.

Model Choice Matters More Than You Think

Another important detail: not all models consume tokens equally inside Copilot.

Some models count as:

  • 3× usage
  • 1× usage
  • 0.33× usage
  • 0× usage

For example:

  • Some Anthropic models are extremely good for testing and reasoning
  • But they can count as 3× token usage
  • Others are cheaper but weaker
  • Some models (like GPT-4 Mini or GPT-4o in certain Copilot tiers) count as toward limits

At some point I actually hit my token limits and Copilot basically said: “Come back later.”

It should reset in about 24 hours, but in the meantime I switched to the 0× token models just to keep the loop running.

The difference in quality is noticeable.

The heavier models are much better at:

  • Debugging
  • Understanding logs
  • Self-correcting
  • Complex reasoning

The lighter or free models can still work, but they struggle more with autonomous correction.

So model selection isn’t just about intelligence — it’s about token economics.

Why It’s Still Worth It

Yes, this approach consumes more tokens.

But compare that to the alternative:

  • Sitting there manually testing
  • Explaining the same bug five times
  • Watching the AI fail repeatedly
  • Losing mental energy on trivial fixes

That’s expensive too — just not measured in tokens.

I would rather spend tokens than spend mental fatigue.

And realistically:

  • Models get cheaper every month
  • Tooling improves weekly
  • Context handling improves
  • Local and hybrid options are evolving

What feels expensive today might feel trivial very soon.

MCP + Blazor: A Perfect Testing Ground

So far, this workflow works especially well for:

  • MCP-based systems
  • Blazor applications
  • Known benchmark problems

Using a familiar problem like an activity stream lets me clearly measure progress. If the agent can build and maintain something complex that I already understand deeply, that’s a strong signal.

Right now, the signal is positive.

The loop is closing. The system is self-correcting. And it’s actually usable.

What Comes Next

This article is just a status update.

The next one will go deeper into something very important:

How to design self-correcting mechanisms for agentic development.

Because once you see an agent test, observe, and fix itself, you don’t want to go back to manual babysitting.

For now, though:

The idea is working. The workflow feels right. It’s token hungry. But absolutely worth it.

Closing the loop isn’t theory anymore — it’s becoming a real development style.