I wrote my previous article about closing the loop for agentic development earlier this week, although the ideas themselves have been evolving for several days. This new piece is simply a progress report: how the approach is working in practice, what I’ve built so far, and what I’m learning as I push deeper into this workflow.

Short version: it’s working.
Long version: it’s working really well — but it’s also incredibly token-hungry.

Let’s talk about it.

A Familiar Benchmark: The Activity Stream Problem

Whenever I want to test a new development approach, I go back to a problem I know extremely well: building an activity stream.

An activity stream is basically the engine of a social network — posts, reactions, notifications, timelines, relationships. It touches everything (see the sketch after this list):

  • Backend logic
  • UI behavior
  • Realtime updates
  • State management
  • Edge cases everywhere
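
To make that concrete, here's a rough sketch of the kind of domain model an activity stream implies. These types are hypothetical, meant only to show the surface area — they are not the actual code from my app:

    // Hypothetical sketch of an activity-stream domain (not my app's real types).
    using System;
    using System.Collections.Generic;

    public record Post(Guid Id, Guid AuthorId, string Body, DateTimeOffset CreatedAt);

    public record Reaction(Guid PostId, Guid UserId, string Kind);  // e.g. "like"

    public record Notification(Guid RecipientId, string Message, bool Read);

    // A timeline entry is one post plus everything the UI needs to render it live.
    public record TimelineEntry(Post Post, IReadOnlyList<Reaction> Reactions);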

I’ve implemented this many times before, so I know exactly how it should behave. That makes it the perfect benchmark for agentic development. If the AI handles this correctly, I know the workflow is solid.

This time, I used it to test the closing-the-loop concept.

The Current Setup

So far, I’ve built two main pieces:

  1. An MCP-based project
  2. A Blazor application implementing the activity stream

But the real experiment isn’t the app itself — it’s the workflow.

Instead of manually testing and debugging, I fully committed to this idea:

The AI writes, tests, observes, corrects, and repeats — without me acting as the middleman.

So I told Copilot very clearly:

  • Don’t ask me to test anything
  • You run the tests
  • You fix the issues
  • You verify the results

To make that possible, I wired everything together (a configuration sketch follows the list):

  • Playwright MCP for automated UI testing
  • Serilog logging to the file system
  • Screenshot capture of the UI during tests
  • Instructions to analyze logs and fix issues automatically
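
The Serilog part, for instance, is just a standard file sink. A minimal sketch, assuming the Serilog.AspNetCore and Serilog.Sinks.File packages and an illustrative log path:

    // Program.cs -- write logs to disk so the agent can read them back after each run.
    using Serilog;

    Log.Logger = new LoggerConfiguration()
        .MinimumLevel.Debug()
        .WriteTo.File("logs/app-.log", rollingInterval: RollingInterval.Day)
        .CreateLogger();

    var builder = WebApplication.CreateBuilder(args);
    builder.Host.UseSerilog();  // route ASP.NET Core / Blazor logging through Serilog

    var app = builder.Build();
    app.Run();

The point isn't the logging itself; it's that the output lands somewhere the agent can open and parse without me in the middle.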

So the loop becomes:

write → test → observe → fix → retest

And honestly, I love it.

My Surface Is Working. I’m Not Touching It.

Here’s the funny part.

I’m writing this article on my MacBook Air.

Why?

Because my main development machine — a Microsoft Surface laptop — is currently busy running the entire loop by itself.

I told Copilot to open the browser and actually execute the tests visually. So it’s navigating the UI, filling forms, clicking buttons, taking screenshots… all by itself.
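
Through the Playwright MCP, those actions boil down to ordinary Playwright calls. A hand-written equivalent in Playwright for .NET might look like this (the URL and selectors are invented for illustration):

    // Roughly what the agent does via the Playwright MCP; selectors are made up.
    using Microsoft.Playwright;

    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync(new() { Headless = false });
    var page = await browser.NewPageAsync();

    await page.GotoAsync("https://localhost:5001/feed");
    await page.FillAsync("#new-post-text", "Hello from the loop");
    await page.ClickAsync("#submit-post");
    await page.ScreenshotAsync(new() { Path = "screenshots/post-created.png" });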

And I don’t want to touch that machine while it’s working.

It feels like watching a robot do your job. You don't interrupt it mid-task. You just observe.

So I switched computers and thought: “Okay, this is a perfect moment to write about what’s happening.”

That alone says a lot about where this workflow is heading.

Watching the Loop Close

Once everything was wired together, I let it run.

The agent (its loop is sketched after this list):

  • Writes code
  • Runs Playwright tests
  • Reads logs
  • Reviews screenshots
  • Detects issues
  • Fixes them
  • Runs again
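
Conceptually it's nothing more than a retry loop. This sketch shows the shape only; the step bodies are stubs, not Copilot's actual internals:

    // The shape of the closed loop; every step here is a stub for what the agent does.
    using System;
    using System.Collections.Generic;

    var attempt = 0;
    while (true)
    {
        attempt++;
        RunPlaywrightSuite();                      // run the UI tests
        var issues = AnalyzeLogsAndScreenshots();  // read Serilog output and screenshots
        if (issues.Count == 0) break;              // nothing failed: the loop is closed
        ApplyFixes(issues);                        // rewrite the code and go again
    }
    Console.WriteLine($"Green after {attempt} attempt(s).");

    static void RunPlaywrightSuite() { /* stub: execute the Playwright tests */ }
    static List<string> AnalyzeLogsAndScreenshots() => new();  // stub: parse logs for errors
    static void ApplyFixes(List<string> issues) { /* stub: apply code fixes */ }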

Seeing the system self-correct without constant intervention is incredibly satisfying.

In traditional AI-assisted development, you often end up exhausted:

  • The AI gets stuck
  • You explain the issue
  • It half-fixes it
  • You explain again
  • Something else breaks

You become the translator and debugger for the model.

With a self-correcting loop, that burden drops dramatically. The system can fail, observe, and recover on its own.

That changes everything.

The Token Problem (Yes, It’s Real)

There is one downside: this workflow is extremely token-hungry.

Last month I used roughly 700% more tokens than usual. This month, even though it's only around February 8–9, I've already used about 200% of my normal limits.

Why so expensive?

Because the loop never sleeps:

  • Test execution
  • Log analysis
  • Screenshot interpretation
  • Code rewriting
  • Retesting
  • Iteration

Every cycle consumes tokens. And when the system is autonomous, those cycles happen constantly.

Model Choice Matters More Than You Think

Another important detail: not all models consume tokens equally inside Copilot.

Some models count as:

  • 3× usage
  • 1× usage
  • 0.33× usage
  • 0× usage

For example:

  • Some Anthropic models are extremely good for testing and reasoning
  • But they can count as 3× token usage
  • Others are cheaper but weaker
  • Some models (like GPT-4 Mini or GPT-4o in certain Copilot tiers) count as 0× toward limits

Those multipliers add up fast: at 3×, a single day of loop iterations costs as much as three days at 1×, or roughly nine at 0.33×.

At some point I actually hit my token limits and Copilot basically said: “Come back later.”

It should reset in about 24 hours, but in the meantime I switched to the 0× token models just to keep the loop running.

The difference in quality is noticeable.

The heavier models are much better at:

  • Debugging
  • Understanding logs
  • Self-correcting
  • Complex reasoning

The lighter or free models can still work, but they struggle more with autonomous correction.

So model selection isn’t just about intelligence — it’s about token economics.

Why It’s Still Worth It

Yes, this approach consumes more tokens.

But compare that to the alternative:

  • Sitting there manually testing
  • Explaining the same bug five times
  • Watching the AI fail repeatedly
  • Losing mental energy on trivial fixes

That’s expensive too — just not measured in tokens.

I would rather spend tokens than spend mental fatigue.

And realistically:

  • Models get cheaper every month
  • Tooling improves weekly
  • Context handling improves
  • Local and hybrid options are evolving

What feels expensive today might feel trivial very soon.

MCP + Blazor: A Perfect Testing Ground

So far, this workflow works especially well for:

  • MCP-based systems
  • Blazor applications
  • Known benchmark problems

Using a familiar problem like an activity stream lets me clearly measure progress. If the agent can build and maintain something complex that I already understand deeply, that’s a strong signal.

Right now, the signal is positive.

The loop is closing. The system is self-correcting. And it’s actually usable.

What Comes Next

This article is just a status update.

The next one will go deeper into something very important:

How to design self-correcting mechanisms for agentic development.

Because once you see an agent test, observe, and fix itself, you don’t want to go back to manual babysitting.

For now, though:

The idea is working. The workflow feels right. It's token-hungry. But absolutely worth it.

Closing the loop isn’t theory anymore — it’s becoming a real development style.