I wrote my previous article about closing the loop for agentic development earlier this week, though the ideas themselves had been evolving for several days before that. This new piece is simply a progress report: how the approach is working in practice, what I’ve built so far, and what I’m learning as I push deeper into this workflow.
Short version: it’s working.
Long version: it’s working really well — but it’s also incredibly token-hungry.
Let’s talk about it.
A Familiar Benchmark: The Activity Stream Problem
Whenever I want to test a new development approach, I go back to a problem I know extremely well: building an activity stream.
An activity stream is basically the engine of a social network — posts, reactions, notifications, timelines, relationships. It touches everything:
- Backend logic
- UI behavior
- Realtime updates
- State management
- Edge cases everywhere
I’ve implemented this many times before, so I know exactly how it should behave. That makes it the perfect benchmark for agentic development. If the AI handles this correctly, I know the workflow is solid.
This time, I used it to test the closing-the-loop concept.
The Current Setup
So far, I’ve built two main pieces:
- An MCP-based project
- A Blazor application implementing the activity stream
But the real experiment isn’t the app itself — it’s the workflow.
Instead of manually testing and debugging, I fully committed to this idea:
The AI writes, tests, observes, corrects, and repeats — without me acting as the middleman.
So I told Copilot very clearly (see the instructions-file sketch after this list):
- Don’t ask me to test anything
- You run the tests
- You fix the issues
- You verify the results
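In practice, the easiest way to make rules like these stick across sessions is a repository custom-instructions file, which Copilot picks up from `.github/copilot-instructions.md`. A simplified sketch of what that can look like (paraphrased, not the literal file):

```markdown
<!-- .github/copilot-instructions.md (simplified sketch, not the exact file) -->
## Testing rules for the agent
- Never ask me to run, test, or verify anything manually.
- After every change, run the Playwright UI tests yourself.
- Read the Serilog log files and the captured screenshots before declaring a change done.
- If something fails, fix the code and re-run the loop until it passes.
```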
To make that possible, I wired everything together:
- Playwright MCP for automated UI testing
- Serilog logging to the file system (setup sketched after this list)
- Screenshot capture of the UI during tests
- Instructions to analyze logs and fix issues automatically
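To make the logging piece concrete: the Serilog side is just a file sink writing to a predictable path that the agent is told to read after each test run. A minimal sketch of the setup in `Program.cs`, assuming the `Serilog.AspNetCore` and `Serilog.Sinks.File` packages (the path and rolling policy here are my placeholders, not requirements):

```csharp
// Program.cs: minimal Serilog file sink so the agent can read logs back after each run (sketch)
using Serilog;

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Debug()
    .Enrich.FromLogContext()
    // A predictable location the agent is instructed to inspect after every test run
    .WriteTo.File("logs/activity-stream-.log", rollingInterval: RollingInterval.Day)
    .CreateLogger();

var builder = WebApplication.CreateBuilder(args);
builder.Host.UseSerilog(); // route Blazor / ASP.NET Core logging through Serilog

var app = builder.Build();
// ... map Blazor components and endpoints as usual ...
app.Run();
```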
So the loop becomes:
write → test → observe → fix → retest
And honestly, I love it.
My Surface Is Working. I’m Not Touching It.
Here’s the funny part.
I’m writing this article on my MacBook Air.
Why?
Because my main development machine — a Microsoft Surface laptop — is currently busy running the entire loop by itself.
I told Copilot to open the browser and actually execute the tests visually. So it’s navigating the UI, filling forms, clicking buttons, taking screenshots… all by itself.
And I don’t want to touch that machine while it’s working.
It feels like watching a robot doing your job. You don’t interrupt it mid-task. You just observe.
So I switched computers and thought: “Okay, this is a perfect moment to write about what’s happening.”
That alone says a lot about where this workflow is heading.
Watching the Loop Close
Once everything was wired together, I let it run.
The agent:
- Writes code
- Runs Playwright tests (one is sketched after this list)
- Reads logs
- Reviews screenshots
- Detects issues
- Fixes them
- Runs again
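To give a feel for one cycle: whether the agent drives the browser through the Playwright MCP server or writes an actual test file, the steps look roughly like this. Here is a simplified sketch using Playwright for .NET; the URL, selectors, and paths are placeholders, not the real ones:

```csharp
// Sketch of an agent-written UI test (Playwright for .NET); selectors, URL, and paths are placeholders.
using System.Threading.Tasks;
using Microsoft.Playwright;
using Microsoft.Playwright.NUnit;
using NUnit.Framework;

public class ActivityStreamTests : PageTest
{
    [Test]
    public async Task Posting_ShowsUpInTheStream()
    {
        await Page.GotoAsync("https://localhost:5001/");

        // Create a post through the UI, the same way a user would
        await Page.FillAsync("#new-post-text", "Hello from the loop");
        await Page.ClickAsync("#new-post-submit");

        // The new post should appear in the stream
        await Expect(Page.Locator(".activity-item").First)
            .ToContainTextAsync("Hello from the loop");

        // Screenshot for the agent to review in the "observe" step of the loop
        await Page.ScreenshotAsync(new() { Path = "artifacts/post-created.png", FullPage = true });
    }
}
```

The test itself isn’t the interesting part; what matters is that the logs and the screenshot land somewhere the agent has been explicitly told to look.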
Seeing the system self-correct without constant intervention is incredibly satisfying.
In traditional AI-assisted development, you often end up exhausted:
- The AI gets stuck
- You explain the issue
- It half-fixes it
- You explain again
- Something else breaks
You become the translator and debugger for the model.
With a self-correcting loop, that burden drops dramatically. The system can fail, observe, and recover on its own.
That changes everything.
The Token Problem (Yes, It’s Real)
There is one downside: this workflow is extremely token-hungry.
Last month I used roughly 700% more tokens than usual. This month (and it’s only around February 8–9) I’ve already used about 200% of my normal limits.
Why so expensive?
Because the loop never sleeps:
- Test execution
- Log analysis
- Screenshot interpretation
- Code rewriting
- Retesting
- Iteration
Every cycle consumes tokens. And when the system is autonomous, those cycles happen constantly.
Model Choice Matters More Than You Think
Another important detail: not all models consume tokens equally inside Copilot.
Some models count as:
- 3× usage
- 1× usage
- 0.33× usage
- 0× usage
For example (a rough illustration of the impact follows this list):
- Some Anthropic models are extremely good for testing and reasoning
- But they can count as 3× token usage
- Others are cheaper but weaker
- Some models (like GPT-4 Mini or GPT-4o in certain Copilot tiers) count as 0× toward limits
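To see why that matters for an autonomous loop, here’s a back-of-the-envelope illustration. The numbers are entirely made up for the sake of the math, not actual Copilot quotas:

```
Assume 100 autonomous iterations in a day, each costing one premium request:

  3×    model → counts as 300 requests against the monthly limit
  1×    model → counts as 100
  0.33× model → counts as ~33
  0×    model → counts as 0 (doesn't touch the limit at all)
```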
At some point I actually hit my token limits and Copilot basically said: “Come back later.”
It should reset in about 24 hours, but in the meantime I switched to the 0× token models just to keep the loop running.
The difference in quality is noticeable.
The heavier models are much better at:
- Debugging
- Understanding logs
- Self-correcting
- Complex reasoning
The lighter or free models can still work, but they struggle more with autonomous correction.
So model selection isn’t just about intelligence — it’s about token economics.
Why It’s Still Worth It
Yes, this approach consumes more tokens.
But compare that to the alternative:
- Sitting there manually testing
- Explaining the same bug five times
- Watching the AI fail repeatedly
- Losing mental energy on trivial fixes
That’s expensive too — just not measured in tokens.
I would rather spend tokens than mental energy.
And realistically:
- Models get cheaper every month
- Tooling improves weekly
- Context handling improves
- Local and hybrid options are evolving
What feels expensive today might feel trivial very soon.
MCP + Blazor: A Perfect Testing Ground
So far, this workflow works especially well for:
- MCP-based systems
- Blazor applications
- Known benchmark problems
Using a familiar problem like an activity stream lets me clearly measure progress. If the agent can build and maintain something complex that I already understand deeply, that’s a strong signal.
Right now, the signal is positive.
The loop is closing. The system is self-correcting. And it’s actually usable.
What Comes Next
This article is just a status update.
The next one will go deeper into something very important:
How to design self-correcting mechanisms for agentic development.
Because once you see an agent test, observe, and fix itself, you don’t want to go back to manual babysitting.
For now, though:
The idea is working. The workflow feels right. It’s token-hungry. But absolutely worth it.
Closing the loop isn’t theory anymore — it’s becoming a real development style.