Closing the Loop (Part 2): So Far, So Good — and Yes, It’s Token Hungry

I wrote my previous article about closing the loop for agentic development earlier this week, although the ideas themselves have been evolving for several days. This new piece is simply a progress report: how the approach is working in practice, what I’ve built so far, and what I’m learning as I push deeper into this workflow.

Short version: it’s working.
Long version: it’s working really well — but it’s also incredibly token-hungry.

Let’s talk about it.

A Familiar Benchmark: The Activity Stream Problem

Whenever I want to test a new development approach, I go back to a problem I know extremely well: building an activity stream.

An activity stream is basically the engine of a social network — posts, reactions, notifications, timelines, relationships. It touches everything:

  • Backend logic
  • UI behavior
  • Realtime updates
  • State management
  • Edge cases everywhere

I’ve implemented this many times before, so I know exactly how it should behave. That makes it the perfect benchmark for agentic development. If the AI handles this correctly, I know the workflow is solid.

This time, I used it to test the closing-the-loop concept.

The Current Setup

So far, I’ve built two main pieces:

  1. An MCP-based project
  2. A Blazor application implementing the activity stream

But the real experiment isn’t the app itself — it’s the workflow.

Instead of manually testing and debugging, I fully committed to this idea:

The AI writes, tests, observes, corrects, and repeats — without me acting as the middleman.

So I told Copilot very clearly:

  • Don’t ask me to test anything
  • You run the tests
  • You fix the issues
  • You verify the results

To make that possible, I wired everything together:

  • Playwright MCP for automated UI testing
  • Serilog logging to the file system
  • Screenshot capture of the UI during tests
  • Instructions to analyze logs and fix issues automatically

So the loop becomes:

write → test → observe → fix → retest
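
For the Serilog piece, the setup is deliberately simple. Here is a minimal sketch of file logging in a Blazor app's Program.cs (the path and rolling interval are illustrative choices, not my exact configuration; it assumes the Serilog.AspNetCore package):

using Serilog;

var builder = WebApplication.CreateBuilder(args);

// Write logs to the file system so the agent can read them after each test run.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .WriteTo.File("logs/activity-stream-.log", rollingInterval: RollingInterval.Day)
    .CreateLogger();

builder.Host.UseSerilog();

var app = builder.Build();
app.Run();

The only thing that matters here is that failures end up in files the agent can open, not in a console window only I can see.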

And honestly, I love it.

My Surface Is Working. I’m Not Touching It.

Here’s the funny part.

I’m writing this article on my MacBook Air.

Why?

Because my main development machine — a Microsoft Surface laptop — is currently busy running the entire loop by itself.

I told Copilot to open the browser and actually execute the tests visually. So it’s navigating the UI, filling forms, clicking buttons, taking screenshots… all by itself.

And I don’t want to touch that machine while it’s working.

It feels like watching a robot do your job. You don't interrupt it mid-task. You just observe.

So I switched computers and thought: “Okay, this is a perfect moment to write about what’s happening.”

That alone says a lot about where this workflow is heading.

Watching the Loop Close

Once everything was wired together, I let it run.

The agent:

  • Writes code
  • Runs Playwright tests
  • Reads logs
  • Reviews screenshots
  • Detects issues
  • Fixes them
  • Runs again

Seeing the system self-correct without constant intervention is incredibly satisfying.

In traditional AI-assisted development, you often end up exhausted:

  • The AI gets stuck
  • You explain the issue
  • It half-fixes it
  • You explain again
  • Something else breaks

You become the translator and debugger for the model.

With a self-correcting loop, that burden drops dramatically. The system can fail, observe, and recover on its own.

That changes everything.

The Token Problem (Yes, It’s Real)

There is one downside: this workflow is extremely token hungry.

Last month I used roughly 700% more tokens than usual. This month (and we're only around February 8–9), I've already used about 200% of my normal limits.

Why so expensive?

Because the loop never sleeps:

  • Test execution
  • Log analysis
  • Screenshot interpretation
  • Code rewriting
  • Retesting
  • Iteration

Every cycle consumes tokens. And when the system is autonomous, those cycles happen constantly.

Model Choice Matters More Than You Think

Another important detail: not all models consume tokens equally inside Copilot.

Some models count as:

  • 3× usage
  • 1× usage
  • 0.33× usage
  • 0× usage

For example:

  • Some Anthropic models are extremely good for testing and reasoning
  • But they can count as 3× token usage
  • Others are cheaper but weaker
  • Some models (like GPT-4 Mini or GPT-4o in certain Copilot tiers) count as 0× toward limits

At some point I actually hit my token limits and Copilot basically said: “Come back later.”

It should reset in about 24 hours, but in the meantime I switched to the 0× token models just to keep the loop running.

The difference in quality is noticeable.

The heavier models are much better at:

  • Debugging
  • Understanding logs
  • Self-correcting
  • Complex reasoning

The lighter or free models can still work, but they struggle more with autonomous correction.

So model selection isn’t just about intelligence — it’s about token economics.

Why It’s Still Worth It

Yes, this approach consumes more tokens.

But compare that to the alternative:

  • Sitting there manually testing
  • Explaining the same bug five times
  • Watching the AI fail repeatedly
  • Losing mental energy on trivial fixes

That’s expensive too — just not measured in tokens.

I would rather spend tokens than spend mental fatigue.

And realistically:

  • Models get cheaper every month
  • Tooling improves weekly
  • Context handling improves
  • Local and hybrid options are evolving

What feels expensive today might feel trivial very soon.

MCP + Blazor: A Perfect Testing Ground

So far, this workflow works especially well for:

  • MCP-based systems
  • Blazor applications
  • Known benchmark problems

Using a familiar problem like an activity stream lets me clearly measure progress. If the agent can build and maintain something complex that I already understand deeply, that’s a strong signal.

Right now, the signal is positive.

The loop is closing. The system is self-correcting. And it’s actually usable.

What Comes Next

This article is just a status update.

The next one will go deeper into something very important:

How to design self-correcting mechanisms for agentic development.

Because once you see an agent test, observe, and fix itself, you don’t want to go back to manual babysitting.

For now, though:

The idea is working. The workflow feels right. It’s token hungry. But absolutely worth it.

Closing the loop isn’t theory anymore — it’s becoming a real development style.

 

Closing the Loop: Letting AI Finish the Work

Last week I was in Sochi on a ski trip. Instead of skiing, I got sick.

So I spent a few days locked in a hotel room, doing what I always do when I can’t move much: working. Or at least what looks like work. In reality, it’s my hobby.

YouTube wasn’t working well there, so I downloaded a few episodes in advance. Most of them were about OpenClaw and its creator, Peter Steinberger — also known for building PSPDFKit.

What started as passive watching turned into one of those rare moments of clarity you only get when you’re forced to slow down.

Shipping Code You Don’t Read (In the Right Context)

In one of the interviews, Peter said something that immediately caught my attention: he ships code he doesn’t review.

At first that sounds reckless. But then I realized… I sometimes do the same.

However, context matters.

Most of my daily work is research and development. I build experimental systems, prototypes, and proofs of concept — either for our internal office or for exploring ideas with clients. A lot of what I write is not production software yet. It’s exploratory. It’s about testing possibilities.

In that environment, I don’t always need to read every line of generated code.

If the use case works and the tests pass, that’s often enough.

I work mainly with C#, ASP.NET, Entity Framework, and XAF from DevExpress. I know these ecosystems extremely well. So if something breaks later, I can go in and fix it myself. But most of the time, the goal isn’t to perfect the implementation — it’s to validate the idea.

That’s a crucial distinction.

When writing production code for a customer, quality and review absolutely matter. You must inspect, verify, and ensure maintainability. But when working on experimental R&D, the priority is different: speed of validation and clarity of results.

In research mode, not every line needs to be perfect. It just needs to prove whether the idea works.

Working “Without Hands”

My real goal is to operate as much as possible without hands.

By that I mean minimizing direct human interaction with implementation. I want to express intent clearly enough so agents can execute it.

If I can describe a system precisely — especially in domains I know deeply — then the agent should be able to build, test, and refine it. My role becomes guiding and validating rather than manually constructing everything.

This is where modern development is heading.

The Problem With Vibe Coding

Peter talked about something that resonated deeply: when you’re vibe coding, you produce a lot of AI slop.

You prompt. The AI generates. You run it. It fails. You tweak. You run again. Still wrong. You tweak again.

Eventually, the human gets tired.

Even when you feel close to a solution, it’s not done until it’s actually done. And manually pushing that process forward becomes exhausting.

This is where many AI workflows break down. Not because the AI can’t generate solutions — but because the loop still depends too heavily on human intervention.

Closing the Loop

The key idea is simple and powerful: agentic development works when the agent can test and correct itself.

You must close the loop.

Instead of: human → prompt → AI → human checks → repeat

You want: AI → builds → tests → detects errors → fixes → tests again → repeat

The agent needs tools to evaluate its own output.

When AI can run tests, detect failures, and iterate automatically, something shifts. The process stops being experimental prompting and starts becoming real engineering.

Spec-Driven vs Self-Correcting Systems

Spec-driven development still matters. Some people dismiss it as too close to waterfall, but every methodology has flaws.

The real evolution is combining clear specifications with self-correcting loops.

The human defines:

  • The specification
  • The expected behavior
  • The acceptance criteria

Then the AI executes, tests, and refines until those criteria are satisfied.

The human doesn’t need to babysit every iteration. The human validates the result once the loop is closed.

Engineering vs Parasitic Ideas

There’s a concept from a book about parasitic ideas.

In social sciences, parasitic ideas can spread because they’re hard to disprove. In engineering, bad ideas fail quickly.

If you design a bridge incorrectly, it collapses. Reality provides immediate feedback.

Software — especially AI-generated software — needs the same grounding in reality. Without continuous testing and validation, generated code can drift into something that looks plausible but doesn’t work.

Closing the loop forces ideas to confront reality.

Tests are that reality.

Taking the Human Out of the Repetitive Loop

The goal isn’t removing humans entirely. It’s removing humans from repetitive validation.

The human should:

  • Define the specification
  • Define what “done” means
  • Approve the final result

The AI should:

  • Implement
  • Test
  • Detect issues
  • Fix itself
  • Repeat until success

When that happens, development becomes scalable in a new way. Not because AI writes code faster — but because AI can finish what it starts.

What I Realized in That Hotel Room

Getting sick in Sochi wasn’t part of the plan. But it forced me to slow down long enough to notice something important.

Most friction in modern development isn’t writing code. It’s closing loops.

We generate faster than we validate. We start more than we finish. We rely on humans to constantly re-check work that machines could verify themselves.

In research and experimental work, it’s fine not to inspect every line — as long as the system proves its behavior. In production work, deeper review is essential. Knowing when each approach applies is part of modern engineering maturity.

The future of agentic development isn’t just better models. It’s better loops.

Because in the end, nothing is finished until the loop is closed.

 

GitHub Copilot for the Rest of Us

How GitHub Copilot Became My Sysadmin, Writer, and Creative Partner

When people talk about GitHub Copilot, they almost always describe it the same way: an AI that writes code.
That’s true—Copilot can write code—but treating it as “just a coding tool” is like calling a smartphone
“a device for making phone calls.”

The moment you start using Copilot inside Visual Studio Code, something important changes:
it stops being a code generator and starts behaving more like a context-aware work partner.
Not because it magically knows everything—but because VS Code gives it access to the things that matter:
your files, your folders, your terminals, your scripts, your logs, and even your remote machines.

That’s why this article isn’t about code autocomplete. It’s about the other side of Copilot:
the part that’s useful for people who are building, maintaining, writing, organizing, diagnosing, or shipping
real work—especially the messy kind.

Copilot as a Linux Server Sidekick

One of my most common uses for Copilot has nothing to do with application logic.
I use it for Linux server setup and diagnostics.

If you run Copilot in VS Code and you also use Remote development (SSH), you essentially get a workspace that can:

  • Connect to Linux servers over SSH
  • Edit remote configuration files safely
  • Run commands and scripts in an integrated terminal
  • Search through logs and system files quickly
  • Manage folders like they’re local projects

That means Copilot isn’t “helping me code.” It’s helping me operate.

I often set up hosting and administration tools like Virtualmin or Webmin, or configure other infrastructure:
load balancers, web servers, SSL, firewall rules, backups—whatever the server needs to become stable and usable.
In those situations Copilot becomes the assistant that speeds up the most annoying parts:
the remembering, the searching, the cross-checking, and the “what does this error actually mean?”

What this looks like in practice

Instead of bouncing between browser tabs and old notes, I’ll use Copilot directly in the workspace:

  • “Explain what this service error means and suggest the next checks.”
  • “Read this log snippet and list the most likely causes.”
  • “Generate a safe Nginx config for this domain layout.”
  • “Create a hardening checklist for a fresh VPS.”
  • “What would you verify before assuming this is a network issue?”

The benefit isn’t that Copilot is always right. The benefit is that it helps you move faster with less friction—
and it keeps your work inside the same place where the files and commands actually live.

Copilot as an Operations Brain (Not Just a Code Brain)

Here’s the real mental shift:

Copilot doesn’t need to write code to be useful. It needs context.

In VS Code, that context includes the entire workspace: configuration files, scripts, documentation, logs,
command history, and whatever you’re currently working on. Once you realize that, Copilot becomes useful for:

  • Debugging infrastructure problems
  • Translating “error messages” into “actionable steps”
  • Drafting repeatable setup scripts
  • Creating operational runbooks and checklists
  • Turning tribal knowledge into documentation

It’s especially valuable when the work is messy and practical—when you’re not trying to invent something new,
you’re trying to make something work.

Copilot as a Writing Workspace

Now switch gears. One of the best non-coding Copilot stories I've seen comes from my cousin Alexandra.
She’s writing a small storybook.

She started the way a lot of people do: writing by hand, collecting pages, keeping ideas in scattered places.
At one point she was using Copilot through Microsoft Office, but I suggested a different approach:

Use VS Code as the creative workspace.

Not because VS Code is “a writing tool,” but because it gives you structure for free:

  • A folder becomes the book
  • Each chapter becomes a file
  • Markdown becomes a simple, readable format
  • Git (optionally) becomes version history
  • Copilot becomes the editor, brainstormer, and consistency checker

In that setup, Copilot isn’t writing the story for you. It’s helping you shape it:
rewrite a paragraph, suggest alternatives, tighten dialogue, keep a consistent voice,
summarize a scene, or generate a few options when you’re stuck.

Yes, Even Illustrations (Within Reason)

This surprises people: you can also work with simple illustrations inside a VS Code workspace.
Not full-on painting, obviously—but enough for many small projects.

VS Code can handle things like vector graphics (SVG), simple diagram formats, and text-driven visuals.
If you describe a scene, Copilot can help generate a starting SVG illustration, and you can iterate from there.
It’s not about replacing professional design—it’s about making it easier to prototype, experiment,
and keep everything (text + assets) together in one organized place.

The Hidden Superpower: VS Code’s Ecosystem

Copilot is powerful on its own. But its real strength comes from where it lives.

VS Code brings the infrastructure:

  • Extensions for almost any workflow
  • Remote development over SSH
  • Integrated terminals and tasks
  • Search across files and folders
  • Versioning and history
  • Cross-platform consistency

So whether you’re configuring a server, drafting a runbook, organizing a book, or building a folder-based project,
Copilot adapts because the workspace defines the context.

The Reframe

If there’s one idea worth keeping, it’s this:

GitHub Copilot is not a coding tool. It’s a general-purpose work companion that happens to be excellent at code.

Once you stop limiting it to source files, it becomes:

  • A sysadmin assistant
  • A documentation partner
  • A creative editor
  • A workflow accelerator
  • A “second brain” inside the tools you already use

And the best part is that none of this requires a new platform or a new habit.
It’s the same VS Code workspace you already know—just used for more than code.

 

The Mirage of a Memory Leak (or: why “it must be the framework” is usually wrong)

There is a familiar moment in every developer’s life.

Memory usage keeps creeping up.
The process never really gives that memory back.
After hours—or days—the application feels heavier, slower, tired.

And the conclusion arrives almost automatically:

“The framework has a memory leak.”
“That component library is broken.”
“The GC isn’t doing its job.”

It’s a comforting explanation.

It’s also usually wrong.

Memory Leaks vs. Memory Retention

In managed runtimes like .NET, true memory leaks are rare.
The garbage collector is extremely good at reclaiming memory.
If an object is unreachable, it will be collected.

What most developers call a “memory leak” is actually
memory retention.

  • Objects are still referenced
  • So they stay alive
  • Forever

From the GC’s point of view, nothing is wrong.

From your point of view, RAM usage keeps climbing.

Why Frameworks Are the First to Be Blamed

When you open a profiler and look at what’s alive, you often see:

  • UI controls
  • ORM sessions
  • Binding infrastructure
  • Framework services

So it’s natural to conclude:

“This thing is leaking.”

But profilers don’t answer why something is alive.
They only show that it is alive.

Framework objects are usually not the cause — they are just sitting at the
end of a reference chain that starts in your code.

The Classic Culprit: Bad Event Wiring

The most common “mirage leak” is caused by events.

The pattern

  • A long-lived publisher (static service, global event hub, application-wide manager)
  • A short-lived subscriber (view, view model, controller)
  • A subscription that is never removed

That’s it. That’s the leak.

Why it happens

Events are references.
If the publisher lives for the lifetime of the process, anything it
references also lives for the lifetime of the process.

Your object doesn’t get garbage collected.

It becomes immortal.
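
Here is a minimal sketch of that pattern in C#. The GlobalEventHub and CustomerViewModel names are illustrative; they deliberately match the retention tree shown below.

using System;

public sealed class DataUpdatedEventArgs : EventArgs
{
    public int CustomerId { get; init; }
}

// Long-lived publisher: a process-wide singleton.
public sealed class GlobalEventHub
{
    public static GlobalEventHub Instance { get; } = new();

    public event EventHandler<DataUpdatedEventArgs>? DataUpdated;

    public void RaiseDataUpdated(int customerId) =>
        DataUpdated?.Invoke(this, new DataUpdatedEventArgs { CustomerId = customerId });
}

// Short-lived subscriber: created per view, expected to die with the view.
public class CustomerViewModel
{
    public CustomerViewModel()
    {
        // The hub now holds a delegate pointing back at this instance...
        GlobalEventHub.Instance.DataUpdated += OnDataUpdated;
        // ...and nothing ever calls -=, so this view model can never be
        // collected while the hub (and therefore the process) is alive.
    }

    private void OnDataUpdated(object? sender, DataUpdatedEventArgs e)
    {
        // Refresh logic would go here.
    }
}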

The Immortal Object: When Short-Lived Becomes Eternal

An immortal object is an object that should be short-lived
but can never be garbage collected because it is still reachable from a GC
root.

Not because of a GC bug.
Not because of a framework leak.
But because our code made it immortal.

Static fields, singletons, global event hubs, timers, and background services
act as anchors. Once a short-lived object is attached to one of these, it
stops aging.

GC Root
  └── static / singleton / service
        └── Event, timer, or callback
              └── Delegate or closure
                    └── Immortal object
                          └── Large object graph

From the GC’s perspective, everything is valid and reachable.
From your perspective, memory never comes back down.

A Retention Dependency Tree That Cannot Be Collected

GC Root
  └── static GlobalEventHub.Instance
        └── GlobalEventHub.DataUpdated (event)
              └── delegate → CustomerViewModel.OnDataUpdated
                    └── CustomerViewModel
                          └── ObjectSpace / DbContext
                                └── IdentityMap / ChangeTracker
                                      └── Customer, Order, Invoice, ...

What you see in the memory dump:

  • thousands of entities
  • ORM internals
  • framework objects

What actually caused it:

  • one forgotten event unsubscription

The Lambda Trap (Even Worse, Because It Looks Innocent)

The code

public CustomerViewModel(GlobalEventHub hub)
{
    // Subscribing with a lambda: the compiler generates a hidden closure
    // that captures "this", so the hub now references this view model.
    hub.DataUpdated += (_, e) =>
    {
        RefreshCustomer(e.CustomerId);
    };
    // There is no handler variable left to unsubscribe with later.
}

This lambda implicitly captures the enclosing view model instance (this).
The compiler creates a hidden closure that keeps that instance alive for as long as the subscription exists.

“But I Disposed the Object!”

Disposal does not save you here.

  • Dispose does not remove event handlers
  • Dispose does not break static references
  • Dispose does not stop background work automatically

IDisposable is a promise — not a magic spell.
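
If you want disposal to actually help, you have to break the chain yourself. A minimal sketch, reusing the same illustrative types from the earlier hub example:

public class CustomerViewModel : IDisposable
{
    private readonly GlobalEventHub _hub;

    public CustomerViewModel(GlobalEventHub hub)
    {
        _hub = hub;
        // Subscribe with a named method so there is something to unsubscribe later.
        _hub.DataUpdated += OnDataUpdated;
    }

    private void OnDataUpdated(object? sender, DataUpdatedEventArgs e) =>
        RefreshCustomer(e.CustomerId);

    private void RefreshCustomer(int customerId)
    {
        // Refresh logic here.
    }

    public void Dispose()
    {
        // This line is what actually breaks the reference chain back from the hub.
        // Nothing in IDisposable does it for you.
        _hub.DataUpdated -= OnDataUpdated;
    }
}

The point is not the Dispose method itself; it is the explicit -= that removes this object from the publisher's invocation list.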

Leak-Hunting Checklist

Reference Roots

  • Are there static fields holding objects?
  • Are singletons referencing short-lived instances?
  • Is a background service keeping references alive?

Events

  • Are subscriptions always paired with unsubscriptions?
  • Are lambdas hiding captured references?

Timers & Async

  • Are timers stopped and disposed?
  • Are async loops cancellable?

Profiling

  • Follow GC roots, not object counts
  • Inspect retention paths
  • Ask: who is holding the reference?

Final Thought

Frameworks rarely leak memory.

We do.

Follow the references.
Trust the GC.
Question your wiring.

That’s when the mirage finally disappears.

 

As an XAF Developer, What Should I Actually Test?

This is a story about testing XAF applications — and why now is finally the right time to do it properly.

With Copilot agents and AI-assisted coding, writing code has become cheaper and faster than ever. Features that used to take days now take hours. Boilerplate is almost free.

And that changes something important.

For the first time, many of us actually have time to do the things we always postponed:

  • documenting the source code,
  • writing proper user manuals,
  • and — yes — writing tests.

But that immediately raises the real question:

What kind of tests should I even write?

Most developers use “unit tests” as a synonym for “tests”. But once you move beyond trivial libraries and into real application frameworks, that definition becomes fuzzy very quickly.

And nowhere is that more obvious than in XAF.

I’ve been working with XAF for something like 15–18 years (I’ve honestly lost count). It’s my preferred application framework, and it’s incredibly productive — but testing it “as-is” can feel like wrestling a framework-shaped octopus.

So let’s clarify something first.


You don’t test the framework. You test your logic.

XAF already gives you a lot for free:

  • CRUD
  • UI generation
  • validation plumbing
  • security system
  • object lifecycle
  • persistence

DevExpress has already tested those parts — thousands of times, probably millions by now.

So you do not need to write tests like:

  • “Can ObjectSpace save an object?”
  • “Does XAF load a View?”
  • “Does the security system work?”

You assume those things work.

Your responsibility is different.

You test the decisions your application makes.

That principle applies to XAF — and honestly, to any serious application framework.


The mental shift: what is a “unit”, really?

In classic theory, a unit is the smallest piece of code with a single responsibility — usually a method.

In real applications, that definition is often too small to be useful.

Sometimes the real “unit” is:

  • a workflow,
  • a business decision,
  • a state transition,
  • or a rule spanning multiple objects.

In XAF especially, the decision matters more than the method.

That’s why the right question is not “how do I unit test XAF?”
The right question is:

Which decisions in my app are important enough to protect?


The test pyramid for XAF

A practical, realistic test pyramid for XAF looks like this:

  1. Fast unit tests for pure logic
  2. Unit tests with thin seams around XAF-specific dependencies
  3. Integration tests with a real ObjectSpace (confidence tests)
  4. Minimal UI tests only for critical wiring

Let’s go layer by layer.


1) Push logic out of XAF into plain services (fast unit tests)

This is the biggest win you’ll ever get.

The moment you move important logic out of:

  • Controllers
  • Rules
  • ObjectSpace-heavy code

…testing becomes boring — and boring is good.

Put non-UI logic into:

  • Domain services (e.g. IInvoicePricingService)
  • Use-case handlers (CreateInvoiceHandler, PostInvoiceHandler)
  • Pure methods (no ObjectSpace, no View, no security calls)

Now you can test with plain xUnit / NUnit and simple mocks or fakes.

What is a service?

A service is code that makes business decisions.

It answers questions like:

  • “Can this invoice be posted?”
  • “Is this discount valid?”
  • “What is the total?”
  • “Is the user allowed to approve this?”

A service:

  • contains real logic
  • is framework-agnostic
  • is the thing you most want to unit test

If code decides why something happens, it belongs in a service.
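
To make that concrete, here is a minimal sketch of a pricing service and a plain xUnit test. The discount rule itself is invented for illustration:

using System;
using Xunit;

// Framework-agnostic: no ObjectSpace, no View, no SecuritySystem.
public interface IInvoicePricingService
{
    decimal CalculateTotal(decimal subtotal, decimal discountPercent);
}

public class InvoicePricingService : IInvoicePricingService
{
    public decimal CalculateTotal(decimal subtotal, decimal discountPercent)
    {
        if (discountPercent < 0 || discountPercent > 100)
            throw new ArgumentOutOfRangeException(nameof(discountPercent));

        return subtotal - (subtotal * discountPercent / 100m);
    }
}

public class InvoicePricingServiceTests
{
    [Fact]
    public void Applies_percentage_discount_to_subtotal()
    {
        var service = new InvoicePricingService();

        var total = service.CalculateTotal(subtotal: 200m, discountPercent: 10m);

        Assert.Equal(180m, total);
    }
}

Tests like this run in milliseconds and never need an application model.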


2) Unit test XAF-specific logic with thin seams

Some logic will always touch XAF concepts. That’s fine.

The trick is not to eliminate XAF — it’s to isolate it.

You do that by introducing seams.

What is a seam?

A seam is a boundary where you can replace a real dependency with a fake one in a test.

A seam:

  • usually contains no business logic
  • exists mainly for testability
  • is often an interface or wrapper

Common XAF seams:

  • ICurrentUser instead of SecuritySystem.CurrentUser
  • IClock instead of DateTime.Now
  • repositories / unit-of-work instead of raw IObjectSpace
  • IUserNotifier instead of direct UI calls

Seams don’t decide anything — they just let you escape the framework in tests.
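
To give a feel for how small a seam is, here is a sketch of a clock seam (names are illustrative):

using System;

// The seam: no logic, just a boundary you can fake in a test.
public interface IClock
{
    DateTime UtcNow { get; }
}

// Production implementation: the only place that touches DateTime.UtcNow.
public sealed class SystemClock : IClock
{
    public DateTime UtcNow => DateTime.UtcNow;
}

// Test fake: lets a test pin "now" to a known value.
public sealed class FixedClock : IClock
{
    public FixedClock(DateTime utcNow) => UtcNow = utcNow;
    public DateTime UtcNow { get; }
}

A service that takes IClock in its constructor can be tested with FixedClock instead of silently depending on the machine's clock.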

What does “adapter” mean in XAF?

An adapter is a very thin class whose job is to:

  • translate XAF concepts (View, ObjectSpace, Actions, Rules)
  • into calls to your services and use cases

Adapters:

  • contain little or no business logic
  • are allowed to be hard to unit test
  • exist to connect XAF to your code

Typical XAF adapters:

  • Controllers
  • Appearance Rules
  • Validation Rules
  • Action handlers
  • Property setters that delegate to services

The adapter is not the brain.
The brain lives in services.
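
As a rough sketch, a thin XAF adapter might look like the controller below. I'm assuming an Invoice business object and a PostInvoiceHandler with a Handle method; those shapes are mine, not a prescribed XAF pattern:

using DevExpress.ExpressApp;
using DevExpress.ExpressApp.Actions;
using DevExpress.Persistent.Base;

public class PostInvoiceController : ObjectViewController<DetailView, Invoice>
{
    public PostInvoiceController()
    {
        var postInvoice = new SimpleAction(this, "PostInvoice", PredefinedCategory.Edit);
        postInvoice.Execute += (sender, e) =>
        {
            // Adapter work only: hand the current object to the use case,
            // then let XAF persist the result. No business decisions here.
            new PostInvoiceHandler().Handle(ViewCurrentObject);
            ObjectSpace.CommitChanges();
        };
    }
}

If this controller is awkward to unit test, that's fine; the decision about whether the invoice can be posted lives in the handler, which tests easily.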

What should you test here?

  • Appearance Rules
    Test the decision behind the rule (e.g. “Is this field editable now?”).
    Then confirm via integration tests that the rule is wired correctly.
  • Validation Rules
    Test the validation logic itself (conditions, edge cases).
    Optionally verify that the XAF rule triggers when expected.
  • Calculated properties / non-trivial setters
  • Controller decision logic once extracted from the Controller

3) Integration tests with a real ObjectSpace (confidence tests)

Unit tests prove your logic is correct.

Integration tests prove your XAF wiring still behaves.

They answer questions like:

  • Does persistence work?
  • Do validation and appearance rules trigger?
  • Do lifecycle hooks behave?
  • Does security configuration work as expected?

4) Minimal UI tests (only for critical wiring)

UI automation is expensive and fragile.

Keep UI tests only for:

  • Critical actions
  • Essential navigation flows
  • Known production regressions

The key mental model

A rule is not the unit.
The decision behind the rule is the unit.

Test the decision directly.
Use integration tests to confirm the glue still works.
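
Concretely, the decision behind an appearance rule can be an ordinary method you test directly. A small sketch (the InvoiceEditPolicy name and statuses are invented for illustration):

using Xunit;

public enum InvoiceStatus { Draft, Posted }

// The decision behind an "is this field editable?" appearance rule.
public static class InvoiceEditPolicy
{
    public static bool CanEditAmount(InvoiceStatus status) =>
        status == InvoiceStatus.Draft;
}

public class InvoiceEditPolicyTests
{
    [Fact]
    public void Amount_is_locked_once_the_invoice_is_posted()
    {
        Assert.True(InvoiceEditPolicy.CanEditAmount(InvoiceStatus.Draft));
        Assert.False(InvoiceEditPolicy.CanEditAmount(InvoiceStatus.Posted));
    }
}

The appearance rule in the model then simply calls CanEditAmount; one integration test can confirm the wiring, while the decision itself stays cheap to test.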


Closing thought

Test your app’s decisions, not the framework’s behavior.

That’s the difference between a test suite that helps you move faster
and one that quietly turns into a tax.