From Frontier Models to Local Specialists: My Mac Studio Experiment

I was recently in Arizona and got my hands on a Mac Studio (M1 Ultra, 64GB RAM).

Naturally, I didn’t use it for video editing or music production. I used it for what actually matters:

Running local AI models and trying to bend them to my will.

The Setup

I went all-in and installed three main tools.

The goal was simple:

Try every possible way to make open-source models behave like something useful.

Not just chatting. Not just “vibe coding.” Actual controlled behavior.

The Reality Check

After working with frontier models like those from OpenAI and Anthropic, one thing becomes obvious:

They’re just better.

Not because of magic.

Because of:

  • more training data
  • better curated data
  • more compute
  • more iterations
  • better eval loops

It’s not one thing. It’s everything combined.

You feel it immediately:

  • better reasoning
  • better language coverage
  • better consistency
  • fewer weird edge cases

So What’s the Point of Local Models?

This is where things get interesting.

Instead of trying to compete with frontier models, I flipped the approach:

Don’t build a general intelligence. Build a specialist.

The Core Idea: Single-Task Specialists

Instead of asking:

“How do I make this model as good as GPT or Claude?”

Ask:

“How do I make this model extremely good at ONE thing?”

That changes everything.

The Stack of Techniques

1. Base Model Selection

Small to mid-size models:

  • 7B–8B range works best locally

2. RAG (Retrieval-Augmented Generation)

Inject knowledge at runtime:

  • docs
  • codebases
  • tutorials
  • architecture notes

3. Behavior Tuning (LoRA / fine-tuning)

Instead of teaching knowledge, teach patterns:

  • how to respond
  • how to structure outputs
  • how to write code
  • how to fix bugs

4. Tooling / Skills

Give the model bounded capabilities:

  • limited toolset
  • deterministic flows
  • controlled execution

The Big Insight

The breakthrough for me was this:

Knowledge ≠ Behavior

You don’t train the model to know everything. You train it to behave correctly within a narrow domain.

Example: Framework Specialist

Let’s say you take a framework like DevExpress XAF.

You can build a local specialist by combining:

Knowledge Layer (RAG)

  • official docs
  • tutorials
  • your internal patterns
  • real-world examples

Behavior Layer (fine-tuning)

  • “user asks → correct implementation”
  • “buggy code → fixed code”
  • “requirement → test → solution”

Tool Layer

  • code generation
  • validation scripts
  • test runners

Now instead of a generic model, you get:

A XAF expert agent that runs locally.
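
Here is a rough sketch of how those three layers meet in code. It is only a sketch: it assumes a local server exposing an OpenAI-compatible /v1/chat/completions endpoint (Ollama, LM Studio, and llama.cpp all offer one), and the model name and the Retrieve() helper are illustrative stand-ins.

using System.Net.Http.Json;

// Knowledge layer (RAG): pull only the chunks relevant to this question.
string question = "How do I add a custom action to a list view?";
string context = Retrieve(question);

// Behavior layer: narrow, pattern-focused instructions
// (ideally reinforced by a LoRA tuned on question → implementation pairs).
string system =
    "You are a DevExpress XAF specialist. Answer only XAF questions. " +
    "Reply with a short explanation followed by compilable C#.";

// Tool layer: a short, bounded command list in the same prompt.
string tools = "Tools you may call: generate_code, validate_model, run_tests";

var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var response = await http.PostAsJsonAsync("/v1/chat/completions", new
{
    model = "qwen2.5-coder:7b", // illustrative local model name
    messages = new object[]
    {
        new { role = "system", content = $"{system}\n{tools}" },
        new { role = "user", content = $"Context:\n{context}\n\nQuestion: {question}" }
    }
});
Console.WriteLine(await response.Content.ReadAsStringAsync());

// Stub standing in for the real RAG lookup over docs and examples.
static string Retrieve(string q) => $"<top doc chunks matching: {q}>";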

Why This Works

Because you’re not fighting the limitations anymore.

You’re embracing them.

Small models:

  • are faster
  • are cheaper
  • are controllable
  • are predictable

And when you narrow the scope:

They become surprisingly good.

The Direction I’m Exploring

What really got my attention is this idea:

Simple models + simple tools + strong constraints = reliable systems

Not:

  • giant models
  • huge toolchains
  • uncontrolled agents

But:

  • small models
  • few tools
  • explicit rules
  • tight loops

What’s Next

I’m going to keep experimenting and document everything:

  • different models (Qwen, Llama, etc.)
  • different training approaches (LoRA, no training, RAG-only)
  • different tool strategies
  • evaluation setups
  • failure cases (this is the important part)

Final Thought

Frontier models are incredible.

But local models are hackable.

And that’s the real advantage.

You can shape them.
You can constrain them.
You can turn them into specialists.

And for real-world systems, that might actually be more valuable than raw intelligence.

 

How CLI Tools Can Drastically Reduce Token Consumption ⚡

After discovering that tools were silently burning my context window, I had a new problem:

How do I keep powerful agents without paying the cost of massive schemas?

The answer did not come from AI.

It came from something much older.

👉 The command line.

🧠 The Problem with Traditional Tools

Most agent tools today are defined with JSON schemas, nested parameters, verbose descriptions, and strict typing.

They look clean to us as developers.

But to an LLM?

👉 They are heavy prompt payloads.

Every request includes:

  • full schema
  • all parameters
  • all descriptions

Even if the user only says:

“order coffee”

💥 Why This Does Not Scale

Let’s say you have:

  • 30 tools
  • each with 800 tokens

That means:

👉 24,000 tokens before the user even speaks

And most of those tokens are:

  • never used
  • rarely relevant
  • repeated every time

🔍 Rethinking Tools

So I asked myself:

Do tools really need to be JSON schemas?

Or…

👉 Do they just need to be understandable commands?

⚡ Enter CLI-Style Tools

Instead of this:

{
  "name": "create_order",
  "parameters": {
    "userId": "...",
    "productId": "...",
    "quantity": "..."
  }
}

You define tools like this:

create_order uid pid qty

Or even:

order coffee size=large sugar=0

🧬 Why This Works

Because LLMs are incredibly good at:

  • parsing text
  • understanding intent
  • filling structured patterns

You do not need:

  • deep JSON
  • verbose schemas
  • long descriptions

👉 You just need clear syntax

📉 Token Cost Comparison

JSON Tool

~800–1500 tokens

CLI Tool

~10–30 tokens

👉 That is a 50x–100x reduction

🧠 Less Structure, Better Reasoning

With CLI tools:

  • fewer tokens means more room for conversation
  • simpler format means easier decisions
  • less noise means better accuracy

The model does not need to:

  • parse nested JSON
  • match schemas
  • validate deep structures

It just:

👉 generates a command

⚙️ How It Looks in Practice

User says:

“I want a large coffee with no sugar”

Agent outputs:

order_coffee size=large sugar=0

Your backend:

  • parses the string
  • validates parameters
  • executes the action
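
To make that concrete: the entire "tool schema" can live in the system prompt as a few plain lines. A sketch in C# (the commands are the illustrative ones from this article):

const string SystemPrompt = """
You can call exactly one command per reply, nothing else:
  order_coffee size=<small|medium|large> sugar=<0-3>
  book_yoga date=<yyyy-MM-dd> time=<HH:mm>
  pay_invoice id=<invoice-id>
If no command applies, answer in plain text.
""";

That whole block costs tens of tokens. The JSON-schema version of the same three tools would cost thousands.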

🧩 You Move Complexity Out of the Model

This is the key shift:

Before:

  • the model handles structure
  • the model handles validation
  • the model handles formatting

After:

  • the model generates intent
  • your code handles everything else

👉 Which is exactly what we want as engineers

🚀 Benefits for Real Systems

If you are building something like:

  • WhatsApp or Telegram agents
  • multi-domain assistants

CLI tools give you:

  • massive token savings
  • faster responses
  • lower cost per user
  • better scalability
  • simpler tool definitions

⚠️ Trade-offs

CLI tools are not magic.

You lose:

  • strict schema validation in the prompt
  • automatic argument formatting
  • some guardrails

So you must:

  • validate inputs in your backend
  • handle errors gracefully
  • design clean command syntax

🧠 Best Practices

1. Keep commands short

order_coffee
book_yoga
pay_invoice

2. Use key=value pairs

book_yoga date=2026-03-28 time=10:00

3. Avoid ambiguity

Bad:

order stuff

Good:

order_food item=pizza size=large

4. Build a parser layer in C#

This fits perfectly with a .NET stack:

  • Regex or tokenizer
  • map to DTO
  • validate
  • execute
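
A minimal sketch of that layer, using the illustrative order_food command from above:

using System.Linq;
using System.Text.RegularExpressions;

public record OrderFood(string Item, string Size);

public static class CommandParser
{
    // Turns "order_food item=pizza size=large" into a validated DTO.
    public static OrderFood? ParseOrderFood(string command)
    {
        var match = Regex.Match(command.Trim(),
            @"^order_food\s+item=(?<item>\w+)\s+size=(?<size>\w+)$");
        if (!match.Success) return null; // unknown or malformed command

        var dto = new OrderFood(match.Groups["item"].Value,
                                match.Groups["size"].Value);

        // Validation lives in code, not in the prompt.
        string[] sizes = { "small", "medium", "large" };
        return sizes.Contains(dto.Size) ? dto : null;
    }
}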

5. Combine with routing

The best setup looks like this:

Router → Domain → CLI tools

Now you get:

  • small toolsets
  • tiny prompts
  • efficient agents

💡 The Big Insight

After all this, I realized:

JSON schemas are for machines.
CLI commands are for language models.

🏁 Conclusion

The goal is not to remove tools.

The goal is to:

👉 make tools cheaper to think about

CLI-style tools:

  • reduce tokens
  • simplify reasoning
  • scale better

And most importantly:


They let your agent focus on what actually matters — understanding the user.

From “Hello” to Quota Exceeded: The Day My Agent Broke 💥

After testing OpenClaw, something clicked.

The future is not chat.

👉 The future is agents.

🛠️ Building the “Perfect” Agent

I started designing what I thought would be the ultimate assistant:

  • General purpose
  • Connected to everything
  • Capable of doing real tasks

And to make that happen…

I built tools.
A lot of tools.

Not generic ones — very specific tools:

  • booking flows
  • ordering systems
  • logistics
  • payments
  • daily life actions

Before I knew it…

👉 My agent had around 50 custom tools

And honestly, it felt powerful.

💡 The Business Idea

The plan was simple:

  • Give users a few free tokens per day
  • Let them try the agent
  • Hook them with real utility

A classic freemium model.

💥 Reality Hit Immediately

What actually happened?

Users would send:

“Hello”

…and then…

👉 Quota exceeded

Not after a conversation.
Not after a task.
After the second request.

🤨 That Made No Sense

At first, I thought:

  • Maybe there’s a bug
  • Maybe token counting is wrong
  • Maybe pricing is off

But everything checked out.

Still:

  • almost no conversation
  • almost no output
  • quota gone

🧠 That’s When I Started Digging

So I did what we always do:

👉 I looked under the hood

And what I found changed how I think about agents completely.

🔍 The Hidden Cost of Tools

I realized something critical:

My agent wasn’t just sending messages.
It was sending all 50 tools on every request.

Every. Single. Time.

📦 What That Actually Means

Each tool had:

  • name
  • description
  • parameters
  • JSON schema
  • nested objects

Individually? Fine.

Together?

👉 Massive.

So even a simple request like:

“Hello”

Was actually being processed like:

[system prompt]
[conversation]
[50 tool definitions]
[user: Hello]

🔥 I Was Burning Tokens Without Knowing

That’s when it clicked.

The user wasn’t paying for:

  • the message
  • the response

They were paying for:

👉 the entire toolset injected into the prompt

📉 Why My Quota Disappeared Instantly

Let’s do the math.

  • each tool ≈ 600–1000 tokens
  • I had ~50 tools

👉 I was sending 30,000–50,000 tokens per request

For a “Hello”.

No wonder the quota was gone after two messages.

😳 The Illusion of “Light Usage”

From the user’s perspective:

  • they typed almost nothing
  • they got almost nothing

From the system’s perspective:

👉 It processed a massive prompt

🧬 The Realization

That’s when I understood:

Tools are not just capabilities.
Tools are context weight.

Every tool:

  • consumes tokens
  • competes for attention
  • increases cost

⚠️ The Bigger Problem

It wasn’t just cost.

The agent was also:

  • slower
  • less accurate
  • sometimes picking the wrong tool

Because it had to:

👉 reason over 50 options every time

🧠 The Shift in Thinking

Before:

“More tools = smarter agent”

After:

“More tools = heavier prompt = worse performance”

🚀 What This Changed for Me

I stopped trying to build:

❌ One agent that does everything

And started designing:

✅ Systems that load only what’s needed

🧩 The New Approach

Instead of:

Agent → 50 tools

I moved to:

User → Router → Domain Agent → 5 tools

Now:

  • smaller prompts
  • lower cost
  • better decisions
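
Here is the shape of that router in C#. A sketch only: the domains and tools are illustrative, and the classifier is stubbed where a small, cheap model call would go.

string userMessage = "I want a large pizza";

// Each domain carries only a handful of tools.
var domainTools = new Dictionary<string, string[]>
{
    ["food"]    = new[] { "order_food", "cancel_order", "track_order" },
    ["booking"] = new[] { "book_yoga", "cancel_booking" },
    ["billing"] = new[] { "pay_invoice", "get_invoice" },
};

// Step 1: a tiny classification call picks the domain.
string domain = await ClassifyDomainAsync(userMessage);

// Step 2: only that domain's tools get injected into the agent prompt.
string[] tools = domainTools[domain];
Console.WriteLine($"Routing to '{domain}' with {tools.Length} tools instead of 50");

// Stub standing in for a one-word classification call to a small model.
Task<string> ClassifyDomainAsync(string message) =>
    Task.FromResult(message.Contains("pizza") ? "food" : "booking");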

💡 Final Insight

That experience taught me something simple but powerful:


If your agent feels expensive, slow, or dumb…
check how many tools you’re injecting into the prompt.

Because sometimes:

👉 You’re not scaling intelligence
👉 You’re scaling tokens

🏁 Closing

That “Hello → quota exceeded” moment was frustrating.

But it revealed a fundamental truth about agents:


The problem is not how many tools you have.
The problem is how many you send every time.

And once you see that…

You start building agents very differently.

The Day I Integrated GitHub Copilot SDK Inside My XAF App (Part 1)

A strange week

This week I was going to the university every day to study Russian.

Learning a new language as an adult is a very humbling experience. One moment you are designing enterprise architectures, and the next moment you are struggling to say:

“I feel fine”
which in Russian is: я чувствую себя хорошо

So like any developer, I started cheating immediately.

I began using AI for everything:

  • ChatGPT to review my exercises
  • GitHub Copilot inside VS Code correcting my grammar
  • Sometimes both at the same time

It worked surprisingly well. Almost too well.

At some point during the week, while going back and forth between my Russian homework and my development work, I noticed something interesting.

I was using several AI tools, but the one I kept returning to the most — without even thinking about it — was GitHub Copilot inside Visual Studio Code.

Not in the browser. Not in a separate chat window. Right there in my editor.

That’s when something clicked.

Two favorite tools

XAF is my favorite application framework. I’ve built countless systems with it — ERPs, internal tools, experiments, prototypes.

GitHub Copilot has become my favorite AI agent.

I use it constantly:

  • writing code
  • reviewing ideas
  • fixing small mistakes
  • even correcting my Russian exercises

And while using Copilot so much inside Visual Studio Code, I started thinking:

What would it feel like to have Copilot inside my own applications?

Not next to them. Inside them.

That idea stayed in my head for a few days until curiosity won.

The innocent experiment

I discovered the GitHub Copilot SDK.

At first glance it looked simple: a .NET library that allows you to embed Copilot into your own applications.

My first thought:

“Nice. This should take 30 minutes.”

Developers should always be suspicious of that sentence.

Because it never takes 30 minutes.

First success (false confidence)

The initial integration was surprisingly easy.

I managed to get a basic response from Copilot inside a test environment. Seeing AI respond from inside my own application felt a bit surreal.

For a moment I thought:

Done. Easy win.

Then I tried to make it actually useful.

That’s when the adventure began.

The rabbit hole

I didn’t want just a chatbot.

I wanted an agent that could actually interact with the application.

Ask questions. Query data. Help create things.

That meant enabling tool calling and proper session handling.

And suddenly everything started failing.

Timeouts. Half responses. Random behavior depending on the model. Sessions hanging for no clear reason.

At first I blamed myself.

Then my integration. Then threading. Then configuration.

Three or four hours later, after trying everything I could think of, I finally discovered the real issue:

It wasn’t my code.
It was the model.

Some models were timing out during tool calls. Others worked perfectly.

The moment I switched models and everything suddenly worked was one of those small but deeply satisfying developer victories.

You know the moment.

You sit back. Look at the screen. And just smile.

The moment it worked

Once everything was connected properly, something changed.

Copilot stopped feeling like a coding assistant and started feeling like an agent living inside the application.

Not in the IDE. Not in a browser tab. Inside the system itself.

That changes the perspective completely.

Instead of building forms and navigation flows, you start thinking:

What if the user could just ask?

Instead of:

  • open this screen
  • filter this grid
  • generate this report

You imagine:

  • “Show me what matters.”
  • “Create what I need.”
  • “Explain this data.”

The interface becomes conversational.

And once you see that working inside your own application, it’s very hard to unsee it.

Why this experiment mattered to me

This wasn’t about building a feature for a client. It wasn’t even about shipping production code.

Most of my work is research and development. Prototypes. Ideas. Experiments.

And this experiment changed the way I see enterprise applications.

For decades we optimized screens, menus, and workflows.

But AI introduces a completely different interaction model.

One where the application is no longer just something you navigate.

It’s something you talk to.

Also… Russian homework

Ironically, this whole experiment started because I was trying to survive my Russian classes.

Using Copilot to correct grammar. Using AI to review exercises. Switching constantly between tools.

Eventually that daily workflow made me curious:

What happens if Copilot is not next to my application, but inside it?

Sometimes innovation doesn’t start with a big strategy.

Sometimes it starts with curiosity and a small personal frustration.

What comes next

This is just the beginning.

Now that AI can live inside applications:

  • conversations can become interfaces
  • tools can be invoked by language
  • workflows can become more flexible

We are moving from:

software you operate

to:

software you collaborate with

And honestly, that’s a very exciting direction.

Final thought

This entire journey started with a simple curiosity while studying Russian and writing code in the same week.

A few hours of experimentation later, Copilot was living inside my favorite framework.

And now I can’t imagine going back.


Note: The next article will go deep into the technical implementation — the architecture, the service layer, tool calling, and how I wired everything into XAF for both Blazor and WinForms.

 

Closing the Loop with AI (part 3): Moving the Human to the End of the Pipeline

My last two articles have been about one idea: closing the loop with AI.

Not “AI-assisted coding.” Not “AI that helps you write functions.”
I’m talking about something else entirely.

I’m talking about building systems where the agent writes the code, tests the code, evaluates the result,
fixes the code, and repeats — without me sitting in the middle acting like a tired QA engineer.

Because honestly, that middle position is the worst place to be.

You get exhausted. You lose objectivity. And eventually you look at the project and think:
everything here is garbage.

So the goal is simple:

Remove the human from the middle of the loop.

Place the human at the end of the loop.

The human should only confirm: “Is this what I asked for?”
Not manually test every button.

The Real Question: How Do You Close the Loop?

There isn’t a single answer. It depends on the technology stack and the type of application you’re building.
So far, I’ve been experimenting with three environments:

  • Console applications
  • Web applications
  • Windows Forms applications (still a work in progress)

Each one requires a slightly different strategy.

But the core principle is always the same:

The agent must be able to observe what it did.

If the agent cannot see logs, outputs, state, or results — the loop stays open.

Console Applications: The Easiest Loop to Close

Console apps are the simplest place to start.

My setup is minimal and extremely effective:

  • Serilog writing structured logs
  • Logs written to the file system
  • Output written to the console

Why both?

Because the agent (GitHub Copilot in VS Code) can run the app, read console output, inspect log files,
decide what to fix, and repeat.

No UI. No browser. No complex state.
Just input → execution → output → evaluation.

If you want to experiment with autonomous loops, start here. Console apps are the cleanest lab environment you’ll ever get.
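
For reference, the entire Serilog side of that setup is a few lines (packages: Serilog, Serilog.Sinks.Console, Serilog.Sinks.File; the log event itself is illustrative):

using Serilog;

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Debug()
    .WriteTo.Console()             // what the agent reads live
    .WriteTo.File("logs/run-.log", // what the agent inspects afterwards
        rollingInterval: RollingInterval.Day)
    .CreateLogger();

Log.Information("Processed order {OrderId} with status {Status}", 42, "Confirmed");
Log.CloseAndFlush();

Structured properties like {OrderId} are the point: the agent can search for them instead of parsing prose.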

Web Applications: Where Things Get Interesting

Web apps are more complex, but also more powerful.

My current toolset:

  • Serilog for structured logging
  • Logs written to filesystem
  • SQLite for loop-friendly database inspection
  • Playwright for automated UI testing

Even if production uses PostgreSQL or SQL Server, I use SQLite during loop testing.
Not for production. For iteration.

The SQLite CLI makes inspection trivial.
The agent can call the API, trigger workflows, query SQLite directly, verify results, and continue fixing.

That’s a full feedback loop. No human required.
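
The same kind of check, written as a C# probe instead of the sqlite3 one-liner the agent would normally run (package: Microsoft.Data.Sqlite; the Orders table is illustrative):

using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=app.db");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText =
    "SELECT Id, Status FROM Orders ORDER BY CreatedAt DESC LIMIT 5";

using var reader = command.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt64(0)}: {reader.GetString(1)}");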

Playwright: Giving the Agent Eyes

For UI testing, Playwright is the key.

You can run it headless (fully autonomous) or with UI visible (my preferred mode).
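
In .NET that looks roughly like this (package: Microsoft.Playwright; the URL and selectors are illustrative):

using Microsoft.Playwright;

using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync(
    new BrowserTypeLaunchOptions { Headless = false }); // visible, so I can watch

var page = await browser.NewPageAsync();
await page.GotoAsync("http://localhost:5000");
await page.FillAsync("#new-post", "Hello from the loop");
await page.ClickAsync("text=Publish");

// Evidence the agent can inspect afterwards.
await page.ScreenshotAsync(new() { Path = "artifacts/after-publish.png" });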

Yes, I could remove myself completely. But I don’t.
Right now I sit outside the loop as an observer.
Not a tester. Not a debugger. Just watching.

If something goes completely off the rails, I interrupt.
Otherwise, I let the loop run.

This is an important transition:

From participant → to observer.

The Windows Forms Problem

Now comes the tricky part: Windows Forms.

Console apps are easy. Web apps have Playwright.
But desktop UI automation is messy.

Possible directions I’m exploring:

  • UI Automation APIs
  • WinAppDriver
  • Logging + state inspection hybrid approach
  • Screenshot-based verification
  • Accessibility tree inspection

The goal remains the same: the agent must be able to verify what happened without me.

Once that happens, the loop closes.

What I’ve Learned So Far

1) Logs Are Everything

If the agent cannot read what happened, it cannot improve. Structured logs > pretty logs. Always.

2) SQLite Is the Perfect Loop Database

Not for production. For iteration. The ability to query state instantly from CLI makes autonomous debugging possible.

3) Agents Need Observability, Not Prompts

Most people focus on prompt engineering. I focus on observability engineering.
Give the agent visibility into logs, state, outputs, errors, and the database. Then iteration becomes natural.

4) Humans Should Validate Outcomes — Not Steps

The human should only answer: “Is this what I asked for?” That’s what the agent is for.

My Current Loop Architecture (Simplified)

Specification → Agent writes code → Agent runs app → Agent tests → Agent reads logs/db →
Agent fixes → Repeat → Human validates outcome

If the loop works, progress becomes exponential.
If the loop is broken, everything slows down.

My Question to You

This is still evolving. I’m refining the process daily, and I’m convinced this is how development will work from now on:
agents running closed feedback loops with humans validating outcomes at the end.

So I’m curious:

  • What tooling are you using?
  • How are you creating feedback loops?
  • Are you still inside the loop — or already outside watching it run?

Because once you close the loop…
you don’t want to go back.

 

Closing the Loop (Part 2): So Far, So Good — and Yes, It’s Token Hungry

I wrote my previous article about closing the loop for agentic development earlier this week, although the ideas themselves have been evolving for several days. This new piece is simply a progress report: how the approach is working in practice, what I’ve built so far, and what I’m learning as I push deeper into this workflow.

Short version: it’s working.
Long version: it’s working really well — but it’s also incredibly token-hungry.

Let’s talk about it.

A Familiar Benchmark: The Activity Stream Problem

Whenever I want to test a new development approach, I go back to a problem I know extremely well: building an activity stream.

An activity stream is basically the engine of a social network — posts, reactions, notifications, timelines, relationships. It touches everything:

  • Backend logic
  • UI behavior
  • Realtime updates
  • State management
  • Edge cases everywhere

I’ve implemented this many times before, so I know exactly how it should behave. That makes it the perfect benchmark for agentic development. If the AI handles this correctly, I know the workflow is solid.

This time, I used it to test the closing-the-loop concept.

The Current Setup

So far, I’ve built two main pieces:

  1. An MCP-based project
  2. A Blazor application implementing the activity stream

But the real experiment isn’t the app itself — it’s the workflow.

Instead of manually testing and debugging, I fully committed to this idea:

The AI writes, tests, observes, corrects, and repeats — without me acting as the middleman.

So I told Copilot very clearly:

  • Don’t ask me to test anything
  • You run the tests
  • You fix the issues
  • You verify the results

To make that possible, I wired everything together:

  • Playwright MCP for automated UI testing
  • Serilog logging to the file system
  • Screenshot capture of the UI during tests
  • Instructions to analyze logs and fix issues automatically

So the loop becomes:

write → test → observe → fix → retest

And honestly, I love it.
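
For reference, those standing instructions are just a file in the repo. Here is a sketch of mine (.github/copilot-instructions.md is the VS Code Copilot convention; the paths and exact wording are illustrative):

You are working inside a closed loop. Rules:

- Never ask me to test anything. You run the tests yourself.
- After every change: build, run the Playwright tests, read the Serilog logs under ./logs, and review the screenshots you captured.
- If a test fails, analyze the logs, fix the code, and re-run.
- Only report back when everything passes or you are truly stuck.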

My Surface Is Working. I’m Not Touching It.

Here’s the funny part.

I’m writing this article on my MacBook Air.

Why?

Because my main development machine — a Microsoft Surface laptop — is currently busy running the entire loop by itself.

I told Copilot to open the browser and actually execute the tests visually. So it’s navigating the UI, filling forms, clicking buttons, taking screenshots… all by itself.

And I don’t want to touch that machine while it’s working.

It feels like watching a robot doing your job. You don’t interrupt it mid-task. You just observe.

So I switched computers and thought: “Okay, this is a perfect moment to write about what’s happening.”

That alone says a lot about where this workflow is heading.

Watching the Loop Close

Once everything was wired together, I let it run.

The agent:

  • Writes code
  • Runs Playwright tests
  • Reads logs
  • Reviews screenshots
  • Detects issues
  • Fixes them
  • Runs again

Seeing the system self-correct without constant intervention is incredibly satisfying.

In traditional AI-assisted development, you often end up exhausted:

  • The AI gets stuck
  • You explain the issue
  • It half-fixes it
  • You explain again
  • Something else breaks

You become the translator and debugger for the model.

With a self-correcting loop, that burden drops dramatically. The system can fail, observe, and recover on its own.

That changes everything.

The Token Problem (Yes, It’s Real)

There is one downside: this workflow is extremely token hungry.

Last month I used roughly 700% more tokens than usual. This month, and we’re only around February 8–9, I’ve already used about 200% of my normal limits.

Why so expensive?

Because the loop never sleeps:

  • Test execution
  • Log analysis
  • Screenshot interpretation
  • Code rewriting
  • Retesting
  • Iteration

Every cycle consumes tokens. And when the system is autonomous, those cycles happen constantly.

Model Choice Matters More Than You Think

Another important detail: not all models consume tokens equally inside Copilot.

Some models count as:

  • 3× usage
  • 1× usage
  • 0.33× usage
  • 0× usage

For example:

  • Some Anthropic models are extremely good for testing and reasoning
  • But they can count as 3× token usage
  • Others are cheaper but weaker
  • Some models (like GPT-4 Mini or GPT-4o in certain Copilot tiers) count as 0× toward limits

At some point I actually hit my token limits and Copilot basically said: “Come back later.”

It should reset in about 24 hours, but in the meantime I switched to the 0× token models just to keep the loop running.

The difference in quality is noticeable.

The heavier models are much better at:

  • Debugging
  • Understanding logs
  • Self-correcting
  • Complex reasoning

The lighter or free models can still work, but they struggle more with autonomous correction.

So model selection isn’t just about intelligence — it’s about token economics.

Why It’s Still Worth It

Yes, this approach consumes more tokens.

But compare that to the alternative:

  • Sitting there manually testing
  • Explaining the same bug five times
  • Watching the AI fail repeatedly
  • Losing mental energy on trivial fixes

That’s expensive too — just not measured in tokens.

I would rather spend tokens than spend mental fatigue.

And realistically:

  • Models get cheaper every month
  • Tooling improves weekly
  • Context handling improves
  • Local and hybrid options are evolving

What feels expensive today might feel trivial very soon.

MCP + Blazor: A Perfect Testing Ground

So far, this workflow works especially well for:

  • MCP-based systems
  • Blazor applications
  • Known benchmark problems

Using a familiar problem like an activity stream lets me clearly measure progress. If the agent can build and maintain something complex that I already understand deeply, that’s a strong signal.

Right now, the signal is positive.

The loop is closing. The system is self-correcting. And it’s actually usable.

What Comes Next

This article is just a status update.

The next one will go deeper into something very important:

How to design self-correcting mechanisms for agentic development.

Because once you see an agent test, observe, and fix itself, you don’t want to go back to manual babysitting.

For now, though:

The idea is working. The workflow feels right. It’s token hungry. But absolutely worth it.

Closing the loop isn’t theory anymore — it’s becoming a real development style.