Price Is a Feature: The Case for Cheap and Local AI Models

I'll admit it up front: I've got the local-model disease. It's a real plague going around developer
circles right now — new open-weight models dropping every week, each one begging to be downloaded and
benchmarked, your Python environments quietly rotting into incompatible mush as you try them all. I
have it. I don't want a cure. And the more I sit with it, the more I think the obsession is pointing at
something real that most of the agentic-coding discourse keeps missing.

Here it is, the whole thesis in four words: price is a feature.

"Performance is a feature" — and now price is too

There's an old line in software that performance is a feature. The speed of your app isn't a
nice-to-have you'll "get to later" — it's a property users feel on every interaction, and it shapes
whether they enjoy using the thing at all. Latency is a feature. Responsiveness is a feature.

In the agentic-AI era, price has joined that list, and I don't think it gets nearly enough respect.
Everyone benchmarks intelligence. Almost nobody treats cost as a first-class product property of the
model. But it is one — because cost doesn't just show up on your invoice. It shows up in your head.

The mental tax is the real cost

Think about how you actually behave when you're driving an expensive frontier model. Every time you go
to hit Enter, there's a little voice: better word this perfectly, because this prompt is about to cost
a dollar, and if I do a sloppy job I'll have to spend another two or three fixing it. You start
rationing. You over-engineer the prompt. You hesitate. You stop experimenting, because experiments are
expensive and most experiments fail.

That hesitation is a tax on your thinking, and it's worse than the dollar amount, because it changes
how you work. The whole promise of agentic development is fast, cheap iteration — try something, look
at the result, adjust, try again. A pricing model that punishes iteration is quietly fighting the
workflow it's supposed to enable.

A cheap model deletes that tax. When a model costs you something like 75 cents per million tokens
instead of dollars per prompt, the little voice goes quiet. You stop pre-optimizing every message.
You can be wasteful. You can — and someone on the Merge Conflict podcast put this perfectly — go back
to saying "good morning" to the model like you used to, just because you're willing to spend a
fraction of a cent to be polite. That sounds frivolous. It isn't. It's the sound of the friction being
gone.

Not everything needs the smartest model in the world

The frontier keeps getting more capable and, lately, more expensive — Opus-class models, GPT-5.5-class
models, million-token context windows, deep reasoning by default. They're extraordinary. They're also
massive overkill for most of what we actually do all day.

Here's the take I'll stand behind: not every agentic coding task requires the biggest model. Most
of them don't need a million-token context window. Most don't need maximum reasoning effort burning
thousands of thinking tokens. A huge fraction of real work — scaffolding, refactors, wiring up a known
pattern, writing a PRD, planning a feature, mechanical edits — is comfortably within reach of a small,
fast, cheap model. Reaching for a frontier model for all of it is like taking a freight train to the
corner store.

The skill that actually matters, then, isn't "use the best model." It's right-sizing the model to the
task. Cheap-and-fast for the 80% that's routine; expensive-and-smart for the 20% that's genuinely
hard. The developers who get good at that triage will ship more, for less, with less waiting — and
they'll spend their token budget where it actually buys them something.

Small, cheap models got shockingly good

The reason this is even a debate now is that the small end of the market got good. We're seeing
sub-30B-active-parameter models (often mixture-of-experts designs that only light up a few billion
parameters per token) posting numbers that would have been frontier-class not long ago. Microsoft's new
MAI family is a clean example: the thinking model lands in the neighborhood of a much pricier flagship
on coding benchmarks while burning fewer tokens, and the little "Flash" coding variant is pitched
against a Haiku-class model — cheaper, faster, and tuned to do real work.

And that's just the hosted side. On the open-weight side, the Qwen-class coding models, the new
multimodal Gemma releases, and a steady drip of others mean you can run something genuinely useful on
your own machine — especially as unified-memory hardware (Apple Silicon, the new NVIDIA-class
mini-machines with 128GB of unified RAM) makes hosting a 100B-plus parameter model at home go from
fantasy to "expensive but real." The local models aren't toys anymore. That's the whole reason the
disease is spreading.

Tokens wasted is money burned — so tuning to the harness matters

There's a subtler cost most people don't count: the tokens a model wastes. Anyone who's run an
agentic coder has watched it fumble a file edit — "failed to apply patch, I'll rewrite the whole file
from scratch" — and felt their stomach drop, because that botched merge just cost real money for
negative progress. Multiply that across a session and the waste is enormous.

This is why I'm genuinely excited about models being trained against the actual tool harness they'll
run in — the edit format, the file operations, the agentic loop. A model that was taught how its tools
behave makes fewer of those stupid, expensive mistakes. When a vendor says their coding model uses
~60% fewer tokens for the same task, the headline isn't just "cheaper" — it's "fewer failed merges,
fewer rewrites, less latency, smoother loops." Efficiency and reliability turn out to be the same
property viewed from two angles, and both of them are price.

(One honest caveat on comparing token counts: different models tokenize differently, so "fewer tokens"
is only a fair fight when two models share a tokenizer. When they don't, treat the number as a vibe,
not a measurement.)

If you take cheap-and-local seriously, you'll quickly want to know: is this little model actually good
enough for my work? And here's the uncomfortable answer — the public benchmarks won't tell you.
Leaderboards get gamed, contaminated, and optimized-for. The single most trustworthy signal left is the
blind A/B human eval: give model A your real prompt five times, give model B the same prompt five
times, then look at the ten results without knowing which is which and pick the ones you actually like.

It's not fancy, but it's the only test that survives contact with reality, and it's the right way to
answer "can my cheap model replace my expensive one for this job?" Set expectations honestly — if a
vendor compares their model to a Haiku-class tier, believe them and judge it on that scale — then prove
it on your tasks, not someone's benchmark suite.

Where this is heading: routing and fine-tuning

Two things follow from all of this, and I think they're the near future of building with AI.

First, routing. Once you accept that different tasks deserve different models, the natural move is
to stop hard-coding one provider and start sending each request to the cheapest model that can do the
job — falling back to the expensive one only when you need it. Aggregators like OpenRouter and
self-hosted gateways make this practical today, and "bring your own key / bring your own model" is
becoming a checkbox feature in the coding tools precisely because nobody wants to be locked to one
pricey default.

Second, fine-tuning. There's only so far you can push a model with context windows and system
prompts. The real lever — especially for businesses sitting on domain data — is tuning a small,
cheap, open model to your task until it punches well above its size. Frontier-tuning services are
showing up to make this turnkey, and the logic is irresistible: a small model you've shaped to your
workflow can beat a giant general-purpose one on the narrow thing you actually do, at a fraction of the
cost.

The disease is the right instinct

So no, I don't want a cure for the local-model disease. The obsession with cheap, hostable, runs-on-my-
own-hardware models isn't hoarding for its own sake — it's a bet that price is a feature, that
removing the mental tax on every prompt changes how well you can think, and that the future belongs to
people who route the right task to the right model instead of paying frontier prices for corner-store
work. The flagships are spectacular and I'll keep one in my toolbox for the genuinely hard problems.
But day to day? Cheap and local is winning, and it deserves a lot more respect than the hype cycle
gives it.

Next: a practical guide to actually running cheap and local AI
— OpenRouter, gateways, local model runners, and how to route a task to the right model.