Jun 24, 20268 min read/2026/06/24/deepseek-v4-vision-thinking-with-visual-primitives/

The Vision Model That Wasn't There: DeepSeek V4, a Vanished Paper, and a Recipe You Can Use Today

I had a clean little plan for a weekend: take a photo of a shelf, have a model return every item with a bounding box and a few attributes, and drop that into a personal inventory app. The model I'd lined up was DeepSeek V4, on the strength of a description that sounded purpose-built for the job — something about "full reinforcement-learning-backed reasoning to calculate bounding boxes and relative dimensions step by step before formatting the data," with a Pro tier for messy scenes and a Flash tier for speed.

It reads like a spec sheet for exactly what I wanted. There's just one issue.

DeepSeek V4 cannot see images. Not "it's not very good at it." It does not accept image input at all.

Here's how I found that out — three independent ways — the research paper that explains why the claim isn't pure fantasy, and the part that ended up actually mattering: what I built once I stopped chasing a model that doesn't ship.

Strike one: the API rejects the image before it even picks a model

My first move was the obvious one: send the model an image and ask what's in it. The models are real and live — deepseek-v4-pro and deepseek-v4-flash, both with a 1M-token context window, JSON output, and tool calls. So I built a standard OpenAI-style chat request with an image_url content part and fired it off.

{
  "error": {
    "message": "Failed to deserialize the JSON body into the target type:
                 messages[0]: unknown variant `image_url`, expected `text`",
    ...
  }
}

Read that error carefully, because the location of the failure is the whole story. This isn't "the model looked at your image and got confused." It's the API's request parser rejecting the shape of the message before any model is even consulted. The schema for a message only allows text content. There is no door for an image to walk through.

A different API key won't change that. A "config fix" won't change that. The contract itself has no image input.

Strike two: the catalog agrees

Maybe, I thought, there's an undocumented vision endpoint or a model name not in the list. So I checked the broader catalog — every DeepSeek model exposed through an aggregator, with their declared input modalities:

deepseek-v4-pro          modalities=['text']
deepseek-v4-flash        modalities=['text']
deepseek-v3.2            modalities=['text']
deepseek-chat-v3.1       modalities=['text']
deepseek-r1              modalities=['text']
... (11 total)

Eleven DeepSeek models. Every single one: text. I probed half a dozen plausible vision-model names directly — deepseek-vl, deepseek-vl2, deepseek-vision, deepseek-v4-pro-vision, even a beta endpoint. Identical rejection every time, at the same JSON-deserialization layer.

Strike three: the docs say nothing because there's nothing to say

Last stop, the official documentation. The model pages list exactly four names (deepseek-v4-pro, deepseek-v4-flash, and the now-deprecated deepseek-chat/deepseek-reasoner, which retire on 2026-07-24 and fold into V4-Flash). The documented capabilities: JSON mode, function calling, reasoning/thinking mode, FIM, 1M context.

The word "image" does not appear. Neither does "vision" or "multimodal." Not as a limitation, not as a roadmap item — it simply isn't part of the product.

Three independent sources, one conclusion: there is no DeepSeek model, on any endpoint, that accepts an image. The "V4 does bounding boxes from photos" description is describing something that you cannot call.

So where did the claim come from? Because it's too specific to be a hallucination. "Bounding boxes and relative dimensions, step by step" is not the kind of thing a marketing bot invents from nothing.

The plot twist: the paper that was published, then pulled

It turns out DeepSeek absolutely did this work. It lives in a research paper called "Thinking with Visual Primitives" — and the reason you may not have heard of it is that it was published, cited, and then quietly pulled. The original repository 404s; what survives is an archived community mirror with the technical report and an MIT license, but no model weights.

The idea is genuinely lovely, and worth knowing even if you can't run it. Most multimodal models reason in pure language — they "see" an image, then think about it in words. The paper's argument is that language is a lossy pointer: in a dense scene, the phrase "the cup behind the other cup" collapses, and the model loses track of which object it meant. They call this the Reference Gap.

Their fix: make spatial markers — points and bounding boxes — into "minimal units of thought," and interleave them directly into the reasoning trace. The model literally points while it reasons. The boxes aren't a final answer formatted at the end; they're scaffolding the model leans on mid-thought to keep itself anchored to physical coordinates.

A few details I found striking:

  • It's built on the DeepSeek-V4-Flash backbone (a 284B-parameter Mixture-of-Experts, ~13B active) plus an in-house vision encoder. So the marketing copy that confused me wasn't wrong about the lineage — it was describing the research model, not the text model that ships under the same name.
  • The efficiency is the headline. A 756×756 image becomes ~324 visual tokens, and a compressed-attention trick squeezes the KV cache down to roughly 81 entries — an overall compression on the order of 7,000×. That's why it can think hard about an image without a giant token bill.
  • On counting, spatial-reasoning, and topological (maze / path-tracing) benchmarks, it reportedly matches or beats GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash — at a fraction of the visual-token budget.

And the output format, which is the part you can actually steal:

<|ref|>cordless drill<|/ref|><|box|>[[x1,y1,x2,y2]]<|/box|>

Boxes are [x1,y1,x2,y2] — top-left and bottom-right — normalized to integers 0–999. Points use <|point|>[[x,y]]<|/point|>. The reasoning follows a fixed protocol: intent analysis → batch grounding (locate everything at once) → summation, with an explicit "faithful refusal" rule so the model says "not present" instead of inventing a box for something it can't see.

The lesson, before the code

Here's the thing I want to underline, because it's the reusable takeaway and it cost me a few hours to internalize:

A capability in a paper is not a capability in an API. A model that exists as weights-to-be-released-later is, for engineering purposes, a model that does not exist. But a method is portable even when the model isn't.

The weights are gone (for now — "to be integrated into the foundation model in the future"). The 284B MoE would be a serious self-hosting project even if they weren't. But the recipe — the box format, the point-while-you-reason protocol, faithful refusal — is just a way of prompting. And that I can run today, on a model that can actually see.

Porting the recipe to Gemini

So I pointed the exact same protocol at Gemini 2.5 Flash (with Pro as the heavyweight option), in a small standalone sandbox, and measured it on real photos.

One gotcha worth saving you: Gemini's native bounding-box format is [ymin, xmin, ymax, xmax] on a 0–1000 scale — the opposite axis order from the paper's [x1,y1,x2,y2]. If you don't pin the convention explicitly in the prompt, you'll get boxes that look plausible and are silently transposed. Say which one you want, every time.

What I found, testing on natural photos:

  • Identification is strong. On the classic two-cats-and-two-remotes test image it returned exactly that — two cats, two remotes — with usable color/category attributes.
  • Faithful refusal works. I asked it to find a dog and a bicycle that weren't in the frame; it returned refusals: ["dog", "bicycle"] and didn't fabricate a single box.
  • Localization is loose, and Gemini sometimes reverts to its own axis order even when told not to. Treat the boxes as approximate regions, not pixel-accurate crops.
  • On busy scenes it over-enumerates. A kitchen photo came back with 28 "items" — including the door, the window, the cabinets, the sink — and the same banana boxed three times.

That last point is the interesting one, because the fix isn't where you'd expect. I assumed I'd need clever box-deduplication math. I didn't. The real problems were relevance (a raw "list everything" prompt grabs the building, not your belongings) and granularity (five separate "orange" rows instead of "oranges ×5"). Two extra lines in the prompt — only include movable belongings, exclude fixtures and architecture; group identical objects with a count — took that kitchen photo from 28 noisy rows to 10 clean inventory items:

refrigerator     appliance
stove            appliance
microwave        appliance
coffee maker     appliance
fruit bowl       kitchenware
orange    ×5     fruit
banana    ×3     fruit
dish soap        cleaning supply
sponge           cleaning supply
chair     ×2     furniture

And because it emitted far fewer tokens, it got cheaper and faster at the same time — roughly a third of the cost and half the latency of the unfiltered version.

One more honest note: the confidence scores the model reports are nearly useless as a gate — almost everything comes back 0.85–0.98, including the items you'd want to reject. So if you build a bulk-capture feature on this, a human review-before-save step is the real safety net, not a confidence threshold.

What I'd tell my Friday-morning self

  • Verify model capabilities empirically, not from descriptions. One image request would have saved me the whole chase. Marketing copy — even technically-flavored copy — is not a contract. The API schema is.
  • Read the error's location, not just its text. A failure at the request-parsing layer means something categorically different from a failure during inference. The first tells you the capability doesn't exist; the second tells you you're holding it wrong.
  • When the model you want isn't shippable, steal its method. The visual-primitives output format and reasoning protocol cost nothing to adopt and made a model I can call behave more usefully.
  • The boring controls beat the clever ones. Relevance filtering and grouping did more for real-world quality than any box-math would have — and saved money doing it.

DeepSeek's vision model is, today, a beautiful paper and a promise. When the weights land it'll be worth revisiting — that token efficiency is no joke. But "later" doesn't ship features. The recipe does, right now, on a model that can actually open its eyes.