Reading Image-Only PDFs with a Local Vision Model (LM Studio)

While building mcp.sv — a search index over El Salvador's public legal
and fiscal documents — I hit a wall that anyone who has scraped a government site knows well:
Half the "PDFs" are not PDFs at all. They're scans: a photo of a printed page, wrapped
in a PDF container, with no text layer. pdftotext returns an empty string.
You can't search what you can't read. So before any of it could be chunked, embedded, or
served, I needed to turn those page-images back into text — and I wanted to do it locally,
on my own hardware, for three concrete reasons:
- Privacy: these are public documents, but the pipeline runs on a private Mac Studio
  behind a residential connection. Nothing gets uploaded to a third-party OCR API.
- Cost: there are thousands of pages. Per-call cloud OCR pricing adds up fast; local
  inference is a one-time model download.
- Offline and repeatable: I can re-run the whole ingest without depending on anyone's API
  staying up or their pricing staying the same.
This post is the vision-model approach: a multimodal LLM running in LM Studio
does the OCR. (In a companion post I do the same job with Azure Foundry Local and a
different split of labor.)
The architecture in one sentence
LM Studio exposes an OpenAI-compatible server on http://localhost:1234. A vision-capable
model loaded there will accept an image in a chat message and transcribe it. So the whole trick
is: render each PDF page to a PNG, base64-encode it, and POST it to /v1/chat/completions
as an image_url part. Your code talks plain OpenAI JSON — it has no idea the model is on
the same machine.
Here's the config object from the .NET crawler. Note the defaults:
public sealed class LmStudioOptions
{
    /// <summary>Base URL of LM Studio's OpenAI-compatible server. Default: localhost:1234.</summary>
    public string BaseUrl { get; set; } = "http://localhost:1234";

    /// <summary>Embedding model name as loaded in LM Studio. Must produce 768-dim vectors.</summary>
    public string EmbeddingModel { get; set; } = "nomic-embed-text-v1.5";

    /// <summary>Chat / summarization model name as loaded in LM Studio.</summary>
    public string ChatModel { get; set; } = "qwen3-7b";

    /// <summary>Vision-capable model for OCR of scanned PDF pages. Optional; only used by pdf-ingest.</summary>
    public string VisionModel { get; set; } = "qwen/qwen3.6-35b-a3b";

    /// <summary>Max tokens for OCR / vision replies. Reasoning models burn tokens before emitting content.</summary>
    public int VisionMaxTokens { get; set; } = 4000;

    public TimeSpan Timeout { get; set; } = TimeSpan.FromMinutes(10);
}
A few things that bit me are already encoded there: the Timeout is ten minutes (vision
inference is slow), and VisionMaxTokens is generous because reasoning models spend tokens
"thinking" before they emit a single character of the transcription.
Don't OCR what you don't have to
OCR is the expensive path. Most PDFs do have a real text layer, and pulling that out with
pdftotext is instant and lossless. So the first job is a cheap heuristic: extract the text
layer, divide by the page count, and only fall back to OCR if the result is suspiciously thin.
// 2. pdftotext fast path
var pages = RunPdfInfoPageCount(tempPdf);
var textOnly = RunPdfToText(tempPdf);
var charsPerPage = pages > 0 ? textOnly.Length / pages : textOnly.Length;
if (charsPerPage >= MinCharsPerPage) // default 200
{
    return new PdfExtraction(textOnly.Trim(), fileName, pages, UsedOcr: false, Bytes: bytes);
}

// 3. OCR fallback: render each page, send to the vision model
var ocrSb = new StringBuilder();
for (var page = 1; page <= pages; page++)
{
    ct.ThrowIfCancellationRequested();
    var pageText = await OcrSinglePageAsync(tempPdf, page, ct);
    ocrSb.AppendLine($"--- página {page} ---");
    ocrSb.AppendLine(pageText.Trim());
    ocrSb.AppendLine();
}
Under ~200 characters per page, I treat the document as a scan and OCR every page.
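The page-count helper is not shown in the snippet. As a rough sketch, and assuming it shells out to poppler's pdfinfo and reads the "Pages:" line from its output, it could look like the following. ParsePageCount and NeedsOcr are hypothetical names introduced here for illustration, and the body of RunPdfInfoPageCount is a guess at its shape, not the crawler's actual code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical parser for pdfinfo output, which contains a line like "Pages:          12".
static int ParsePageCount(string pdfInfoOutput)
{
    foreach (var line in pdfInfoOutput.Split('\n'))
    {
        if (!line.StartsWith("Pages:", StringComparison.Ordinal)) continue;
        var value = line.Substring("Pages:".Length).Trim();
        if (int.TryParse(value, out var pages)) return pages;
    }
    return 0; // no page count found; the fast path already handles pages == 0
}

// Assumed shape of the helper: run `pdfinfo <file>` and parse its stdout.
static int RunPdfInfoPageCount(string pdfPath)
{
    var psi = new ProcessStartInfo("pdfinfo")
    {
        RedirectStandardOutput = true,
        UseShellExecute = false,
    };
    psi.ArgumentList.Add(pdfPath); // ArgumentList handles quoting of paths with spaces
    using var process = Process.Start(psi)
        ?? throw new InvalidOperationException("Could not start pdfinfo.");
    var stdout = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    return ParsePageCount(stdout);
}

// Same decision as the fast path: a thin text layer means "treat it as a scan".
static bool NeedsOcr(int textLength, int pages, int minCharsPerPage = 200)
{
    var charsPerPage = pages > 0 ? textLength / pages : textLength;
    return charsPerPage < minCharsPerPage;
}

Console.WriteLine(ParsePageCount("Title: x\nPages:          12\n")); // 12
Console.WriteLine(NeedsOcr(textLength: 900, pages: 10));               // True (90 chars/page)
```

Keeping the parsing and the threshold check as pure functions makes the heuristic easy to unit-test without poppler installed.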
Real-world bonus trap: one source (asamblea.gob.sv) prefixes its PDF payloads with a few
garbage bytes before the %PDF- magic number. The ingester sniffs for the header within
the first kilobyte and slices off the junk before handing the file to poppler. Government
data is never clean.
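The header sniff is simple enough to sketch. This is a minimal version under stated assumptions: StripLeadingJunk is a hypothetical name, the 1024-byte window stands in for "the first kilobyte", and input with no header in that window is returned unchanged so the downstream tool can report the failure.

```csharp
using System;
using System.Text;

// Hypothetical helper: find "%PDF-" within the first searchWindow bytes
// and drop whatever precedes it.
static byte[] StripLeadingJunk(byte[] bytes, int searchWindow = 1024)
{
    ReadOnlySpan<byte> magic = Encoding.ASCII.GetBytes("%PDF-");
    ReadOnlySpan<byte> window = bytes.AsSpan(0, Math.Min(bytes.Length, searchWindow));

    var offset = window.IndexOf(magic);
    if (offset <= 0) return bytes; // 0: already clean, -1: no header in the window

    return bytes.AsSpan(offset).ToArray();
}

// Example with a made-up payload, not a real asamblea.gob.sv file.
var dirty = Encoding.ASCII.GetBytes("xx\n%PDF-1.7\nrest");
var clean = StripLeadingJunk(dirty);
Console.WriteLine(Encoding.ASCII.GetString(clean, 0, 8)); // %PDF-1.7
```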
Rendering a page to PNG
This is the boring-but-essential part, and it's just poppler-utils (brew install poppler).
pdftoppm rasterizes one page to a PNG at a chosen DPI:
private async Task<string> OcrSinglePageAsync(string pdfPath, int page, CancellationToken ct)
{
    var stem = Path.Combine(Path.GetTempPath(), $"pdfingest-page-{Guid.NewGuid():N}");

    // pdftoppm produces: <stem>-<page>.png (zero-padded if total >= 10)
    Run("pdftoppm",
        $"-png -r {RenderDpi} -f {page} -l {page} \"{pdfPath}\" \"{stem}\"");

    var dir = Path.GetDirectoryName(stem)!;
    var files = Directory.GetFiles(dir, Path.GetFileName(stem) + "-*.png");
    if (files.Length == 0)
        throw new InvalidOperationException($"pdftoppm produced no PNG for page {page}.");

    var pngPath = files[0];
    var bytes = await File.ReadAllBytesAsync(pngPath, ct);
    var text = await _ai.OcrPageAsync(
        pngBytes: bytes,
        instruction: "Transcribe ALL visible text from this scanned page, verbatim, in Spanish. " +
                     "Preserve section headings and bullet markers. Output only the transcription, no commentary. " +
                     "If a region is unreadable, write [ilegible] for that fragment.",
        ct: ct);

    // ...cleanup of temp PNGs omitted...
    return text;
}
RenderDpi defaults to 144 — high enough for the model to read body text, low enough to
keep the PNG small. The instruction prompt matters more than people expect: telling the
model to transcribe verbatim, output only the transcription (no "Here is the text you
asked for…"), keep headings, and emit [ilegible] for unreadable regions, all measurably
improved the output quality.
The actual local-model call
And here's the whole reason this works without a vendor SDK. A vision request in the
OpenAI Chat Completions format is just a user message whose content is an array of
parts — a text part and an image_url part. The image goes in as a data: URL with
base64 bytes:
public async Task<string> OcrPageAsync(byte[] pngBytes, string instruction, CancellationToken ct)
{
    var dataUrl = "data:image/png;base64," + Convert.ToBase64String(pngBytes);
    var req = new VisionChatRequest(
        Model: _options.VisionModel,
        MaxTokens: _options.VisionMaxTokens,
        Messages: new[]
        {
            new VisionMessage("user", new VisionPart[]
            {
                new() { Type = "text", Text = instruction },
                new() { Type = "image_url", ImageUrl = new VisionImageUrl(dataUrl) },
            }),
        });

    using var resp = await _http.PostAsJsonAsync("/v1/chat/completions", req, JsonOpts, ct);
    resp.EnsureSuccessStatusCode();

    var payload = await resp.Content.ReadFromJsonAsync<ChatResponse>(JsonOpts, ct)
        ?? throw new InvalidOperationException("LM Studio returned an empty vision response.");
    if (payload.Choices.Count == 0)
        throw new InvalidOperationException("LM Studio vision response had no choices.");
    return payload.Choices[0].Message.Content ?? string.Empty;
}
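Because a reasoning model can spend its whole token budget before emitting any transcription, the content coming back from a call like this may be empty or cut short. One possible guard, sketched here on the raw response JSON, assumes the server fills in the standard Chat Completions finish_reason field, where "length" means the token cap was hit. LooksTruncated is a hypothetical helper and is not part of the crawler.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: true when the first choice stopped at the token cap,
// or when it came back without any usable content.
static bool LooksTruncated(string responseJson)
{
    using var doc = JsonDocument.Parse(responseJson);
    var choices = doc.RootElement.GetProperty("choices");
    if (choices.GetArrayLength() == 0) return true;
    var choice = choices[0];

    // "length" is the Chat Completions value for "stopped at max_tokens".
    var hitCap = choice.TryGetProperty("finish_reason", out var reason)
                 && reason.ValueKind == JsonValueKind.String
                 && reason.GetString() == "length";

    var hasContent = choice.TryGetProperty("message", out var message)
                     && message.TryGetProperty("content", out var content)
                     && content.ValueKind == JsonValueKind.String
                     && !string.IsNullOrWhiteSpace(content.GetString());

    return hitCap || !hasContent;
}

// Hand-written response body for illustration, not real model output.
var sample = "{\"choices\":[{\"finish_reason\":\"length\",\"message\":{\"role\":\"assistant\",\"content\":\"\"}}]}";
Console.WriteLine(LooksTruncated(sample)); // True
```

A page flagged this way could be retried with a larger token limit or marked for manual review; which of those is appropriate depends on the pipeline.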
The serialized request on the wire is exactly what you'd send to OpenAI:
{
  "model": "qwen/qwen3.6-35b-a3b",
  "max_tokens": 4000,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe ALL visible text..." },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
      ]
    }
  ]
}
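The request records (VisionChatRequest, VisionMessage, VisionPart, VisionImageUrl) are not shown in this post. To see how little is needed to produce that payload, here is a self-contained sketch that uses anonymous objects as a stand-in for them. BuildVisionRequestJson is a hypothetical name, and the eight-byte PNG signature is only a placeholder for a real rendered page.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for the crawler's request records: anonymous objects
// whose property names already match the snake_case wire format.
static string BuildVisionRequestJson(string model, int maxTokens, string instruction, byte[] pngBytes)
{
    var dataUrl = "data:image/png;base64," + Convert.ToBase64String(pngBytes);
    var request = new
    {
        model,
        max_tokens = maxTokens,
        messages = new object[]
        {
            new
            {
                role = "user",
                content = new object[]
                {
                    new { type = "text", text = instruction },
                    new { type = "image_url", image_url = new { url = dataUrl } },
                },
            },
        },
    };
    return JsonSerializer.Serialize(request, new JsonSerializerOptions { WriteIndented = true });
}

// The first eight bytes of every PNG file, used here as a tiny placeholder payload.
var pngSignature = new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
Console.WriteLine(BuildVisionRequestJson(
    "qwen/qwen3.6-35b-a3b", 4000, "Transcribe ALL visible text...", pngSignature));
```

Typed records, as in the crawler, are the better choice for production code; the anonymous-object version is only meant to make the wire shape visible in one place.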
To make this run you just open LM Studio, load a vision-capable model (I used a Qwen-VL
variant), and hit Developer → Local Server → Start. The model is now answering on
localhost:1234. Point BaseUrl at it and you're done — same code path you'd use against
the real API, minus the API key and the invoice.
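Before starting a long ingest, it can be worth confirming that the server is up and the expected model is listed. The sketch below assumes the server answers GET /v1/models with the usual OpenAI-style list, that is, an object with a data array of entries that each carry an id. ParseModelIds and IsModelAvailableAsync are hypothetical helpers, not part of the crawler.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper: pull the model ids out of an OpenAI-style /v1/models response.
static List<string> ParseModelIds(string modelsJson)
{
    var ids = new List<string>();
    using var doc = JsonDocument.Parse(modelsJson);
    foreach (var model in doc.RootElement.GetProperty("data").EnumerateArray())
    {
        var id = model.GetProperty("id").GetString();
        if (id is not null) ids.Add(id);
    }
    return ids;
}

// Hypothetical helper: ask the server which models it exposes and look for ours.
static async Task<bool> IsModelAvailableAsync(HttpClient http, string modelId)
{
    var json = await http.GetStringAsync("/v1/models");
    return ParseModelIds(json).Contains(modelId);
}

// Usage (needs LM Studio's server running, so it is left commented out):
// using var http = new HttpClient { BaseAddress = new Uri("http://localhost:1234") };
// Console.WriteLine(await IsModelAvailableAsync(http, "qwen/qwen3.6-35b-a3b"));
```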
The honest catch: it's slow
A 35-billion-parameter vision model is wonderful at reading messy scans, rotated pages, and
bad photocopies — and it is glacially slow: roughly 30 seconds per page, and it would
occasionally stall on 10–28 MB documents. For a handful of tricky pages, fine. For a few
thousand pages, that's a multi-day job.
That's the trade-off that defines the whole space: a vision LLM gives you the best quality
on the hardest pages, but a purpose-built OCR engine like Tesseract does ordinary scans at
~0.3 s/page on CPU. For mcp.sv I ended up using a vision model where it earns its keep and
a classic OCR engine for the bulk — which is exactly the subject of the companion post,
reading image-only PDFs with Azure Foundry Local.
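To put rough numbers on that trade-off, here is a back-of-the-envelope estimate using the per-page figures quoted above. The 6,000-page corpus is an assumed size chosen for illustration, processing is assumed to be strictly sequential on one machine, and EstimateOcrTime is a hypothetical helper.

```csharp
using System;

// Hypothetical helper: total wall-clock time for a sequential OCR run.
static TimeSpan EstimateOcrTime(int pages, double secondsPerPage) =>
    TimeSpan.FromSeconds(pages * secondsPerPage);

var corpusPages = 6000; // assumed corpus size, not a measured figure

Console.WriteLine(EstimateOcrTime(corpusPages, 30).TotalHours);    // 50 (vision LLM at ~30 s/page)
Console.WriteLine(EstimateOcrTime(corpusPages, 0.3).TotalMinutes); // 30 (classic OCR at ~0.3 s/page)
```

Two days versus half an hour for the same corpus is the gap that makes routing only the hard pages to the vision model worthwhile.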
Takeaways
- Detect, then OCR. pdftotext first; only rasterize-and-OCR when chars/page is too low.
- A local OpenAI-compatible server (LM Studio) means your OCR code is just plain Chat
  Completions JSON: base64 PNG in an image_url part. No vendor lock-in, no key, no bill.
- Prompt the transcription explicitly: verbatim, transcription-only, keep headings,
  [ilegible] for the unreadable bits.
- Vision LLMs are accuracy-first, throughput-last. Budget ~30 s/page and a long client
  timeout, and reserve them for the pages a cheaper engine can't handle.
Questions, or your own war stories with scanned government PDFs? Find me via the links on the
about page.