Reading Image-Only PDFs with a Local Vision Model (LM Studio)

While building mcp.sv — a search index over El Salvador's public legal
and fiscal documents — I hit a wall that anyone who has scraped a government site knows well:

Half the "PDFs" are not PDFs at all. They're scans. A photo of a printed page, wrapped
in a PDF container, with no text layer. pdftotext returns an empty string.

You can't search what you can't read. So before any of it could be chunked, embedded, or
served, I needed to turn those page-images back into text — and I wanted to do it locally,
on my own hardware, for three concrete reasons:

- Privacy: these are public documents, but the pipeline runs on a private Mac Studio
  behind a residential connection. Nothing gets uploaded to a third-party OCR API.
- Cost: there are thousands of pages. Per-call cloud OCR pricing adds up fast; local
  inference is a one-time model download.
- Offline & repeatable: I can re-run the whole ingest without depending on anyone's API
  staying up or their pricing staying the same.

This post is the vision-model approach: a multimodal LLM running in LM Studio
does the OCR. (In a companion post I do the same job with Azure Foundry Local and a
different split of labor.)

The architecture in one sentence

LM Studio exposes an OpenAI-compatible server on http://localhost:1234. A vision-capable
model loaded there will accept an image in a chat message and transcribe it. So the whole trick
is: render each PDF page to a PNG, base64-encode it, and POST it to /v1/chat/completions
as an image_url part. Your code talks plain OpenAI JSON — it has no idea the model is on
the same machine.

Here's the config object from the .NET crawler. Note the defaults:

public sealed class LmStudioOptions
{
    /// <summary>Base URL of LM Studio's OpenAI-compatible server. Default: localhost:1234.</summary>
    public string BaseUrl { get; set; } = "http://localhost:1234";

    /// <summary>Embedding model name as loaded in LM Studio. Must produce 768-dim vectors.</summary>
    public string EmbeddingModel { get; set; } = "nomic-embed-text-v1.5";

    /// <summary>Chat / summarization model name as loaded in LM Studio.</summary>
    public string ChatModel { get; set; } = "qwen3-7b";

    /// <summary>Vision-capable model for OCR of scanned PDF pages. Optional — only used by pdf-ingest.</summary>
    public string VisionModel { get; set; } = "qwen/qwen3.6-35b-a3b";

    /// <summary>Max tokens for OCR / vision replies. Reasoning models burn tokens before emitting content.</summary>
    public int VisionMaxTokens { get; set; } = 4000;

    public TimeSpan Timeout { get; set; } = TimeSpan.FromMinutes(10);
}

A few things that bit me are already encoded there: the Timeout is ten minutes (vision
inference is slow), and VisionMaxTokens is generous because reasoning models spend tokens
"thinking" before they emit a single character of the transcription.

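One way to notice when VisionMaxTokens was too small is to look at finish_reason in the reply. In the OpenAI Chat Completions format it is "length" when generation stopped at the token limit. The sketch below assumes LM Studio fills that field the same way; VisionReply and VisionReplyParser are hypothetical names, not part of the crawler.

```csharp
using System.Text.Json;

// Hypothetical result type: the transcription plus a flag for "cut off at max_tokens".
public sealed record VisionReply(string Content, bool Truncated);

public static class VisionReplyParser
{
    // Reads choices[0] from a Chat Completions response body.
    // Assumption: the server reports finish_reason == "length" on truncation.
    public static VisionReply Parse(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        var choices = doc.RootElement.GetProperty("choices");
        if (choices.GetArrayLength() == 0)
            return new VisionReply(string.Empty, Truncated: false);

        var choice = choices[0];

        var truncated =
            choice.TryGetProperty("finish_reason", out var reason) &&
            reason.ValueKind == JsonValueKind.String &&
            reason.GetString() == "length";

        // content can be null or missing when the whole budget went to reasoning.
        var content =
            choice.TryGetProperty("message", out var message) &&
            message.TryGetProperty("content", out var text) &&
            text.ValueKind == JsonValueKind.String
                ? text.GetString() ?? string.Empty
                : string.Empty;

        return new VisionReply(content, truncated);
    }
}
```

A caller could retry the page with a larger token budget when Truncated is true, rather than silently storing a partial transcription.
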
Don't OCR what you don't have to

OCR is the expensive path. The PDFs that do have a real text layer give it up to pdftotext
almost instantly and with no recognition errors. So the first job is a cheap heuristic: extract
the text layer, divide its length by the page count, and only fall back to OCR if the result is
suspiciously thin.

// 2. pdftotext fast path
var pages = RunPdfInfoPageCount(tempPdf);
var textOnly = RunPdfToText(tempPdf);
var charsPerPage = pages > 0 ? textOnly.Length / pages : textOnly.Length;

if (charsPerPage >= MinCharsPerPage)   // default 200
{
    return new PdfExtraction(textOnly.Trim(), fileName, pages, UsedOcr: false, Bytes: bytes);
}

// 3. OCR fallback — render each page, send to the vision model
var ocrSb = new StringBuilder();
for (var page = 1; page <= pages; page++)
{
    ct.ThrowIfCancellationRequested();
    var pageText = await OcrSinglePageAsync(tempPdf, page, ct);
    ocrSb.AppendLine($"--- página {page} ---");
    ocrSb.AppendLine(pageText.Trim());
    ocrSb.AppendLine();
}

Under ~200 characters per page, I treat the document as a scan and OCR every page.
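The two helpers called in the fast path are not shown in the post. A minimal sketch of how they might look, assuming pdfinfo and pdftotext from poppler-utils are on the PATH; PopplerSketch and its method names are hypothetical. It relies on pdfinfo printing a "Pages:" line and on pdftotext writing to stdout when the output file is "-".

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PopplerSketch
{
    // Runs a command-line tool and returns its stdout; throws on a non-zero exit code.
    public static string RunTool(string fileName, params string[] args)
    {
        var psi = new ProcessStartInfo(fileName)
        {
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8, // keep accented Spanish characters intact
            UseShellExecute = false,
        };
        foreach (var arg in args) psi.ArgumentList.Add(arg);

        using var proc = Process.Start(psi)
            ?? throw new InvalidOperationException($"Could not start {fileName}.");
        var stdout = proc.StandardOutput.ReadToEnd();
        proc.WaitForExit();
        if (proc.ExitCode != 0)
            throw new InvalidOperationException($"{fileName} exited with code {proc.ExitCode}.");
        return stdout;
    }

    // Pulls the number out of the "Pages:          12" line in pdfinfo's output.
    // Returns 0 when no such line is found.
    public static int ParsePageCount(string pdfInfoOutput)
    {
        foreach (var line in pdfInfoOutput.Split('\n'))
        {
            if (line.StartsWith("Pages:", StringComparison.Ordinal) &&
                int.TryParse(line.Substring("Pages:".Length).Trim(), out var count))
                return count;
        }
        return 0;
    }

    public static int PageCount(string pdfPath) =>
        ParsePageCount(RunTool("pdfinfo", pdfPath));

    // "-" as the output file makes pdftotext write the text layer to stdout.
    public static string ExtractText(string pdfPath) =>
        RunTool("pdftotext", pdfPath, "-");
}
```

Returning 0 for an unreadable page count lines up with the pages > 0 guard in the fast path, so a malformed file falls through to the length check instead of dividing by zero.
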

Real-world bonus trap: one source (asamblea.gob.sv) prefixes its PDF payloads with a few
garbage bytes before the %PDF- magic number. The ingester sniffs for the header within
the first kilobyte and slices off the junk before handing the file to poppler. Government
data is never clean.
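The header sniff described above can be sketched as follows. This is an illustration under assumptions, not the ingester's actual code: PdfHeaderSniffer is a hypothetical name, and the 1024-byte window simply mirrors "the first kilobyte".

```csharp
using System;
using System.IO;

public static class PdfHeaderSniffer
{
    // ASCII bytes of "%PDF-", the marker every PDF file starts with.
    private static readonly byte[] Magic = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // Returns the payload starting at the %PDF- header. Throws if the header
    // does not appear within the first searchWindow bytes.
    public static byte[] StripLeadingJunk(byte[] payload, int searchWindow = 1024)
    {
        var window = new ReadOnlySpan<byte>(payload, 0, Math.Min(payload.Length, searchWindow));
        var offset = window.IndexOf(new ReadOnlySpan<byte>(Magic));
        if (offset < 0)
            throw new InvalidDataException($"No %PDF- header in the first {searchWindow} bytes.");

        // Already clean: hand back the same array and skip the copy.
        return offset == 0 ? payload : payload.AsSpan(offset).ToArray();
    }
}
```

Failing loudly when no header is found keeps HTML error pages and other non-PDF responses from reaching poppler at all.
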

Rendering a page to PNG

This is the boring-but-essential part, and it's just poppler-utils (brew install poppler).
pdftoppm rasterizes one page to a PNG at a chosen DPI:

private async Task<string> OcrSinglePageAsync(string pdfPath, int page, CancellationToken ct)
{
    var stem = Path.Combine(Path.GetTempPath(), $"pdfingest-page-{Guid.NewGuid():N}");

    // pdftoppm produces: <stem>-<page>.png (zero-padded if total >= 10)
    Run("pdftoppm",
        $"-png -r {RenderDpi} -f {page} -l {page} \"{pdfPath}\" \"{stem}\"");

    var dir = Path.GetDirectoryName(stem)!;
    var files = Directory.GetFiles(dir, Path.GetFileName(stem) + "-*.png");
    if (files.Length == 0)
        throw new InvalidOperationException($"pdftoppm produced no PNG for page {page}.");
    var pngPath = files[0];

    var bytes = await File.ReadAllBytesAsync(pngPath, ct);
    var text = await _ai.OcrPageAsync(
        pngBytes: bytes,
        instruction: "Transcribe ALL visible text from this scanned page, verbatim, in Spanish. " +
                     "Preserve section headings and bullet markers. Output only the transcription, no commentary. " +
                     "If a region is unreadable, write [ilegible] for that fragment.",
        ct: ct);
    // ...cleanup of temp PNGs omitted (it has to run before this return, e.g. in a finally block)...
    return text;
}

RenderDpi defaults to 144 — high enough for the model to read body text, low enough to
keep the PNG small. The instruction prompt matters more than people expect: telling the
model to transcribe verbatim, output only the transcription (no "Here is the text you
asked for…"), keep headings, and emit [ilegible] for unreadable regions, all measurably
improved the output quality.

The actual local-model call

And here's the whole reason this works without a vendor SDK. A vision request in the
OpenAI Chat Completions format is just a user message whose content is an array of
parts — a text part and an image_url part. The image goes in as a data: URL with
base64 bytes:

public async Task<string> OcrPageAsync(byte[] pngBytes, string instruction, CancellationToken ct)
{
    var dataUrl = "data:image/png;base64," + Convert.ToBase64String(pngBytes);
    var req = new VisionChatRequest(
        Model: _options.VisionModel,
        MaxTokens: _options.VisionMaxTokens,
        Messages: new[]
        {
            new VisionMessage("user", new VisionPart[]
            {
                new() { Type = "text",      Text = instruction },
                new() { Type = "image_url", ImageUrl = new VisionImageUrl(dataUrl) },
            }),
        });

    using var resp = await _http.PostAsJsonAsync("/v1/chat/completions", req, JsonOpts, ct);
    resp.EnsureSuccessStatusCode();

    var payload = await resp.Content.ReadFromJsonAsync<ChatResponse>(JsonOpts, ct)
        ?? throw new InvalidOperationException("LM Studio returned an empty vision response.");
    if (payload.Choices.Count == 0)
        throw new InvalidOperationException("LM Studio vision response had no choices.");

    return payload.Choices[0].Message.Content ?? string.Empty;
}

The serialized request on the wire is exactly what you'd send to OpenAI:

{
  "model": "qwen/qwen3.6-35b-a3b",
  "max_tokens": 4000,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe ALL visible text..." },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
      ]
    }
  ]
}
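The request types used in OcrPageAsync are not shown in the post. A minimal sketch of records that would serialize to the JSON above with System.Text.Json follows. The type names come from the snippet, but the property attributes and the VisionJson.Options helper are assumptions about how the crawler's JsonOpts might be set up.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

public sealed record VisionChatRequest(
    [property: JsonPropertyName("model")] string Model,
    [property: JsonPropertyName("max_tokens")] int MaxTokens,
    [property: JsonPropertyName("messages")] VisionMessage[] Messages);

public sealed record VisionMessage(
    [property: JsonPropertyName("role")] string Role,
    [property: JsonPropertyName("content")] VisionPart[] Content);

// One content part: either a text part or an image_url part.
public sealed class VisionPart
{
    [JsonPropertyName("type")] public string Type { get; init; } = "text";
    [JsonPropertyName("text")] public string? Text { get; init; }
    [JsonPropertyName("image_url")] public VisionImageUrl? ImageUrl { get; init; }
}

public sealed record VisionImageUrl(
    [property: JsonPropertyName("url")] string Url);

public static class VisionJson
{
    // Skipping nulls keeps "text" off image parts and "image_url" off text parts.
    public static readonly JsonSerializerOptions Options = new()
    {
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
    };
}
```

Passing VisionJson.Options to JsonSerializer.Serialize (or to PostAsJsonAsync) yields the snake_case field names of the wire format without a global naming policy.
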

To make this run you just open LM Studio, load a vision-capable model (I used a Qwen-VL
variant), and hit Developer → Local Server → Start. The model is now answering on
localhost:1234. Point BaseUrl at it and you're done — same code path you'd use against
the real API, minus the API key and the invoice.

The honest catch: it's slow

A 35-billion-parameter vision model is wonderful at reading messy scans, rotated pages, and
bad photocopies — and it is glacially slow: roughly 30 seconds per page, and it would
occasionally stall on 10–28 MB documents. For a handful of tricky pages, fine. At that rate,
5,000 pages would take about 42 hours of continuous inference, so a few thousand pages is a
multi-day job.

That's the trade-off that defines the whole space: a vision LLM gives you the best quality
on the hardest pages, but a purpose-built OCR engine like Tesseract does ordinary scans at
~0.3 s/page on CPU. For mcp.sv I ended up using a vision model where it earns its keep and
a classic OCR engine for the bulk — which is exactly the subject of the companion post,
reading image-only PDFs with Azure Foundry Local.

Takeaways

- Detect, then OCR. pdftotext first; only rasterize-and-OCR when chars/page is too low.
- A local OpenAI-compatible server (LM Studio) means your OCR code is just plain Chat
  Completions JSON: a base64 PNG in an image_url part. No vendor lock-in, no key, no bill.
- Prompt the transcription explicitly: verbatim, transcription-only, keep headings,
  [ilegible] for the unreadable bits.
- Vision LLMs are accuracy-first, throughput-last. Budget ~30 s/page and a long client
  timeout, and reserve them for the pages a cheaper engine can't handle.

Questions, or your own war stories with scanned government PDFs? Find me via the links on the
about page.