Reading Image-Only PDFs with a Local Model: Azure Foundry Local

In the previous post I extracted text from scanned, image-only PDFs using a vision LLM in
LM Studio: render each page to a PNG, send it to a local OpenAI-compatible endpoint, and get
the transcription back.
It works, and on the hardest pages it's the best option. But a 35B vision model runs at roughly
30 seconds per page, and mcp.sv has thousands of pages.

This post does the same job with a different local runtime: Azure Foundry Local,
Microsoft's on-device inference engine. The motivation is identical — privacy, cost, and
offline repeatability — but Foundry Local pushes you toward a cleaner, faster split of labor.

First, an honest constraint

Foundry Local ships a curated, ONNX-optimized catalog of roughly 25 open models, including
Phi-4-mini, Phi-3.5-mini, the Qwen2.5 family, DeepSeek-R1 distills, and Mistral. You can list
them with:

foundry model list
foundry model list --filter task=chat-completion

Notice the task types it filters on: chat-completion and text-generation. At the time of
writing, Foundry Local's on-device catalog is text-only: there is no multimodal/vision model
you can start with foundry model run and hand an image to. (Phi-4-multimodal and
Phi-3.5-vision live in the cloud Foundry catalog, not the local one.)

So unlike the LM Studio approach, I can't hand a page-image straight to the model. And
honestly? That nudge produced a better pipeline.

The design: separate the pixels from the reasoning

OCR is two jobs we tend to conflate:

1. Pixels → characters. Turning an image of text into raw text. This is a specialized
   computer-vision problem, and a 35B LLM is wildly over-qualified (and slow) for it.
2. Raw text → clean, structured text. OCR output is noisy: broken hyphenation, mangled
   headings, stray artifacts. This is a language problem, and a small local LLM is excellent
   at it.

Foundry Local can't do job #1, but it's perfect for job #2. So the pipeline becomes:

pdftoppm -r 200 -png        render each page to an image   (poppler)
        ↓
tesseract -l spa+eng --psm 1   image → raw text            (Tesseract OCR)
        ↓
Foundry Local  /v1/chat/completions   clean + structure + summarize   (Phi-4-mini / Qwen)
        ↓
chunk → embed → store

This is, in fact, the architecture mcp.sv settled on. The standalone ingest script says it
plainly:

The .NET pdf-ingest pipeline falls back to Qwen-vision OCR when pdftotext
yields <80 chars/page. That works but is glacially slow — ~30s/page on the
35B vision model, hanging on 10–28 MB docs. Tesseract is purpose-built for
this: ~0.3s/page on CPU, handles rotation via OSD, great Spanish quality.

Tesseract does the heavy lifting at ~0.3 s/page; the local LLM only sees text.
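
The fallback rule quoted above (use OCR when pdftotext yields fewer than 80 characters per page) can be sketched in a few lines. This is an illustration, not the mcp.sv code: the name pages_needing_ocr is hypothetical, and it assumes the threshold applies to each page separately and that pdftotext ends every page with a form feed character.

```python
def pages_needing_ocr(pdftotext_output, min_chars=80):
    """Return 1-based page numbers whose text layer is too thin to trust."""
    pages = pdftotext_output.split("\f")
    # Assumption: every page ends with a form feed, so the split leaves an empty tail
    if pages and pages[-1] == "":
        pages.pop()
    return [n for n, text in enumerate(pages, start=1)
            if len(text.strip()) < min_chars]
```

Only the pages this returns would need to go through the render-and-OCR path; pages with a real text layer can skip it.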

Step 1: pixels → raw text with Tesseract

Same pdftoppm rasterization as before, then tesseract instead of a vision model. The
--psm 1 mode runs orientation/script detection (OSD), which auto-corrects rotated scans —
a real problem with government archives:

pdftoppm -r 200 -png -f 3 -l 3 decreto.pdf /tmp/page    # → /tmp/page-3.png
tesseract /tmp/page-3.png stdout -l spa+eng --psm 1     # → raw text on stdout

In Python, each step is a single subprocess.run call:

import os
import subprocess

def ocr_one_page(pdf_path, page_num, work_dir, lang='spa+eng'):
    # Render only this page; pdftoppm writes <work_dir>/page-<padded>.png
    subprocess.run(
        ['pdftoppm', '-r', '200', '-f', str(page_num), '-l', str(page_num),
         '-png', pdf_path, os.path.join(work_dir, 'page')],
        check=True)
    png = find_rendered_png(work_dir, page_num)   # helper: locates <stem>-<padded>.png
    # 'stdout' as the output base makes tesseract print the text instead of writing a file
    out = subprocess.run(
        ['tesseract', png, 'stdout', '-l', lang, '--psm', '1'],
        check=True, capture_output=True, text=True)
    return out.stdout
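
The find_rendered_png helper is not shown in this post. One way it could look, as a sketch: it assumes the 'page' prefix used above and that pdftoppm may zero-pad the page number depending on the document's page count, so it matches on the numeric value rather than guessing the exact file name.

```python
import glob
import os
import re

def find_rendered_png(work_dir, page_num, stem='page'):
    """Find <stem>-<page>.png in work_dir regardless of zero-padding."""
    pattern = re.compile(rf'{re.escape(stem)}-(\d+)\.png')
    for path in glob.glob(os.path.join(work_dir, f'{stem}-*.png')):
        m = pattern.fullmatch(os.path.basename(path))
        # Compare as integers so page-3.png, page-03.png and page-003.png all match 3
        if m and int(m.group(1)) == page_num:
            return path
    raise FileNotFoundError(f'no rendered PNG for page {page_num} in {work_dir}')
```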

Run the pages in a thread pool and you OCR a whole document in the time the vision model spent
on a single page.
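
A minimal sketch of that thread pool, using only the standard library. The name ocr_pages and the max_workers default are illustrative choices, and the per-page function is passed in so the sketch does not depend on any particular OCR helper.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(pdf_path, page_nums, ocr_fn, max_workers=4):
    """OCR the given pages concurrently and return their text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even when pages finish out of order
        texts = pool.map(lambda n: ocr_fn(pdf_path, n), page_nums)
        return list(texts)
```

With the function from step 1 this would be called as ocr_pages("decreto.pdf", range(1, 11), lambda pdf, n: ocr_one_page(pdf, n, work_dir)). Threads are enough here because the real work happens in the pdftoppm and tesseract subprocesses, not in Python. On the figures quoted earlier (~0.3 s/page against ~30 s/page), Tesseract gets through roughly 100 pages per vision-model page even before any parallelism.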

Step 2: start Foundry Local and find its endpoint

Install it (it's about a 20 MB package) and start a model. Foundry Local downloads the variant
that best fits your hardware — CPU, GPU, or NPU:

# macOS
brew tap microsoft/foundrylocal
brew install foundrylocal

# Windows
winget install Microsoft.FoundryLocal

# start a text model (downloads on first run, then caches)
foundry model run phi-4-mini

The one wrinkle versus LM Studio: Foundry Local assigns a dynamic port each time the service
starts. You don't hardcode localhost:1234 — you ask for the endpoint:

foundry service status   # prints whether it's running + the local endpoint URL

That gives you a base URL like http://localhost:PORT/v1. It's an OpenAI-compatible
endpoint, and locally it needs no API key (Auth = None).
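
If you want a script to pick up the port instead of copying it by hand, one option is to scan the status text for a URL. This is a sketch under assumptions: base_url_from_status is a hypothetical helper, and the exact wording of the status output is not something to rely on, so the code only looks for the first http(s) URL and keeps its scheme, host, and port.

```python
import re
from urllib.parse import urlsplit

def base_url_from_status(status_text):
    """Return <scheme>://<host>:<port>/v1 for the first URL found in the text."""
    match = re.search(r"https?://[^\s'\"<>]+", status_text)
    if match is None:
        raise ValueError("no endpoint URL found in status output")
    # Drop sentence punctuation that may trail the URL, then discard any path
    parts = urlsplit(match.group(0).rstrip(".,;)"))
    return f"{parts.scheme}://{parts.netloc}/v1"
```

You would feed it the captured text of foundry service status (for example via subprocess.run with capture_output=True and text=True) and pass the result to the OpenAI client as base_url.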

Step 3: raw text → clean text with the local model

Because the endpoint speaks OpenAI, you call it with the stock OpenAI SDK — you just point
base_url at the local service:

from openai import OpenAI

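# endpoint from `foundry service status` (replace PORT); the key is ignored locally
client = OpenAI(base_url="http://localhost:PORT/v1", api_key="not-needed")

# ocr_pdf stands for a wrapper that runs the step 1 OCR on every page and joins the text
raw = ocr_pdf("decreto.pdf")   # noisy Tesseract output from step 1

resp = client.chat.completions.create(
    model="phi-4-mini",   # alias; if the server rejects it, use the full model id it reports
    messages=[
        # System prompt, in Spanish to match the documents: "You are an OCR proofreader.
        # Clean the text: rejoin words split by a hyphen, fix wrong line breaks, and keep
        # the headings (Articulo, Capitulo, Titulo). Do not invent content or add comments.
        # Return only the corrected text."
        {"role": "system", "content":
            "Eres un corrector de OCR. Limpia el texto: une palabras cortadas por guion, "
            "corrige saltos de linea erroneos y conserva los encabezados (Articulo, Capitulo, "
            "Titulo). No inventes contenido ni agregues comentarios. Devuelve solo el texto corregido."},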
        {"role": "user", "content": raw},
    ],
)
clean = resp.choices[0].message.content
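
The prompt asks the model not to invent content, but nothing enforces that. A cheap guard, sketched here as an assumption rather than part of the original pipeline, is to compare the cleaned text with the raw OCR after stripping whitespace and hyphens, since those are the only things the cleanup is supposed to change much. The function names and the 0.9 threshold are illustrative.

```python
import difflib
import re

def _normalize(text):
    # Remove whitespace and hyphens so rejoined words and reflowed lines compare equal
    return re.sub(r"[\s\-]+", "", text).lower()

def drift_ratio(raw, clean):
    """Similarity in [0, 1] between raw OCR and cleaned text, ignoring layout."""
    return difflib.SequenceMatcher(
        None, _normalize(raw), _normalize(clean), autojunk=False).ratio()

def looks_faithful(raw, clean, threshold=0.9):
    """True when the cleaned text stays close to what Tesseract actually read."""
    return drift_ratio(raw, clean) >= threshold
```

A page that fails the check can be kept as raw OCR or flagged for review. SequenceMatcher is quadratic in the worst case, so this is best run per page, not on a whole document at once.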

If you'd rather not chase the dynamic port yourself, the Foundry Local SDK resolves both the
endpoint and the exact model id for you, then hands you a ready chat client. In .NET:

using Microsoft.AI.Foundry.Local;

var config = new Configuration { AppName = "ocr_cleanup" };
await FoundryLocalManager.CreateAsync(config, Utils.GetAppLogger());
var mgr = FoundryLocalManager.Instance;

var catalog = await mgr.GetCatalogAsync();
var model   = await catalog.GetModelAsync("phi-4-mini")
              ?? throw new Exception("Model not found");

await model.DownloadAsync(p => { /* progress */ });
await model.LoadAsync();

var chat = await model.GetChatClientAsync();
var ct = CancellationToken.None;   // no cancellation in this sketch
var messages = new List<ChatMessage>
{
    new() { Role = "system", Content = "Eres un corrector de OCR... devuelve solo el texto corregido." },
    new() { Role = "user",   Content = rawOcrText },   // rawOcrText: Tesseract output from step 1
};
// Stream the reply and print each piece as it arrives
await foreach (var chunk in chat.CompleteChatStreamingAsync(messages, ct))
    Console.Write(chunk.Choices[0].Message.Content);

await model.UnloadAsync();

Now the local model earns its place: it stitches hyphenated words back together, fixes the
line breaks Tesseract scatters through dense legal text, and preserves the Artículo /
Capítulo headings that the downstream chunker keys on — all without a network round-trip and
without a token bill.
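
To show why those headings matter, here is a rough sketch of a chunker that splits on them. It is not the mcp.sv chunker: split_on_headings and the regular expression are assumptions, and the pattern accepts both accented and unaccented spellings because OCR output often drops the accents.

```python
import re

# A heading is a line that starts with Artículo, Capítulo, or Título (accents optional)
HEADING = re.compile(r"^(?:Art[ií]culo|Cap[ií]tulo|T[ií]tulo)\b",
                     re.IGNORECASE | re.MULTILINE)

def split_on_headings(clean_text):
    """Split text into chunks that each begin at a heading line."""
    starts = [m.start() for m in HEADING.finditer(clean_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)   # keep any preamble before the first heading
    bounds = starts + [len(clean_text)]
    chunks = [clean_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```

If a stray line break leaves "Artículo" at the start of a wrapped line mid-sentence, a splitter like this cuts in the wrong place, which is exactly the kind of damage the cleanup pass is there to repair.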

LM Studio vs Foundry Local — when I reach for which

| | LM Studio | Azure Foundry Local |
|---|---|---|
| Vision/image input | ✅ yes (Qwen-VL, etc.) | ❌ text-only catalog today |
| Best for | hardest scans, direct image→text | bulk OCR (Tesseract) + text cleanup/structuring |
| Endpoint | fixed `localhost:1234` | dynamic port (`foundry service status`) or SDK |
| Runtime | GGUF / llama.cpp-style | ONNX Runtime, picks CPU/GPU/NPU automatically |
| Model source | anything on Hugging Face | Microsoft-curated, hardware-optimized catalog |
| Shipping story | great for a dev box | SDK + Windows ML/NPU acceleration, nice for packaged apps |

Neither is "better" — they're different tools. LM Studio lets a vision model read the page
directly, which is unbeatable on the worst inputs. Foundry Local gives you a small, fast,
hardware-aware text model and an SDK that resolves the endpoint for you, which is exactly what
you want for the other 95% of pages once Tesseract has done the pixel work.

Takeaways

- Foundry Local's local catalog is text-only, so don't plan on feeding it an image. Use it
  for the language half of OCR, not the vision half.
- Split the problem: Tesseract (pdftoppm + --psm 1 for rotation) for pixels → text at
  ~0.3 s/page; a local Phi/Qwen model for cleanup, structuring, and summarization.
- Foundry Local assigns a dynamic port. Read it from foundry service status, or let the
  Foundry Local SDK resolve the endpoint and model id and stream the response for you.
- It's still an OpenAI-compatible endpoint with no key locally, so the OpenAI SDK works
  unchanged: same as the LM Studio path, same privacy and cost wins.

Two runtimes, one goal: read the documents nobody bothered to keep as real text, entirely on
your own machine. Thoughts or corrections? You can reach me through the links on the about page.