Reading Image-Only PDFs with a Local Model: Azure Foundry Local

In the previous post
I extracted text from scanned, image-only PDFs using a vision LLM in LM Studio — render
each page to a PNG, send it to a local OpenAI-compatible endpoint, get the transcription back.
It works, and on the hardest pages it's the best option. But a 35B vision model runs at roughly
30 seconds per page, and mcp.sv has thousands of pages.
This post does the same job with a different local runtime: Azure Foundry Local,
Microsoft's on-device inference engine. The motivation is identical — privacy, cost, and
offline repeatability — but Foundry Local pushes you toward a cleaner, faster split of labor.
First, an honest constraint
Foundry Local ships a curated, ONNX-optimized catalog of 25-plus open models: Phi-4-mini,
Phi-3.5-mini, the Qwen2.5 family, DeepSeek-R1 distills, Mistral, and friends. You can see them
with:
foundry model list
foundry model list --filter task=chat-completion
Notice the task types it filters on: chat-completion and text-generation. As of today,
Foundry Local's on-device catalog is text-only: there is no multimodal or vision model you can
start with foundry model run and hand an image directly. (Phi-4-multimodal and Phi-3.5-vision
live in the cloud Foundry catalog, not the local one.)
So unlike the LM Studio approach, I can't hand a page-image straight to the model. And
honestly? That nudge produced a better pipeline.
The design: separate the pixels from the reasoning
OCR is two jobs we tend to conflate:
- Pixels → characters. Turning an image of text into raw text. This is a specialized
  computer-vision problem, and a 35B LLM is wildly over-qualified (and slow) for it.
- Raw text → clean, structured text. OCR output is noisy: broken hyphenation, mangled
  headings, stray artifacts. This is a language problem, and a small local LLM is excellent
  at it.
Foundry Local can't do job #1, but it's perfect for job #2. So the pipeline becomes:
pdftoppm -r 200 -png render each page to an image (poppler)
↓
tesseract -l spa+eng --psm 1 image → raw text (Tesseract OCR)
↓
Foundry Local /v1/chat/completions clean + structure + summarize (Phi-4-mini / Qwen)
↓
chunk → embed → store
This is, in fact, the architecture mcp.sv settled on. The standalone ingest script says it
plainly:
The .NET pdf-ingest pipeline falls back to Qwen-vision OCR when pdftotext
yields <80 chars/page. That works but is glacially slow — ~30s/page on the
35B vision model, hanging on 10–28 MB docs. Tesseract is purpose-built for
this: ~0.3s/page on CPU, handles rotation via OSD, great Spanish quality.
Tesseract does the heavy lifting at ~0.3 s/page; the local LLM only sees text.
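The quoted comment implies a per-page gate: only pages whose embedded text layer is nearly empty need OCR at all. A minimal sketch of that gate follows, assuming the 80-character threshold applies to each page individually and that the input is the output of pdftotext, which ends each page with a form feed. The function names are hypothetical and are not taken from the mcp.sv code.

```python
def split_pdftotext_output(raw):
    """Split pdftotext output into pages (pdftotext ends each page with a form feed)."""
    pages = raw.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def pages_needing_ocr(page_texts, min_chars=80):
    """Return 1-based page numbers whose text layer is too short to trust.

    min_chars=80 mirrors the threshold quoted above; treating it as a
    per-page cutoff is an assumption.
    """
    return [
        page_num
        for page_num, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Pages that pass the gate can skip rasterization and Tesseract entirely, which matters on mixed documents where only a few pages are scans.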
Step 1: pixels → raw text with Tesseract
Same pdftoppm rasterization as before, then tesseract instead of a vision model. The
--psm 1 mode runs orientation/script detection (OSD), which auto-corrects rotated scans —
a real problem with government archives:
pdftoppm -r 200 -png -f 3 -l 3 decreto.pdf /tmp/page # → /tmp/page-3.png
tesseract /tmp/page-3.png stdout -l spa+eng --psm 1 # → raw text on stdout
In Python, each step is a single subprocess call:
import os
import subprocess

def ocr_one_page(pdf_path, page_num, work_dir, lang='spa+eng'):
    # Rasterize only this page at 200 DPI into <work_dir>/page-<padded>.png
    subprocess.run(
        ['pdftoppm', '-r', '200', '-f', str(page_num), '-l', str(page_num),
         '-png', pdf_path, os.path.join(work_dir, 'page')],
        check=True)
    # find_rendered_png is a helper that locates <stem>-<padded>.png
    png = find_rendered_png(work_dir, page_num)
    # OCR the image; --psm 1 is automatic page segmentation with OSD
    out = subprocess.run(
        ['tesseract', png, 'stdout', '-l', lang, '--psm', '1'],
        check=True, capture_output=True, text=True)
    return out.stdout
Run the pages in a thread pool and you OCR a whole document in the time the vision model spent
on a single page.
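Two pieces are referenced above but not shown: the helper that finds the PNG pdftoppm wrote, and the thread pool. The sketch below is one plausible shape for both, not the code from the original script. It assumes the zero-padding in the filename varies with the document's page count, so it matches on the numeric suffix instead of guessing the width. `ocr_pages` is a hypothetical name.

```python
import glob
import os
import re
from concurrent.futures import ThreadPoolExecutor


def find_rendered_png(work_dir, page_num, stem="page"):
    """Locate pdftoppm's output for one page, whatever zero-padding it used."""
    pattern = re.compile(rf"{re.escape(stem)}-(\d+)\.png")
    for path in glob.glob(os.path.join(work_dir, f"{stem}-*.png")):
        match = pattern.fullmatch(os.path.basename(path))
        if match and int(match.group(1)) == page_num:
            return path
    raise FileNotFoundError(f"no rendered PNG for page {page_num} in {work_dir}")


def ocr_pages(ocr_fn, page_nums, max_workers=4):
    """Call ocr_fn(page_num) across a thread pool; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_fn, page_nums))
```

A call might look like `ocr_pages(lambda n: ocr_one_page("decreto.pdf", n, work_dir), range(1, 11))`. Threads are sufficient here because the work happens in the pdftoppm and tesseract child processes, not in Python itself.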
Step 2: start Foundry Local and find its endpoint
Install it (it's about a 20 MB package) and start a model. Foundry Local downloads the variant
that best fits your hardware — CPU, GPU, or NPU:
# macOS
brew tap microsoft/foundrylocal
brew install foundrylocal
# Windows
winget install Microsoft.FoundryLocal
# start a text model (downloads on first run, then caches)
foundry model run phi-4-mini
The one wrinkle versus LM Studio: Foundry Local assigns a dynamic port each time the service
starts. You don't hardcode localhost:1234 — you ask for the endpoint:
foundry service status # prints whether it's running + the local endpoint URL
That gives you a base URL like http://localhost:PORT/v1. It's an OpenAI-compatible
endpoint, and locally it needs no API key (Auth = None).
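If a script needs the endpoint without a human reading the terminal, one option is to shell out to the same command and pull the URL from its output. The sketch below assumes only that the status text contains a local http URL with a port somewhere in it. The exact output format is not specified here and may change between releases, so the regex is deliberately loose. Both function names are hypothetical.

```python
import re
import subprocess


def extract_base_url(status_text):
    """Find the first local host:port URL in the text and return it with /v1 appended."""
    match = re.search(r"https?://(?:localhost|127\.0\.0\.1):\d+", status_text)
    if match is None:
        raise ValueError("no local endpoint found in status output")
    return match.group(0) + "/v1"


def foundry_base_url():
    """Run `foundry service status` and parse whatever endpoint URL it prints."""
    result = subprocess.run(
        ["foundry", "service", "status"],
        check=True, capture_output=True, text=True,
    )
    return extract_base_url(result.stdout)
```

The returned string can be passed directly as `base_url` to the OpenAI client. If the service is not running, `extract_base_url` raises instead of returning a stale port.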
Step 3: raw text → clean text with the local model
Because the endpoint speaks OpenAI, you call it with the stock OpenAI SDK — you just point
base_url at the local service:
from openai import OpenAI
# endpoint from `foundry service status`; key is ignored locally
client = OpenAI(base_url="http://localhost:PORT/v1", api_key="not-needed")
raw = ocr_pdf("decreto.pdf") # noisy Tesseract output from step 1
resp = client.chat.completions.create(
model="phi-4-mini",
messages=[
{"role": "system", "content":
"Eres un corrector de OCR. Limpia el texto: une palabras cortadas por guion, "
"corrige saltos de linea erroneos y conserva los encabezados (Articulo, Capitulo, "
"Titulo). No inventes contenido ni agregues comentarios. Devuelve solo el texto corregido."},
{"role": "user", "content": raw},
],
)
clean = resp.choices[0].message.content
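A whole document's worth of OCR text can be a lot for a small model to take in one request. One way to keep requests small is to clean page by page. This is a sketch under that assumption, not necessarily how the mcp.sv pipeline batches its calls. `clean_pages` is a hypothetical helper, and `temperature=0` is my own choice to make the correction pass less creative.

```python
def clean_pages(client, pages, system_prompt, model="phi-4-mini"):
    """Send each page's raw OCR text as its own chat request.

    client is an OpenAI-compatible client (e.g. openai.OpenAI pointed at the
    local endpoint). Blank pages are passed through without a model call.
    """
    cleaned = []
    for raw in pages:
        if not raw.strip():
            cleaned.append("")
            continue
        resp = client.chat.completions.create(
            model=model,
            temperature=0,  # assumption: deterministic output suits OCR correction
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": raw},
            ],
        )
        cleaned.append(resp.choices[0].message.content or "")
    return cleaned
```

Page-sized requests also make failures cheap to retry, at the cost of the model not seeing a sentence that continues across a page break.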
If you'd rather not chase the dynamic port yourself, the Foundry Local SDK resolves both the
endpoint and the exact model id for you, then hands you a ready chat client. In .NET:
using Microsoft.AI.Foundry.Local;

var config = new Configuration { AppName = "ocr_cleanup" };
await FoundryLocalManager.CreateAsync(config, Utils.GetAppLogger());
var mgr = FoundryLocalManager.Instance;

// Resolve the model from the local catalog, then download and load it
var catalog = await mgr.GetCatalogAsync();
var model = await catalog.GetModelAsync("phi-4-mini")
    ?? throw new Exception("Model not found");
await model.DownloadAsync(p => { /* progress */ });
await model.LoadAsync();

var chat = await model.GetChatClientAsync();
var messages = new List<ChatMessage>
{
    new() { Role = "system", Content = "Eres un corrector de OCR... devuelve solo el texto corregido." },
    new() { Role = "user", Content = rawOcrText },  // rawOcrText: Tesseract output from step 1
};

// ct is a CancellationToken declared elsewhere
await foreach (var chunk in chat.CompleteChatStreamingAsync(messages, ct))
    Console.Write(chunk.Choices[0].Message.Content);

await model.UnloadAsync();
Now the local model earns its place: it stitches hyphenated words back together, fixes the
line breaks Tesseract scatters through dense legal text, and preserves the Artículo /
Capítulo headings that the downstream chunker keys on — all without a network round-trip and
without a token bill.
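To show why preserved headings matter downstream, here is a minimal sketch of a chunker that splits cleaned text wherever a line begins with Título, Capítulo, or Artículo. It is an illustration only: the real mcp.sv chunker is not shown in this post, and both the regex and the name `split_on_headings` are assumptions.

```python
import re

# Matches a heading keyword at the start of a line, with or without the accent.
HEADING = re.compile(
    r"^(?:T[ií]tulo|Cap[ií]tulo|Art[ií]culo)\b",
    re.IGNORECASE | re.MULTILINE,
)


def split_on_headings(text):
    """Split cleaned text into chunks that each start at a legal heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```

A splitter this simple will also cut at a wrapped line that happens to begin with "Artículo" mid-sentence, which is one more reason to have the model repair stray line breaks before chunking.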
LM Studio vs Foundry Local — when I reach for which
| | LM Studio | Azure Foundry Local |
|---|---|---|
| Vision/image input | ✅ yes (Qwen-VL, etc.) | ❌ text-only catalog today |
| Best for | hardest scans, direct image→text | bulk OCR (Tesseract) + text cleanup/structuring |
| Endpoint | fixed localhost:1234 | dynamic port (foundry service status) or SDK |
| Runtime | GGUF / llama.cpp-style | ONNX Runtime, picks CPU/GPU/NPU automatically |
| Model source | anything on Hugging Face | Microsoft-curated, hardware-optimized catalog |
| Shipping story | great for a dev box | SDK + Windows ML/NPU acceleration, nice for packaged apps |
Neither is "better" — they're different tools. LM Studio lets a vision model read the page
directly, which is unbeatable on the worst inputs. Foundry Local gives you a small, fast,
hardware-aware text model and an SDK that resolves the endpoint for you, which is exactly what
you want for the other 95% of pages once Tesseract has done the pixel work.
Takeaways
- Foundry Local's local catalog is text-only, so don't plan on feeding it an image. Use it
  for the language half of OCR, not the vision half.
- Split the problem: Tesseract (pdftoppm + --psm 1 for rotation) for pixels → text at
  ~0.3 s/page; a local Phi/Qwen model for cleanup, structuring, and summarization.
- Foundry Local assigns a dynamic port. Read it from foundry service status, or let the
  Foundry Local SDK resolve the endpoint and model id and stream the response for you.
- It's still an OpenAI-compatible endpoint with no key locally, so the OpenAI SDK works
  unchanged: same as the LM Studio path, same privacy and cost wins.
Two runtimes, one goal: read the documents nobody bothered to keep as real text — entirely on
your own machine. Thoughts or corrections? I'm on the links on the about page.