Jun 2, 2026 · 4 min read

Local AI Studio — Part 3: FLUX, SDXL, and the fp8-vs-GGUF Myth

This is Part 3, and it's the one where I was wrong in public. Those are usually the most
useful posts to write.

In Part 2 we turned generation into a function. Now let's point it at two very different
image models, SDXL and FLUX.1-dev, and actually time them on the Mac.

All numbers below are at 1024×1024 on the M1 Max, warm (model already loaded, so we
measure compute, not disk). Cold runs add model-load time on top.
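
If you want to reproduce the warm/cold split yourself, something like the sketch below is one way to do it. It is not the exact harness behind the numbers in this post: `generate` is a hypothetical zero-argument callable that produces one image (for example, a wrapper around the Part 2 function), and averaging a few warm runs is my assumption about what "warm" should mean.

```python
import time
from statistics import mean
from typing import Callable, List


def time_warm(generate: Callable[[], object], warm_runs: int = 3) -> dict:
    """Time one cold call, then average `warm_runs` further calls.

    `generate` is a hypothetical zero-argument callable that produces one image.
    """
    if warm_runs < 1:
        raise ValueError("warm_runs must be at least 1")

    # The first call pays for loading the model from disk.
    start = time.perf_counter()
    generate()
    cold = time.perf_counter() - start

    # Later calls reuse the loaded model, so they measure compute only.
    warm: List[float] = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        generate()
        warm.append(time.perf_counter() - start)

    return {"cold_s": cold, "warm_s": mean(warm), "warm_runs": warm}
```

Calling `time_warm(my_generate)` returns both figures, so the model-load cost shows up as the difference between `cold_s` and `warm_s`.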

SDXL: the fast workhorse

SDXL, 1024px, 25 steps, dpmpp_2m / karras  →  ~60s warm

About a minute for a finished 1-megapixel image. Drop to 15–20 steps or 768px and you're
near 30 seconds. For iterating on ideas — and for the decorative blog art in this series —
SDXL is exactly right: quick, predictable, and the older architecture is very forgiving.

FLUX: gorgeous, and slow

FLUX.1-dev is the current quality leader for open local
images, and it shows. But it's a 12-billion-parameter model, and on the M1 Max that weight
is felt:

FLUX.1-dev (fp8), 1024px, 20 steps  →  ~313s warm   (≈ 16 s/step)

Roughly five minutes per image. Beautiful results — see the jaguar in this series'
materials — but not something you iterate on casually.

The "obvious" fix that wasn't

Here is the received wisdom you'll read everywhere about FLUX on a Mac:

"fp8 has no hardware acceleration on Apple Silicon, so it gets emulated and runs slow.
Switch to a GGUF quantization and it'll speed up."

That's a very plausible story. fp8 really isn't natively accelerated on the M-series GPU,
so the reasoning sounds airtight. I believed it. So I did the work: installed the
ComfyUI-GGUF custom node, downloaded the
Q8_0 quant of FLUX.1-dev (~12 GB), built a parallel workflow with UnetLoaderGGUF, and
benchmarked it head-to-head against fp8.
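
For the curious, the "parallel workflow" boils down to swapping one loader node. The sketch below shows the idea on a ComfyUI API-format workflow (a dict of node id to `class_type` and `inputs`). Treat the details as assumptions: I'm assuming the fp8 graph loads the model with the stock `UNETLoader` node, that `UnetLoaderGGUF` takes a single `unet_name` input, and both file names are made up for illustration.

```python
import copy


def swap_to_gguf(workflow: dict, gguf_name: str) -> dict:
    """Return a copy of an API-format workflow that loads the UNet from a GGUF file."""
    patched = copy.deepcopy(workflow)
    swapped = 0
    for node in patched.values():
        # Assumption: the fp8 graph uses the stock UNETLoader node.
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Assumption: the GGUF loader only needs the file name.
            node["inputs"] = {"unet_name": gguf_name}
            swapped += 1
    if swapped == 0:
        raise ValueError("no UNETLoader node found in workflow")
    return patched


# Heavily trimmed example graph; a real one has many more nodes and inputs.
fp8_workflow = {
    "1": {
        "class_type": "UNETLoader",
        "inputs": {
            "unet_name": "flux1-dev-fp8.safetensors",  # hypothetical file name
            "weight_dtype": "fp8_e4m3fn",
        },
    },
    "2": {"class_type": "KSampler", "inputs": {"model": ["1", 0], "steps": 20}},
}

gguf_workflow = swap_to_gguf(fp8_workflow, "flux1-dev-Q8_0.gguf")  # hypothetical file name
```

Because only the loader changes, the link from the sampler to node `"1"` stays valid and everything downstream is identical, which is what makes the two timings comparable.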

FLUX.1-dev fp8     1024px / 20 steps  →  313s warm
FLUX.1-dev GGUF-Q8 1024px / 20 steps  →  333s warm

They're effectively tied. The GGUF version was, if anything, a hair slower: 20 seconds, or
about 6%. Image quality was visually identical (Q8 is essentially lossless), but the speed
I went chasing simply wasn't there.
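
To put numbers on "tied", here is the arithmetic on the two timings above. Dividing the total by the step count is a simplification: it folds text encoding and VAE decode into the per-step figure.

```python
STEPS = 20
warm_seconds = {"fp8": 313.0, "gguf_q8": 333.0}  # timings from the benchmark above

# Average wall-clock time per denoising step for each format.
per_step = {name: total / STEPS for name, total in warm_seconds.items()}

# How much slower (positive) or faster (negative) GGUF-Q8 was relative to fp8.
relative_gap = (warm_seconds["gguf_q8"] - warm_seconds["fp8"]) / warm_seconds["fp8"]

for name, value in per_step.items():
    print(f"{name}: {value:.2f} s/step")
print(f"GGUF-Q8 vs fp8: {relative_gap:+.1%}")
```

That works out to 15.65 and 16.65 seconds per step, a gap of about 6.4% in the wrong direction for the myth.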

Why the myth fails here

The fp8-is-slow story assumes the bottleneck is the numeric format. On this machine it
isn't: the bottleneck is raw GPU compute. A 12B-parameter model doing 20 denoising
steps at a megapixel is just a lot of math, and the M1 Max's GPU works through it at roughly
16 seconds per step regardless of how the weights are stored. fp8 and GGUF-Q8 land in the
same place, and full fp16 should too, because none of them reduces the amount of arithmetic.
GGUF buys you memory savings, not speed. On a 64 GB machine, I didn't need the memory.
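
A quick back-of-the-envelope sketch of what the format does change: the size of the weights in memory. It assumes the 12B parameter count quoted earlier and counts only the main model's weights, not the text encoders, the VAE, or activations. The Q8_0 figure uses the GGUF block layout of 32 int8 weights plus one fp16 scale per block.

```python
PARAMS = 12e9  # FLUX.1-dev parameter count quoted in this post

BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "fp8": 8.0,
    # Q8_0: 32 one-byte weights + one 2-byte scale = 34 bytes per 32 weights.
    "gguf_q8_0": 34 * 8 / 32,
}


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


sizes = {name: weights_gb(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```

So fp16 needs about 24 GB, while fp8 and Q8_0 both sit near 12 GB. The storage changes; the number of multiply-adds per step does not, which fits the flat timings.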

The lesson isn't "GGUF is bad." It's: measure on your own hardware before you believe a speedup. The advice was probably true on the NVIDIA card it was written for.

So how do you make FLUX fast on a Mac?

The levers that actually move the needle attack the compute, not the format:

  • Fewer steps. This is the big one. FLUX.1-schnell is a step-distilled model that
    produces good images in ~4 steps instead of 20. That is 5× fewer steps and roughly a
    4× speedup overall, landing FLUX near 80 seconds.
  • Lower resolution. Cost scales with pixel count, which grows with the square of the
    side length: 768px has about 44% fewer pixels than 1024px.
  • Just use SDXL for anything where you're iterating, and save FLUX for finals.
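
The first two levers can be sketched as a rough cost model. This is an estimate under loud assumptions, not a benchmark: it takes the ~16 s/step measured at 1024px, assumes cost scales linearly with pixel count (attention layers likely scale worse, so lower resolutions may save a bit more), and assumes FLUX.1-schnell costs the same per step as FLUX.1-dev. `overhead_s` is a hypothetical knob for fixed per-image costs.

```python
def estimate_seconds(
    steps: int,
    side_px: int,
    s_per_step_1024: float = 16.0,  # measured FLUX.1-dev figure from this post
    overhead_s: float = 0.0,        # hypothetical fixed cost per image
) -> float:
    """Rough warm-run estimate for a square image of side `side_px`."""
    # Assumption: work per step is proportional to the number of pixels.
    pixel_ratio = (side_px / 1024) ** 2
    return steps * s_per_step_1024 * pixel_ratio + overhead_s


print(estimate_seconds(20, 1024))  # FLUX.1-dev baseline
print(estimate_seconds(4, 1024))   # 4 steps, schnell-style
print(estimate_seconds(20, 768))   # same steps, smaller canvas
```

The model gives 320 s for the baseline, 64 s for 4 steps, and 180 s at 768px. The 4-step estimate comes in under the ~80 s quoted above, so treat the model as a floor rather than a prediction.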

My actual workflow

After all this, here's how I split the two in practice:

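| Need                                     | Model          | Why                               |
|------------------------------------------|----------------|-----------------------------------|
| Fast iteration, drafts, decorative art   | SDXL           | ~60s, forgiving, good enough      |
| Final hero image, photoreal, fine detail | FLUX           | best quality, worth the 5 minutes |
| Fast and high quality                    | FLUX.1-schnell | the compromise, ~80s              |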
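
If you drive generation from a script, the same split can be written as a tiny chooser. This is only a sketch of the table above: the function, its two flags, and the `Choice` type are hypothetical, and the timings are the approximate warm figures from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    model: str
    approx_seconds: int  # approximate warm time per 1024px image


def pick_model(final: bool, need_speed: bool) -> Choice:
    """Map the table's three rows onto two yes/no questions."""
    if not final:
        # Drafts and decorative art: fast and forgiving.
        return Choice("SDXL", 60)
    if need_speed:
        # Final-quality image, but no time for a five-minute render.
        return Choice("FLUX.1-schnell", 80)
    # Hero image where quality is worth the wait.
    return Choice("FLUX.1-dev", 313)


print(pick_model(final=False, need_speed=True))
print(pick_model(final=True, need_speed=False))
```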

The featured images in this series are all SDXL. The house look I'm going for — flat,
outlined, Cubist folk-art with a saturated Central American palette — is decorative rather
than photoreal, and SDXL renders it beautifully. Just as importantly, the ~60-second turnaround
lets me regenerate until the composition feels right, which matters a lot more for art
direction than the last few percent of fidelity would.

In Part 4 we leave still
images behind and ask the bigger question: can this Mac generate video? (Yes — with one
MPS gotcha that produces pure rainbow garbage until you fix it.)