Microsoft VibeVoice: Frontier Open-Source Voice AI, and How to Run It Locally

If you've heard an AI-generated "podcast" that actually sounded like two people talking — natural
turn-taking, banter, the occasional bit of background music swelling at the right moment — there's
a decent chance it came from VibeVoice, Microsoft's
open-source voice-AI family. It's genuinely impressive technology, and it's also a useful object
lesson in what happens when a frontier speech model meets the open internet. Let me cover both: what
it is, and how to actually run it on your own machine today.

What VibeVoice is

VibeVoice isn't one model — it's a family of open-source frontier voice-AI models spanning both
text-to-speech and speech recognition:

- VibeVoice-TTS (1.5B): long-form, multi-speaker TTS. Synthesizes up to 90 minutes of
  speech in a single pass with up to 4 distinct speakers, English + Chinese (and emergent
  cross-lingual). This is the podcast-generator everyone shared. Accepted as an Oral at ICLR 2026.
- VibeVoice-Realtime (0.5B), a.k.a. VibeVoice-Streaming: a lightweight real-time TTS with
  streaming text input, ~200 ms first-audio latency, and robust long-form generation (~10 minutes,
  8k-token context). Single-speaker, English-focused.
- VibeVoice-ASR (7B): the flip side, speech-to-text. One pass over 60 minutes of audio,
  producing structured who / when / what transcripts across 50+ languages. It's now shipped in
  Hugging Face transformers.

The clever bit: 7.5 Hz tokens + next-token diffusion

The reason VibeVoice can hold a 90-minute conversation together is its continuous speech
tokenizers (acoustic + semantic) running at an ultra-low frame rate of 7.5 Hz. Most neural
audio codecs tokenize at dozens or hundreds of frames per second; at 7.5 Hz the sequence is short
enough that a language model can reason over an hour of audio without drowning in tokens. On top
of that sits a next-token diffusion framework: a Large Language Model (Qwen2.5 is the base)
understands the text and dialogue flow, and a diffusion head generates the high-fidelity acoustic
detail. The real-time 0.5B variant drops the semantic tokenizer and uses an interleaved, windowed
design — encoding incoming text chunks while it keeps generating audio from prior context.
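
To make the frame-rate argument concrete, here is a back-of-the-envelope sketch of the sequence lengths involved. The 7.5 Hz rate, the 90- and 10-minute durations and the 8k context are the figures quoted above; the 50 Hz comparison rate is my own illustrative assumption, and I'm treating one frame as one sequence position, which is a simplification.

```python
# Rough sequence-length arithmetic for different tokenizer frame rates.
# Assumption: one acoustic frame occupies one position in the LLM's sequence.

def frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio at `frame_rate_hz`."""
    return round(minutes * 60 * frame_rate_hz)

VIBEVOICE_HZ = 7.5     # rate quoted for the VibeVoice tokenizers
COMPARISON_HZ = 50.0   # assumed higher-rate codec, for contrast only

podcast_frames = frames(90, VIBEVOICE_HZ)      # the 90-minute podcast case
podcast_at_50hz = frames(90, COMPARISON_HZ)    # same audio at the assumed rate
realtime_frames = frames(10, VIBEVOICE_HZ)     # the ~10-minute realtime case

print(f"90 min at 7.5 Hz: {podcast_frames} frames")
print(f"90 min at 50 Hz:  {podcast_at_50hz} frames")
print(f"10 min at 7.5 Hz: {realtime_frames} frames")
```

Under those assumptions, 90 minutes comes to 40,500 frames instead of 270,000, and ten minutes of realtime output is about 4,500 frames, which leaves room for the text inside an 8k-token context.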

That low-frame-rate-tokenizer-plus-diffusion-head recipe is the whole trick, and it's why the output
is both long and natural.
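
If the control flow is hard to picture, the toy below may help. It is emphatically not VibeVoice's code: both functions are hypothetical stand-ins I made up, the "latent" is a single number rather than a vector, and the "denoising" is a fixed update rather than a learned network. It only shows the shape of next-token diffusion: a sequence model produces a condition for each step, a small iterative head turns noise into the next continuous latent, and that latent feeds back into the context.

```python
import random

def hidden_state(text_ids, prev_latents):
    """Hypothetical stand-in for the LLM: fold text and prior latents into one value in [0, 1)."""
    return (sum(text_ids) + sum(prev_latents)) % 7 / 7.0

def diffusion_head(cond, rng, steps=10):
    """Hypothetical stand-in for the diffusion head.

    Starts from Gaussian noise and takes `steps` refinement steps toward a
    target determined by `cond`. A real head would be a trained denoiser.
    """
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        x += 0.5 * (cond - x)  # each step halves the remaining distance to the target
    return x

def generate(text_ids, n_frames, seed=0):
    """Autoregressive loop: one continuous latent per frame, each conditioned on the past."""
    rng = random.Random(seed)
    latents = []
    for _ in range(n_frames):
        cond = hidden_state(text_ids, latents)      # "LLM" reads text plus audio so far
        latents.append(diffusion_head(cond, rng))   # "head" fills in the next latent
    return latents

print(generate([1, 2, 3], n_frames=5))
```

The point of the split is visible even in the toy: the outer loop runs once per frame (7.5 times per second of audio in VibeVoice's case), while the expensive iterative refinement stays inside a small head.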

The part you have to talk about: it got pulled

Here's the honest history, because it directly affects what you can run:

2025-09-05 — "After release, we discovered instances where the tool was used in ways
inconsistent with the stated intent… we have removed the VibeVoice-TTS code from this repository."

High-quality, controllable, multi-speaker speech is exactly the thing bad actors want for
impersonation, fraud, and disinformation. Microsoft disabled the hosted TTS demo and removed the
official TTS install/usage code. To this day the TTS doc simply reads "Installation and Usage:
Disabled due to widespread misuse." The model weights remain on Hugging Face and community forks
carry the inference code, but the official, blessed path for the 1.5B podcast model is gone.

Microsoft then re-focused the public repo on lower-risk pieces: the Realtime-0.5B model (Dec
2025, with embedded/locked voice prompts specifically to mitigate deepfake risk) and the ASR
model (Jan 2026). The whole project is stamped research-and-development only — "We do not
recommend using VibeVoice in commercial or real-world applications without further testing." Treat
it accordingly, and disclose AI-generated audio when you share it.

How to run it locally

So what can you actually run? The clean, officially supported local path today is VibeVoice-Realtime-0.5B.
Good news for those of us on Apple hardware: Microsoft tested real-time performance on NVIDIA T4
and Mac M4 Pro — it runs on Apple Silicon, not just CUDA.

Realtime-0.5B (the supported path)

Clone the repo and install the streaming extra (the model is only 0.5B, so it's friendly to modest
hardware):

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
# quotes keep zsh (the default macOS shell) from treating the brackets as a glob pattern
pip install -e ".[streamingtts]"

On NVIDIA, Microsoft recommends NVIDIA's PyTorch container (nvcr.io/nvidia/pytorch:24.07-py3 or
later) and optionally flash-attn. On a Mac you can install into a normal Python 3.10+ venv, with no
Docker required.
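
Before downloading weights, it can be worth checking which accelerator PyTorch actually sees in your environment. The snippet below is a generic PyTorch check, not something taken from the VibeVoice repo; the function names are my own, and it falls back to "cpu" if torch isn't installed yet.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (MPS), then plain CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

def detect_device() -> str:
    """Ask PyTorch what is available; report 'cpu' if torch is not importable."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())

print(detect_device())
```

On an Apple Silicon Mac with a recent PyTorch build I'd expect this to print mps, and cuda inside the NVIDIA container; if it prints cpu on either, the install is worth a second look before blaming the model for being slow.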

Generate speech from a text file (weights download from Hugging Face on first run):

python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt \
  --speaker_name Carter

Or run the real-time websocket demo, where the model starts speaking as the text streams in, with roughly 200 ms to first audio:

python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B
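
If you have a folder of scripts to voice, the file-based command shown earlier is easy to wrap. This is a minimal sketch that only reuses the script path and the three flags from that command; the helper names and the dry-run default are my own choices, and it assumes you run it from the repo root.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "demo/realtime_model_inference_from_file.py"
MODEL = "microsoft/VibeVoice-Realtime-0.5B"

def build_cmd(txt_path, speaker_name="Carter"):
    """Build the argv list for one run of the file-based demo script."""
    return [
        sys.executable, SCRIPT,
        "--model_path", MODEL,
        "--txt_path", str(txt_path),
        "--speaker_name", speaker_name,
    ]

def synthesize_all(txt_dir, speaker_name="Carter", dry_run=True):
    """Run (or, with dry_run, just print) the demo once per .txt file in txt_dir."""
    cmds = [build_cmd(p, speaker_name) for p in sorted(Path(txt_dir).glob("*.txt"))]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if the script exits non-zero
    return cmds

# Show the command for the bundled example without executing it.
print(" ".join(build_cmd("demo/text_examples/1p_vibevoice.txt")))
```

Each file is a separate process here, so the model reloads every time; that's fine for a handful of files and keeps the sketch independent of the repo's internals.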

Want more voices? There's an optional pack of experimental multilingual speakers (German, French,
Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish) plus extra English styles:

bash demo/download_experimental_voices.sh

That's the whole local setup. If you just want to hear it before installing anything, Microsoft ships
a Colab notebook.

The 1.5B podcast model (use with eyes open)

If you specifically want the 90-minute, 4-speaker podcast generator, the official install is
disabled, but the microsoft/VibeVoice-1.5B weights are still public and community forks
(e.g. vibevoice-community/VibeVoice) preserve the inference code. In such a fork, a run looks
roughly like:

# community fork — research use only
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path demo/text_examples/2p_music.txt

The transcript format is just speaker-labelled lines (Speaker 1: … / Speaker 2: …), which is how
you drive the multi-speaker turn-taking. Given why the official path was pulled, only do this for
genuine research/experimentation, keep it off the public internet, and disclose any AI audio you
produce.
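
Since the input is plain speaker-labelled text, generating it from structured data is straightforward. Here's a small sketch; the function name is mine, the 4-speaker cap comes from the model description above, and putting each turn on a single line is my assumption about what the parser expects.

```python
def format_transcript(turns):
    """Render (speaker_number, text) pairs as 'Speaker N: text' lines.

    Assumes one turn per line, so internal newlines and runs of whitespace
    are collapsed. Empty turns are skipped.
    """
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= 4:
            raise ValueError(f"speaker must be 1-4, got {speaker}")
        text = " ".join(text.split())
        if text:
            lines.append(f"Speaker {speaker}: {text}")
    return "\n".join(lines)

script = format_transcript([
    (1, "Welcome back to the show."),
    (2, "Thanks for having me.\nIt's good to be here."),
])
print(script)
```

Write the result to a .txt file and pass it via --txt_path.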

Quirks worth knowing

VibeVoice has personality, and the FAQ is refreshingly candid about it:

- Spontaneous background music. They deliberately didn't denoise the training data, so the
  model is content-aware and will sometimes add BGM on its own, especially after intro words like
  "Welcome to…" or if the voice prompt itself has music. They call it "a little easter egg."
- No text normalization. It doesn't expand numbers or abbreviations for you, and the realtime
  model explicitly can't read code, formulas, or odd symbols, so pre-process those out.
- Emergent singing. There's no music data in the training set, yet it can (roughly, often
  off-key) sing: an emergent ability, more pronounced in the larger model.
- Realtime limits. Realtime is English-only and single-speaker; the nine extra languages in the
  experimental voice pack are untested exploration. And very short inputs (≤3 words) destabilize it.
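
Two of those quirks (no text normalization, instability on very short inputs) are easy to guard against before text ever reaches the model. The sketch below is one way to do it, not anything from the repo: the abbreviation table is a tiny illustrative sample, the character whitelist is my own guess at "safe" punctuation, and digits are left alone, so spell out numbers yourself if they matter.

```python
import re

# Illustrative sample only; extend for your own material.
ABBREVIATIONS = {
    "e.g.": "for example",
    "i.e.": "that is",
    "etc.": "et cetera",
    "vs.": "versus",
}
MIN_WORDS = 4  # inputs of 3 words or fewer are the reported trouble zone

def prepare_text(text: str) -> str:
    """Strip inline code and odd symbols, expand a few abbreviations, tidy whitespace."""
    text = re.sub(r"`[^`]*`", " ", text)            # drop `inline code` spans
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = re.sub(r"[^\w\s.,!?;:'-]", " ", text)    # keep words, digits, basic punctuation
    return " ".join(text.split())

def is_too_short(text: str) -> bool:
    """True when the text has fewer than MIN_WORDS whitespace-separated words."""
    return len(text.split()) < MIN_WORDS

line = prepare_text("Run `pip install x` now, e.g. today!")
print(line, "| too short:", is_too_short(line))
```

For short fragments, the simplest fix I know of is to merge them into the neighbouring sentence rather than synthesizing them on their own.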

Why I find it interesting

Two reasons. First, the architecture — the 7.5 Hz continuous tokenizer is the kind of idea that
makes a hard problem (hold an hour of coherent multi-speaker audio together) tractable, and "LLM for
structure + diffusion head for detail" is a pattern worth stealing in other modalities. Second, it's
a clean example of the local-AI thesis I keep coming back to on this blog: a 0.5B model that does
real-time speech on an ordinary laptop, Macs included, fully offline, is genuinely useful, and it pairs naturally
with the local image/music generation I've written about. A voice for your locally-generated reel,
generated locally too.

Just remember which model you're holding and why the big one got pulled. Synthetic voice is a power
tool; use it like one — research-first, disclosed, and never to impersonate.

Tried VibeVoice on your own hardware, or got it running on Apple Silicon? Tell me how it went via the
links on the about page.