Speech-to-Text in React Native: Four Ways to Reach Android Parity

My React Native app already transcribes voice on iOS. That part was easy — Apple ships SFSpeechRecognizer, you ask for permission, you get text back. Done. Then I went to do the same thing on Android and hit the thing every cross-platform developer eventually hits: the easy path exists on both sides, but it's not the same easy path, and the tradeoffs don't line up.

This is a decision post, not a tutorial. I did the homework on which route reaches Android parity with the least pain, and I want to save you the afternoon.

First, a correction to my own assumption

My starting reflex was "Android doesn't have a drop-in like iOS does." That's wrong, and it's worth correcting because it changes the whole shape of the problem.

Android does have a speech API, android.speech.SpeechRecognizer. It has just been historically awkward: dependent on the Google app, online-only on older versions, and inconsistent across OEMs. The good news is that since Android 12 (API level 31) there's SpeechRecognizer.createOnDeviceSpeechRecognizer() for genuine on-device recognition. So the real question was never "does an API exist." It's "which route gives me the consistency and offline behavior I actually want?" That's a much better question, and it has four answers.

Option 1 — Wrap each platform's own recognizer

The least-work route is @react-native-voice/voice. It bridges SFSpeechRecognizer on iOS and SpeechRecognizer on Android behind a single JS API. Since iOS already runs on the built-in recognizer, this is the natural continuation of what I have.
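To make "a single JS API" concrete, here is a minimal sketch of what the wrapper code can look like. The VoiceLike interface is my own stand-in that mirrors the small subset of @react-native-voice/voice this sketch touches (the onSpeechResults and onSpeechError handlers, start, and stop); treat the exact shapes as an assumption and check them against the version you install. startDictation is a hypothetical helper, not something the library ships.

```typescript
// Assumed minimal shape of the recognizer. It mirrors the subset of
// @react-native-voice/voice used below; verify against your installed version.
interface VoiceLike {
  onSpeechResults?: (e: { value?: string[] }) => void;
  onSpeechError?: (e: { error?: { code?: string; message?: string } }) => void;
  start(locale: string): Promise<unknown>;
  stop(): Promise<unknown>;
}

// Hypothetical helper: wire up handlers, start listening, and hand back
// a function that stops the session.
function startDictation(
  voice: VoiceLike,
  locale: string,
  onText: (text: string) => void,
  onError: (message: string) => void,
): () => Promise<unknown> {
  voice.onSpeechResults = (e) => {
    // Recognizers return a ranked list of hypotheses; forward the top one.
    const best = e.value?.[0];
    if (best) onText(best);
  };
  voice.onSpeechError = (e) =>
    onError(e.error?.message ?? "speech recognition failed");
  // start() rejects if the recognizer cannot be started (e.g. no permission).
  voice.start(locale).catch((err: unknown) => onError(String(err)));
  return () => voice.stop();
}
```

In an app you would pass the library's default export as the first argument, something like startDictation(Voice, "en-US", setText, console.warn). How often the results handler fires, and whether a result is final, is exactly the kind of thing that can differ between the two platforms, so the sketch deliberately forwards every result rather than assuming one.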

Pros: minimal code, nothing to bundle, app stays small.
Cons: Android behavior varies by device and OEM; offline depends on whether the user happens to have the language pack installed; you inherit whatever Google decides to ship.

It's great for dictation-style input where a wobble here and there is fine. It's shakier as a core product feature, because "works on my Pixel" is not "works on the cheap Android in someone's pocket in San Salvador."

Option 2 — Ship your own model (the parity play)

If I want Android to behave exactly like iOS, the answer is to stop depending on either platform's recognizer and ship my own: whisper.rn, which binds whisper.cpp for React Native. It runs fully on-device on both platforms, so the behavior is identical — no OEM fragmentation, no "is the language pack there," no surprises. You bundle or download a GGML Whisper model (tiny/base for speed, small for accuracy).

Pros: offline, private, consistent everywhere, full control.
Cons: app size grows with the model weights; inference is heavier on the CPU.

It's perfect for transcribe-a-clip and workable for streaming if you buffer correctly.
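What "buffering" means for streaming deserves a picture. A rough sketch, assuming audio arrives as chunks of mono 16 kHz Float32 samples: collect the chunks, cut them into fixed-size windows, and let consecutive windows overlap a little so a word sliced at a boundary is heard whole in the next window. createWindowBuffer is a hypothetical helper of mine, not part of whisper.rn, and the window and overlap sizes are yours to tune.

```typescript
// Hypothetical helper, not a whisper.rn API: turns a stream of PCM chunks
// into fixed-size, slightly overlapping windows.
function createWindowBuffer(windowSamples: number, overlapSamples: number) {
  if (overlapSamples < 0 || overlapSamples >= windowSamples) {
    throw new Error("overlap must be >= 0 and smaller than the window");
  }
  let pending: number[] = [];

  return {
    // Add a chunk; return every complete window that is now available.
    push(chunk: Float32Array): Float32Array[] {
      for (let i = 0; i < chunk.length; i++) pending.push(chunk[i]);
      const windows: Float32Array[] = [];
      while (pending.length >= windowSamples) {
        windows.push(Float32Array.from(pending.slice(0, windowSamples)));
        // Keep the tail of this window as the head of the next one.
        pending = pending.slice(windowSamples - overlapSamples);
      }
      return windows;
    },
    // Return what is left when recording stops. It may be shorter than a
    // window and begins with any overlap that was already emitted.
    flush(): Float32Array {
      const rest = Float32Array.from(pending);
      pending = [];
      return rest;
    },
  };
}
```

At 16 kHz, createWindowBuffer(16000 * 5, 16000) would give five-second windows with one second of overlap. Each window then goes to the recognizer, and because of the overlap you will need to de-duplicate repeated words where consecutive transcripts meet. That stitching is the part that takes care.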

This is the route I lean toward, because the whole reason I'm doing cross-platform work is so I don't maintain two different behaviors. Shipping one model is how you actually get one behavior. (It's the same instinct behind running Whisper on a CPU VPS — own the model, own the outcome.)

Option 3 — Vosk, the lighter offline cousin

react-native-vosk is the lighter offline alternative — smaller and friendlier to Android than Whisper. The catch is accuracy: it's a step below Whisper for most languages. If offline matters but you can trade a little precision for a smaller, snappier footprint, it's a real option.

Option 4 — Cloud STT

Deepgram, AssemblyAI, Google Cloud Speech — best-in-class accuracy, zero on-device compute. But it sends audio off the device (privacy), needs a network, and costs per use. And here's the specific trap for my situation: iOS already runs on-device, so using cloud only for Android means two different behaviors and two codepaths to maintain. That's the exact fragmentation I'm trying to kill, reintroduced on purpose.

The two questions that actually decide it

Forget the feature-by-feature comparison for a second. When I stopped comparing libraries and started interrogating my own requirements, the whole thing collapsed to two questions:

Is offline required? Yes → Whisper or Vosk. No → the native recognizer or cloud.
Live streaming or transcribe-after-recording? Captions-as-you-speak → native recognizers or a carefully buffered Whisper. Transcribe-a-clip-on-stop → anything, and Whisper batch is the simplest thing that works.

That's it. Everything else is detail. Here's the same logic as a lookup:

Need	Lean toward
Least code, fastest to ship	`@react-native-voice/voice`
Offline + consistent across platforms	`whisper.rn`
Lightweight offline, accuracy less critical	`react-native-vosk`
Max accuracy, on-device not required	Cloud (Deepgram / AssemblyAI / Google)
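The same lookup can be written as a small function, which is handy if you want the decision recorded next to the code it affects. This is a sketch of my reading of the two questions and the table above, nothing more: pickRoute, Needs, and Recommendation are hypothetical names, and accuracyCritical is a condensed stand-in for the "accuracy less critical" and "max accuracy" rows.

```typescript
type Route =
  | "@react-native-voice/voice"
  | "whisper.rn"
  | "react-native-vosk"
  | "cloud";

interface Needs {
  offlineRequired: boolean;
  accuracyCritical: boolean;
  liveCaptions: boolean; // captions-as-you-speak vs transcribe-on-stop
}

interface Recommendation {
  route: Route;
  caveat?: string;
}

function pickRoute(needs: Needs): Recommendation {
  if (needs.offlineRequired) {
    // Offline narrows it to the two bundled-model options.
    if (!needs.accuracyCritical) return { route: "react-native-vosk" };
    return needs.liveCaptions
      ? { route: "whisper.rn", caveat: "buffer audio carefully for live captions" }
      : { route: "whisper.rn" };
  }
  // Offline not required: the native recognizers or cloud.
  if (needs.accuracyCritical) {
    return { route: "cloud", caveat: "needs a network and sends audio off the device" };
  }
  return { route: "@react-native-voice/voice" };
}
```

For my case, pickRoute({ offlineRequired: true, accuracyCritical: true, liveCaptions: false }) returns { route: "whisper.rn" }, which matches where the reasoning above points.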

The core tradeoff underneath all of it: parity via each platform's own recognizer (less code, less consistency) versus parity via one shared model (more control, offline, consistent, heavier). There's no free lunch — you're picking which cost you'd rather pay.

The web wrinkle nobody warns you about

One aside, because React Native Web bit me here. If you also target the web (React Native Web renders your components to the DOM), the browser has its own speech story that does not carry over to native:

The Web Speech API (SpeechRecognition / webkitSpeechRecognition) looks like a freebie, but in Chrome it streams your audio to Google's servers. It's not a local model. Fine for a prototype, wrong for anything private.
Transformers.js (@huggingface/transformers) is the real on-device option — actual Whisper via ONNX on WebGPU, WASM fallback, weights downloaded once then fully local. Two things will ruin your day if you skip them: the audio must be mono, 16 kHz, Float32 (resample your getUserMedia output or you get garbage), and Metro/Expo Web can choke on the dynamic WASM and workers, so you self-host the model files and run inference in a Web Worker.
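The mono, 16 kHz, Float32 requirement is the one most worth seeing in code. Below is a minimal sketch of the conversion, assuming you already have the captured audio as one Float32Array per channel plus its sample rate. toMono and resampleLinear are hypothetical helpers, and the resampler is deliberately naive: plain linear interpolation with no low-pass filter, so it can alias when downsampling. It shows the shape of the step, not production-quality DSP.

```typescript
// Average all channels into a single mono channel.
function toMono(channels: Float32Array[]): Float32Array {
  const length = channels[0]?.length ?? 0;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Naive linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate = 16000,
): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two nearest input samples.
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

A typical call would be resampleLinear(toMono([left, right]), 48000), since many devices capture at 44.1 or 48 kHz. In a real browser build you may prefer to let the platform do the resampling, for example by creating the AudioContext with a 16000 sampleRate option or rendering through an OfflineAudioContext, and keep a function like this only as a fallback.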

The point is that "it works on web" and "it works on native" are two different projects wearing the same repo. Don't let a browser demo convince you the mobile problem is solved.

Where I landed

For a feature that's core to the product and needs to behave the same whether you're on an iPhone or a five-year-old Android, I'm going with one shared Whisper model (whisper.rn). The app gets bigger and the CPU works harder, and I'll take both, because the alternative is maintaining two personalities and explaining to users why transcription is "just different" on their phone.

If I only needed casual dictation, I'd wrap the native recognizers and move on with my afternoon. That's the honest answer to most of these decisions: the right route depends entirely on whether the thing is a nice-to-have or the product. Answer that first, and the library picks itself.