Integrating STT? Compare the Hidden Costs

The per-minute price listed by speech recognition APIs tells only part of the story. Here's what integrating speech-to-text (STT) into your product really costs.

You're looking for a speech recognition API for your product. You compare per-minute prices, find offers at $0.01 per minute, and figure the budget is under control. Except that the listed price never reflects the real cost. Between integration engineering, accuracy tuning on your domain-specific audio, and ongoing maintenance, the final bill can be five to ten times what you originally budgeted.

I've guided several product teams through this decision. The finding is always the same: the "speech-to-text API" line item in the initial budget is consistently underestimated.

  • 💰 Real price gap: the advertised per-minute cost hides 60 to 80% of the total spend.
  • ⚠️ Invisible engineering: integration, tuning, and edge-case handling eat up weeks.
  • 🔧 Cloud vs self-hosted: Whisper is free, but GPU infrastructure is not.
  • 🎯 Decision framework: five concrete criteria to choose without blowing your budget.

What a minute of transcription actually costs

The natural reflex when evaluating an STT API is to open the pricing page and compare per-minute rates. The numbers look reasonable: a few cents, sometimes less.

What are the real per-minute prices of the leading APIs?

According to OpenReplay's comparison of speech recognition engines in 2025, rates break down as follows: Google Cloud Speech-to-Text charges between $0.016 and $0.024 per minute depending on the model. Amazon Transcribe sits at $0.024 per minute. Azure Speech to Text offers roughly $1 per hour of audio, or $0.017 per minute for the standard model. IBM Watson drops to $0.01 per minute after the free tier.

These prices seem negligible. For 10,000 minutes of audio per month (a common volume for a B2B transcription app), the pure API bill lands between $100 and $240.
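The arithmetic behind that range can be sketched in a few lines. This is a minimal illustration using the base rates quoted above; it ignores free tiers, add-ons, and billing increments, and the function name is an assumption made for this example.

```python
# Per-minute base rates quoted in the comparison above (USD).
RATES_PER_MINUTE = {
    "IBM Watson": 0.01,
    "Google Cloud STT (lowest tier)": 0.016,
    "Azure Speech": 0.017,
    "Amazon Transcribe": 0.024,
}


def monthly_api_bill(minutes: int, rate_per_minute: float) -> float:
    """Raw API cost for one month, ignoring free tiers and paid add-ons."""
    return round(minutes * rate_per_minute, 2)


for provider, rate in RATES_PER_MINUTE.items():
    print(f"{provider}: ${monthly_api_bill(10_000, rate):,.2f}")
```

At 10,000 minutes the cheapest and most expensive base rates give $100 and $240, which is where the range above comes from.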

The problem is that this figure tells only 20 to 40% of the story.

Why the listed price isn't enough for budgeting

Take OpenAI's real-time API. According to an analysis by Seasalt.ai, the advertised rate suggests roughly $0.30 per minute (audio input + output). Their actual test measured a cost of $1 per minute, more than three times the listed price. The reason: tokens generated behind the scenes (context, reasoning, rephrasing) that inflate consumption without the developer seeing them clearly in the documentation.

This kind of discrepancy is not an isolated case. Each API has its own billing rules (per 15-second increment, per request, per feature enabled), and diarization, sentiment analysis, or language detection are often paid add-ons on top of the base rate.
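To see why billing increments matter, here is a rough sketch of how rounding each request up to a 15-second increment changes the billed volume. The increment size, the clip lengths, and the function name are assumptions for illustration, not any specific provider's rules.

```python
import math


def billed_minutes(clip_seconds: list[float], increment_seconds: int = 15) -> float:
    """Total billed minutes when each request is rounded up to a billing increment."""
    billed = sum(
        math.ceil(seconds / increment_seconds) * increment_seconds
        for seconds in clip_seconds
    )
    return billed / 60


# Hypothetical workload: 1,200 short voice commands of 5 seconds each.
clips = [5.0] * 1200
actual = sum(clips) / 60
billed = billed_minutes(clips)
print(f"actual audio: {actual:.0f} min, billed: {billed:.0f} min")
```

With these assumed numbers, 100 minutes of real audio are billed as 300 minutes. Products built around short utterances are the most exposed to this effect.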

Provider          | Price / min (base) | Free tier                   | Diarization          | Trend
Google Cloud STT  | $0.016 to $0.024   | 60 min/month + $300 credits | Yes (included)       | → stable
Amazon Transcribe | $0.024             | 60 min/month (1 year)       | Yes (included)       | → stable
Azure Speech      | ~$0.017            | 5 h/month                   | Yes (add-on)         | → stable
AssemblyAI        | ~$0.015            | $50 in credits              | Yes (included)       | ↑ strong adoption
Grok Voice Agent  | $0.05              | No                          | N/A (conversational) | ↑ new entrant
OpenAI Realtime   | ~$1.00 (measured)  | No                          | No                   | ↓ prohibitive cost

SOURCE: OpenReplay, Seasalt.ai, official documentation · Updated 05/2026

The hidden costs nobody budgets for

The API bill is the visible part. The real costs lie in everything surrounding the API call, and that's where budgets blow up.

How much engineering time should you plan for?

Integrating an STT API into an existing product takes far more than a simple REST call. You need to handle audio streaming (WebSocket for real-time), buffering, reconnection on network drops, result formatting, language management, and transcript storage.
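Reconnection alone shows how much sits around the API call. The sketch below retries a dropped connection with exponential backoff and jitter. The `connect` callable is hypothetical and stands in for whatever call your SDK uses to open a streaming session; it is not a specific provider's API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_reconnect(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, backing off exponentially between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter avoids every client reconnecting at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

A production version would also need to replay the audio buffered during the outage and reconcile partial transcripts, which is where much of the integration time goes.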

Plan for two to six weeks of development for a production-ready integration, depending on the complexity of your stack. This isn't an abstract estimate: it's what I see firsthand with clients at GoLive Software who are adding voice features to their SaaS products.

For a team billing at €500 per day, six weeks of integration comes to €15,000. Compare that to the $200 monthly API invoice.

Which edge cases derail the budget?

The accuracy advertised by providers (often 95%+) is measured on clean audio, in English, with a single speaker. Your domain audio is rarely that cooperative.

Background noise in a workshop, regional accents, technical vocabulary (product names, industry acronyms), crosstalk: each special case demands specific tuning. With Google and Azure, this means Custom Speech models trained on your own datasets. With AssemblyAI, custom vocabularies help but don't cover everything.

The time spent building those datasets, measuring the Word Error Rate on your own recordings, and iterating until accuracy reaches an acceptable level is engineering budget that nobody writes into the initial estimate.
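Measuring Word Error Rate yourself does not require a framework. Here is a minimal sketch using word-level edit distance; the normalization (lowercasing and whitespace splitting) is a simplifying assumption, and real evaluations usually also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a technical term misheard by the engine.
print(word_error_rate("restart the CNC spindle", "restart the scene spindle"))
```

One substituted word out of four gives a WER of 0.25. Run the same function over each provider's output for your own recordings to get comparable numbers.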

There's also the cost of ongoing maintenance. APIs evolve, models change, quotas get modified. Every update requires regression testing. This is not a "fire and forget" situation.

Managed cloud vs open source: the false dilemma of "free"

The temptation is strong to turn to open source to eliminate the API bill. OpenAI's Whisper, in particular, offers impressive accuracy and supports multiple languages. It's free, and it can be self-hosted.

Why "free" Whisper can cost more than an API

The Whisper large model requires a dedicated GPU to run at usable speed. A cloud GPU instance (A10G-class on AWS) costs between $0.75 and $1.50 per hour. If your app processes audio continuously, that GPU runs around the clock. At $1 per hour, 24/7, you're looking at $720 per month before even counting infrastructure maintenance.

Compared to a cloud API at $0.02 per minute, self-hosting Whisper only becomes cost-effective above 36,000 minutes of audio per month. Below that threshold, you're paying more for a service you have to maintain yourself.
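The break-even volume follows directly from those two numbers. A minimal sketch, assuming a 30-day month and a GPU that is billed around the clock; the function name is an assumption, and engineering and maintenance time are deliberately left out.

```python
def break_even_minutes(
    gpu_hourly_usd: float,
    api_rate_per_minute: float,
    hours_per_month: int = 24 * 30,
) -> float:
    """Monthly audio volume at which an always-on GPU costs the same as the API."""
    monthly_gpu_cost = gpu_hourly_usd * hours_per_month
    return monthly_gpu_cost / api_rate_per_minute


# GPU price range quoted above, compared with a $0.02-per-minute API.
for hourly in (0.75, 1.00, 1.50):
    minutes = break_even_minutes(hourly, 0.02)
    print(f"${hourly:.2f}/h GPU: break-even at about {minutes:,.0f} min/month")
```

At $1 per hour the threshold is about 36,000 minutes per month; across the quoted GPU price range it moves between roughly 27,000 and 54,000 minutes.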

Other open-source options exist. Kaldi remains a reference in research, but deployment is complex. SpeechBrain (PyTorch) integrates well with HuggingFace but requires heavy customization. Mozilla's DeepSpeech is no longer maintained. As AssemblyAI's guide on free STT APIs points out, open source is a fit for teams with dedicated ML engineers and strict data-privacy constraints. For everyone else, a managed API remains the pragmatic choice.

NVIDIA recently launched PersonaPlex, an open-source conversational model with 36-millisecond latency. Alibaba offers Qwen3 TTS (1.7 billion parameters), capable of voice cloning across nine languages. These models open up interesting possibilities for self-hosting, but they remain building blocks to assemble: no production-ready pipeline, no monitoring, no SLA.

"The real cost of a voice API isn't the per-minute price. It's the engineering time to make it work on your audio, with your stack, at your scale."

Vincent Roye, May 2026

How to choose without blowing your budget

Choosing a speech recognition API comes down to five criteria you should evaluate in this exact order.

Which criteria should you prioritize based on your context?

First, accuracy on your real audio. Not marketing benchmarks. Take 50 recordings representative of your use case and measure the Word Error Rate on each provider. That's the only metric that matters.

Second, time-to-production. With the market projected to reach $60 billion by 2032, the race to ship voice features is on. If you don't have an ML engineer on the team, self-hosting will slow you down by several months. AssemblyAI and Deepgram are betting heavily on developer experience: clean SDKs, clear documentation, copy-paste examples. Google and AWS are more powerful but take longer to configure.

Third, total cost of ownership. API price + integration engineering + maintenance + infrastructure (if self-hosted). A tool that's slightly more expensive per minute can end up cheaper if it saves you dozens of engineering hours every month.
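A rough way to compare options on total cost of ownership is to add the four components over a fixed horizon. In the sketch below, both options and all of their figures are hypothetical, chosen only to illustrate the trade-off, and everything is expressed in a single currency for simplicity.

```python
from dataclasses import dataclass


@dataclass
class SttOption:
    name: str
    rate_per_minute: float             # API price per audio minute
    integration_days: float            # one-off engineering effort
    maintenance_days_per_month: float  # recurring engineering effort
    infra_per_month: float = 0.0       # GPU or hosting cost if self-hosted

    def total_cost(self, minutes_per_month: int, months: int, day_rate: float) -> float:
        api = self.rate_per_minute * minutes_per_month * months
        integration = self.integration_days * day_rate
        maintenance = self.maintenance_days_per_month * day_rate * months
        infra = self.infra_per_month * months
        return api + integration + maintenance + infra


# Hypothetical options: every figure below is an illustrative assumption.
cheaper_api = SttOption("cheaper API, harder setup", 0.016, 30, 3)
pricier_api = SttOption("pricier API, better SDK", 0.024, 15, 1)

for option in (cheaper_api, pricier_api):
    cost = option.total_cost(minutes_per_month=10_000, months=12, day_rate=500)
    print(f"{option.name}: {cost:,.0f} over 12 months")
```

Under these assumptions the option with the higher per-minute rate comes out at less than half the 12-month cost, because engineering days dominate the API bill at this volume.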

Fourth, scalability. Check concurrency limits, regional quotas, and uptime guarantees. For real-time captioning, geographic latency becomes critical.

Fifth, data privacy. If your audio contains sensitive data (medical, legal, financial), self-hosting or a provider with dedicated hosting may be a regulatory requirement, not a technical choice.

Should you outsource STT integration?

That's the question I always raise with the teams I work with. Integrating a voice component into an existing SaaS touches backend (streaming, storage, processing), frontend (recording UI, real-time display), and infrastructure (scaling, monitoring) all at once.

A team of specialized offshore developers, familiar with these audio pipelines and equipped with AI tools to accelerate integration, can cut time-to-production by a factor of two to three. The cost is structurally lower than a local team, and the technical quality holds up when the team is well selected.

I see it regularly: a small, senior, well-tooled team ships faster than a large team discovering the subject for the first time. This is even more true when devs use AI agents to speed up prototyping and debugging of audio pipelines.

The most common mistake is treating STT integration as a "small module to plug in." It's a full project in its own right, with its own technical risks. As with any outsourcing decision, AI doesn't replace human expertise: it amplifies it when the team knows what it's doing.

The verdict

The speech recognition API market has never offered so many options, and per-minute prices have never been this low. That's precisely what makes the trap so dangerous: a $0.01-per-minute rate makes you think the voice feature is practically free, when the real entry ticket is measured in weeks of engineering and ongoing maintenance.

Start by testing two or three APIs on your own audio (AssemblyAI offers $50 in credits, Google and AWS have free tiers). Measure the real WER. Budget the integration as a project, not as a plug-in. And if your team has no experience with audio pipelines, outsource to people who do, rather than discovering the edge cases in production.

Frequently asked questions

What is the cheapest speech recognition API in 2026?

In raw per-minute pricing, IBM Watson ($0.01/min) and AssemblyAI (~$0.015/min) are the cheapest among cloud APIs. OpenAI's Whisper is free to use but requires a GPU for self-hosting, which generates infrastructure costs. The "cheapest" always depends on your volume: below roughly 36,000 minutes per month (the break-even for a $1-per-hour GPU against a $0.02-per-minute API), managed cloud remains more economical than self-hosting.

Can you use Whisper in production without a GPU?

Whisper offers lighter models (tiny, base, small) that run on CPU, but transcription speed drops dramatically. The "small" model processes audio at roughly 0.3x real-time on a modern CPU, meaning one minute of audio takes over three minutes to transcribe. For non-urgent batch processing, that's acceptable. For real-time or high-volume use, a GPU remains essential.
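The speed factor translates into wall-clock time and daily capacity with simple arithmetic. A small sketch, taking the 0.3x figure above as given; the function names and the single-worker assumption are illustrative.

```python
def processing_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Wall-clock minutes to transcribe audio at a given speed factor.

    speed_factor is audio time processed per unit of wall-clock time
    (1.0 means exactly real time).
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes / speed_factor


def daily_capacity_minutes(speed_factor: float, workers: int = 1) -> float:
    """Audio minutes that `workers` machines can transcribe in 24 hours."""
    return 24 * 60 * speed_factor * workers


print(f"1 min of audio takes {processing_minutes(1, 0.3):.1f} min to transcribe")
print(f"one CPU worker handles {daily_capacity_minutes(0.3):.0f} min of audio per day")
```

At 0.3x, a single CPU worker covers about 432 minutes of audio per day. That is enough for 10,000 minutes per month of batch work (about 333 minutes per day), but it can never keep up with a live stream.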

How long does it take to integrate an STT API into an existing SaaS?

Plan for two to six weeks for a full production integration. The first week covers prototyping and accuracy testing. The following weeks are spent on real-time streaming, error handling, domain vocabulary tuning, and load testing. This timeline assumes a team that has already worked with audio APIs. Without that experience, expect to double it.

Is diarization (speaker identification) always included?

No. Google Cloud STT, Amazon Transcribe, and AssemblyAI include diarization in their standard pricing. Azure offers it as a paid add-on. Conversational APIs like Grok Voice Agent don't perform diarization in the traditional sense (they manage an agent/user dialogue). For open-source solutions, diarization requires a separate pipeline (pyannote.audio is the reference), which adds integration complexity.

Does STT API pricing change with the number of supported languages?

For most cloud providers, the per-minute rate stays the same regardless of language. Google charges the same price for its 125+ languages. The difference lies in accuracy: models are optimized for English, and Word Error Rate increases significantly for less-represented languages. If your product targets French, test French accuracy specifically before committing.

Vincent Roye
CEO & Founder, GoLive Software

French engineer based in Vietnam since 2014. He leads a team of senior full-stack developers and has helped startups and SMEs structure their tech teams for over 11 years.