Meeting Transcription: How to Get High Accuracy at Scale

Every meeting intelligence product eventually hits the same wall: transcription accuracy that worked fine in a demo fails in production. The sales team's call recordings have background noise. The engineering all-hands has fifteen speakers. The customer in Tokyo has a strong accent. The transcript comes back with enough errors that your summarization model produces nonsense.

Getting meeting transcription right at scale is not a single configuration decision. It is a combination of provider selection, audio pipeline choices, provider configuration per use case, and a measurement discipline that tells you when accuracy is acceptable and when it is not. Companies that treat transcription as a commodity checkbox rarely invest enough in this layer and pay for it in downstream quality.

The good news is that the current generation of transcription providers, including AssemblyAI, Deepgram, and JigsawStack, is genuinely good at its core job on reasonable audio. The problem is that "reasonable audio" is a narrower category than most people assume. This guide is about closing the gap between benchmark performance and real-world production performance.

In this guide we cover the full transcription landscape: how each provider is differentiated, what audio quality factors matter most, how to choose between streaming and post-call transcription, and the complete MeetStream workflow for configuring post-call transcription per use case. Let's get into it.

The Meeting Transcription Landscape

The transcription provider market has consolidated around a few well-differentiated options. Understanding their actual strengths, rather than their marketing, is the first step to making good architectural decisions.

Deepgram nova-3 is the current standard for noisy, real-world audio. It was trained with heavy emphasis on telephony and call center data, which means it handles degraded audio, strong accents, and mixed-quality participant audio better than alternatives. Its weakness is specialized vocabulary in technical domains. It supports real-time streaming via nova-2 and has solid multi-language support for European languages.

AssemblyAI universal-2 produces excellent results on clean, multi-speaker recordings. Its speaker labeling system is mature and its utterance-level transcript format is clean to parse. The universal-streaming-english model is competitive for real-time use cases. It underperforms Deepgram on heavily degraded audio but is often preferred for enterprise recordings where participants are on quality headsets.

Deepgram API playground for testing transcription accuracy. Source: Deepgram.

JigsawStack fills a specific niche: language diversity. Its auto-language detection and broad language support make it the best choice for global teams where the meeting language is uncertain or where participants code-switch between languages. Its English accuracy is slightly below Deepgram and AssemblyAI on comparable audio.

Native meeting captions (the meeting_captions provider option) use the platform's own speech recognition. This is available as an option but accuracy varies significantly by platform and is generally lower than the dedicated providers. It is useful for situations where you cannot use an external provider due to data residency requirements.

Provider Comparison at a Glance

| Provider | Best For | Streaming Support | Speaker Labels | Language Range |
| --- | --- | --- | --- | --- |
| Deepgram nova-3 | Noisy audio, telephony, accents | Yes (nova-2) | Yes (diarize: true) | European + Asian |
| AssemblyAI universal-2 | Clean meetings, enterprise | Yes (universal-streaming-english) | Yes (speaker_labels: true) | European |
| JigsawStack | Multilingual, auto-detect | No | Yes (by_speaker: true) | Wide global range |
| Meeting Captions | Data residency constraints | Yes (native) | No | Platform-dependent |

Audio Quality Factors That Determine Accuracy

Provider selection matters less than audio quality for overall accuracy. The best provider in the world cannot reliably transcribe audio with 5 dB signal-to-noise ratio (SNR). Before optimizing your provider configuration, understand what audio quality distribution your users actually produce.

The main quality factors are: microphone type (headset versus laptop versus room speaker), background noise level, speaking distance from microphone, codec compression (Zoom's audio compression versus raw PCM), and participant count (more speakers means more overlap risk).

For a product serving distributed sales teams, the typical audio distribution is: 30 percent on headsets with good SNR, 50 percent on laptop microphones with moderate SNR, 20 percent on mobile devices or in noisy environments with poor SNR. Your accuracy benchmarks need to weight these tiers proportionally to their frequency in your user base.
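As a concrete illustration of proportional weighting, fleet-wide expected WER is the frequency-weighted average of per-tier WER. The tier shares and per-tier WER figures below are hypothetical:

```python
# Hypothetical audio-quality tiers: (share of meetings, measured WER for that tier)
tiers = {
    "headset_good_snr": (0.30, 0.06),
    "laptop_moderate_snr": (0.50, 0.14),
    "mobile_noisy": (0.20, 0.30),
}

def weighted_wer(tiers: dict) -> float:
    """Weight each tier's WER by how often that tier occurs in the user base."""
    return sum(share * wer for share, wer in tiers.values())

print(f"Fleet-wide expected WER: {weighted_wer(tiers):.3f}")  # → 0.148
```

A provider that wins on the clean tier but loses badly on the noisy tier can still lose the weighted comparison, which is why the per-tier shares matter as much as the per-tier numbers.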

The single highest-impact improvement available to any developer building on MeetStream is using live_audio_required for per-speaker streams instead of recording the mixed channel. Per-speaker streams eliminate overlap entirely for the transcription step because each utterance contains exactly one voice. See multi-speaker transcription for the implementation.
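A minimal sketch of requesting per-speaker streams at bot creation follows. The exact placement of live_audio_required inside recording_config is an assumption here; consult the MeetStream API reference for the authoritative schema:

```python
# Sketch only: the placement of "live_audio_required" inside recording_config
# is an assumption -- verify against the MeetStream API reference.
payload = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Per-Speaker Transcriber",
    "recording_config": {
        # Request per-speaker audio streams so each utterance contains one voice
        "live_audio_required": True,
        "transcript": {
            "provider": "deepgram",
            "model": "nova-3",
            "diarize": True,
        },
    },
}
```

This payload would be sent to the same create_bot endpoint shown in the configuration section below.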

Selecting Your Provider per Scenario

There is no universally optimal provider. The selection depends on your specific use case, audio characteristics, and language requirements. Here is a decision framework:

Deepgram's transcription API: fast, accurate, and scalable. Source: Datatunnel.
  1. If participants use noisy environments, phone audio, or you expect highly variable audio quality: use Deepgram nova-3 with diarize:true
  2. If meetings are internal enterprise calls with good audio and you need utterance-level speaker attribution: use AssemblyAI universal-2 with speaker_labels:true
  3. If your user base spans multiple languages or you cannot predict the meeting language in advance: use JigsawStack with language:"auto" and by_speaker:true
  4. If you need real-time captions displayed to participants with no external provider: use meeting_captions and accept lower accuracy
  5. If accuracy is critical and latency allows: run a test set of 50 recordings through all three providers, measure WER, and pick the winner for your specific audio distribution
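The decision framework above can be sketched as a routing function. The config dicts mirror the provider configurations later in this guide; the input signals (expected SNR, known language, external-provider allowance) are hypothetical values you would derive from your own meeting metadata:

```python
from typing import Optional

def select_transcript_config(expected_snr_db: Optional[float],
                             known_language: Optional[str],
                             external_provider_allowed: bool) -> dict:
    """Route a meeting to a transcript config based on pre-call signals.
    The signal names are hypothetical; derive them from your own metadata."""
    if not external_provider_allowed:
        # Data residency constraint: fall back to native platform captions
        return {"provider": "meeting_captions"}
    if known_language is None or known_language != "en":
        # Unknown or non-English language: auto-detect with JigsawStack
        return {"provider": "jigsawstack", "language": "auto", "by_speaker": True}
    if expected_snr_db is not None and expected_snr_db < 20:
        # Noisy or degraded audio: Deepgram's telephony strength
        return {"provider": "deepgram", "model": "nova-3", "diarize": True}
    # Clean English enterprise audio
    return {"provider": "assemblyai", "speech_models": ["universal-2"],
            "speaker_labels": True}

print(select_transcript_config(12.0, "en", True)["provider"])  # → deepgram
```

The 20 dB SNR cutoff is an illustrative threshold, not a measured one; calibrate it against your own accuracy data.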

Complete API Configuration for Each Provider

Here are production-ready bot creation configurations for each transcription provider scenario:

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.meetstream.ai/api/v1"

def create_bot(meeting_link: str, transcript_config: dict, bot_name: str = "Transcriber") -> str:
    response = requests.post(
        f"{BASE_URL}/bots/create_bot",
        json={
            "meeting_link": meeting_link,
            "bot_name": bot_name,
            "recording_config": {
                "transcript": transcript_config
            }
        },
        headers={"Authorization": f"Token {API_KEY}"}
    )
    response.raise_for_status()
    return response.json()["bot_id"]

# Configuration A: Deepgram nova-3 with diarization
# Use for: noisy environments, telephony, mixed quality
deepgram_config = {
    "provider": "deepgram",
    "model": "nova-3",
    "diarize": True
}

# Configuration B: AssemblyAI universal-2 with speaker labels
# Use for: clean enterprise meetings, detailed speaker attribution
assemblyai_config = {
    "provider": "assemblyai",
    "speech_models": ["universal-2"],
    "speaker_labels": True
}

# Configuration C: JigsawStack with auto language detection
# Use for: multilingual meetings, global teams
jigsawstack_config = {
    "provider": "jigsawstack",
    "language": "auto",
    "by_speaker": True
}

# Configuration D: Native meeting captions
# Use for: data residency requirements, fallback only
captions_config = {
    "provider": "meeting_captions"
}

# Example usage
bot_id = create_bot(
    meeting_link="https://meet.google.com/abc-defg-hij",
    transcript_config=deepgram_config,
    bot_name="Sales Call Recorder"
)
print(f"Bot created: {bot_id}")

The Full Post-Call Workflow

Understanding the complete lifecycle of a transcription job prevents the most common production mistakes: fetching before the transcript is ready, missing webhook events, and not handling partial transcripts from failed sessions.

from flask import Flask, request, jsonify
import requests
import os

app = Flask(__name__)
API_KEY = os.environ["MEETSTREAM_API_KEY"]

@app.route("/meetstream-webhook", methods=["POST"])
def handle_meetstream_event():
    payload = request.json
    event = payload.get("event")
    bot_id = payload.get("bot_id")

    # Bot lifecycle events
    if event == "bot.joining":
        print(f"[{bot_id}] Bot is joining the meeting")

    elif event == "bot.inmeeting":
        print(f"[{bot_id}] Bot has joined and is now recording")

    elif event == "bot.stopped":
        status = payload.get("bot_status")
        print(f"[{bot_id}] Bot stopped, status: {status}")
        # Do NOT fetch transcript here. Wait for transcription.processed.

    # Media ready events (arrive after bot.stopped)
    elif event == "audio.processed":
        audio_url = payload.get("audio_url")
        print(f"[{bot_id}] Audio recording ready at: {audio_url}")

    elif event == "video.processed":
        video_url = payload.get("video_url")
        print(f"[{bot_id}] Video recording ready at: {video_url}")

    elif event == "transcription.processed":
        transcript_id = payload.get("transcript_id")
        print(f"[{bot_id}] Transcript ready, id: {transcript_id}")
        process_transcript(bot_id, transcript_id)

    return jsonify({"status": "ok"})

def process_transcript(bot_id: str, transcript_id: str) -> str:
    """Fetch and process a completed transcript."""
    resp = requests.get(
        f"https://api.meetstream.ai/api/v1/transcript/{transcript_id}/get_transcript",
        headers={"Authorization": f"Token {API_KEY}"}
    )
    data = resp.json()

    # Build speaker-attributed transcript
    lines = []
    current_speaker = None
    current_words = []

    for word in data.get("words", []):
        speaker = word.get("speaker", word.get("speakerName", "Unknown"))
        if speaker != current_speaker:
            if current_words and current_speaker:
                lines.append(f"{current_speaker}: {' '.join(current_words)}")
            current_speaker = speaker
            current_words = [word.get("word", "")]
        else:
            current_words.append(word.get("word", ""))

    if current_words and current_speaker:
        lines.append(f"{current_speaker}: {' '.join(current_words)}")

    transcript_text = "\n".join(lines)

    # Store or send downstream
    print(f"Processed {len(lines)} speaker turns for bot {bot_id}")
    return transcript_text

Measuring Accuracy in Production

Measuring transcription accuracy at scale requires a sampling strategy. You cannot verify every transcript manually. Instead, build a sampling pipeline: randomly select 1 to 3 percent of meetings, have them manually verified against the audio, and compute word error rate (WER) on that sample. Track WER by audio quality tier (clean, medium, degraded) and by provider.
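WER on a sampled transcript can be computed with a standard word-level edit distance. This is a minimal pure-Python sketch; production pipelines typically also normalize punctuation and number formatting before comparing:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein distance over word tokens."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("move the demo to friday", "move a demo on friday"))  # → 0.4
```

Run this against the manually verified reference for each sampled meeting and aggregate per tier and per provider.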

A practical accuracy target for a production meeting transcription product: under 10 percent WER on clean audio, under 20 percent WER on medium-quality audio, and under 35 percent WER on degraded audio. Above these thresholds, downstream NLP tasks (summarization, action item extraction, keyword detection) produce noticeably degraded results.

Deepgram vs AssemblyAI transcription accuracy comparison. Source: Gladia.

If your measured WER is above target, the interventions in order of impact are: (1) switch to a better-matched provider for your audio distribution, (2) enable custom vocabulary or keyword boosting for domain-specific terms, (3) implement audio pre-processing (noise gate, high-pass filter) before the audio reaches the provider, and (4) for multi-speaker meetings, move to per-speaker transcription using live_audio_required. See improving transcription accuracy for detailed WER measurement code and pre-processing implementations.
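As an illustration of intervention (3), a one-pole high-pass filter attenuates low-frequency rumble (HVAC, desk bumps) before audio reaches the provider. This is a pure-Python sketch; real pipelines would typically use scipy.signal or a DSP library, and the 80 Hz cutoff is an assumed starting point, not a tuned value:

```python
import math

def high_pass(samples: list, sample_rate: int, cutoff_hz: float = 80.0) -> list:
    """First-order (one-pole) high-pass filter:
    y[n] = a * (y[n-1] + x[n] - x[n-1]), with a derived from the cutoff."""
    if not samples:
        return []
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(a * (out[-1] + samples[n] - samples[n - 1]))
    return out

# A constant (DC) offset decays to near zero after the initial transient.
filtered = high_pass([1.0] * 16000, sample_rate=16000)
print(abs(filtered[-1]) < 0.05)  # → True
```

Apply the filter to the raw PCM samples before handing audio to the provider; anything below roughly 80 Hz carries little speech information but plenty of noise energy.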

When you reach acceptable WER targets on your production workload, the MeetStream dashboard gives you visibility into meeting volumes and transcription job status to monitor at scale. If you are comparing transcription APIs for your specific use case, run benchmarks on your own audio rather than relying on published benchmarks, which rarely reflect the audio quality distribution of real production workloads.

FAQ

What is the best meeting transcription API for sales call recordings?

Deepgram nova-3 with diarize:true is the strongest starting point for sales call recordings. Sales calls frequently involve mobile audio, noisy environments, and participants with varied accents. Deepgram's telephony training data gives it an advantage in these conditions. For clean headset calls, AssemblyAI universal-2 is equally competitive. Benchmark both on your own call recordings before committing to either.

How does meeting transcription software compare to building a custom pipeline?

Building on a meeting bot API like MeetStream plus a transcription provider gives you production-ready infrastructure without managing browser automation, audio capture, or codec handling. A custom pipeline built from headless browsers (Puppeteer) and custom audio capture is significantly more fragile, harder to maintain across platform UI changes, and requires handling every edge case that meeting bot APIs have already solved. The total engineering cost of a custom pipeline is typically 3 to 6 months higher than using an API-based approach.

Can I switch transcription providers per meeting based on expected audio quality?

Yes. The recording_config.transcript.provider field is set per bot creation request, so you can dynamically select the provider when you create each bot. Build a routing function that selects the provider based on signals you have at bot creation time: participant count (more speakers suggests Deepgram), meeting type (external customer call versus internal standup), or user preferences. This per-meeting routing is a production pattern used by teams with heterogeneous meeting types.

What transcription accuracy should I expect at scale?

At scale, you will see a distribution of WER across your meeting corpus. Expect roughly: 20 to 30 percent of meetings to transcribe at under 5 percent WER (clean audio, few speakers), 50 to 60 percent between 5 and 20 percent WER, and 10 to 20 percent above 20 percent WER (noisy, many speakers, or non-English audio). The goal is to shift the distribution leftward through provider optimization and audio quality improvement, not to eliminate the high-WER tail entirely.
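Tracking that distribution over time can be as simple as bucketing per-meeting WER values into tiers. The tier boundaries below follow the figures above; the sample values are hypothetical:

```python
from collections import Counter

def wer_tier(wer: float) -> str:
    """Bucket a per-meeting WER into the tiers described above."""
    if wer < 0.05:
        return "clean (<5%)"
    if wer < 0.20:
        return "typical (5-20%)"
    return "degraded (>=20%)"

# Hypothetical per-meeting WER values from a sampling pipeline
sample_wers = [0.03, 0.04, 0.08, 0.11, 0.15, 0.18, 0.22, 0.31, 0.06, 0.09]
distribution = Counter(wer_tier(w) for w in sample_wers)
print(dict(distribution))
```

Watching the tier counts week over week shows whether provider or audio-quality interventions are actually shifting the distribution leftward.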

How does the MeetStream API handle transcription for Zoom versus Google Meet?

The MeetStream API abstracts platform differences: you provide a meeting link and the bot handles platform-specific behavior. For Zoom, the bot requires an approved Zoom App Marketplace application; for Google Meet and Teams, no additional approval is needed. Transcription configuration is identical across platforms. Audio quality does differ: Zoom's audio codec has higher compression than Google Meet in some configurations, which can slightly affect transcription accuracy on Zoom calls. See the MeetStream platform documentation for platform-specific setup requirements.