Improving Transcription Accuracy in Noisy Meeting Environments

Your transcription pipeline looked great in testing. Clean audio, single speaker, quiet office. Then it hit production: a sales rep calling from a coffee shop, two participants talking over each other, a remote engineer whose network is dropping packets. Accuracy fell off a cliff, and now the meeting summaries are useless.

The question of how to improve transcription accuracy in real-world environments is largely a question of understanding which variables you control and which ones you do not. You cannot control whether a participant uses a headset. You often cannot control background noise. You can control which transcription provider you use, how you configure it, and how you measure accuracy on your specific workload.

This guide covers the main audio quality factors that affect accuracy, how to choose between Deepgram, AssemblyAI, JigsawStack, and native captions for different noise conditions, pre-processing options that help before audio even reaches the model, and how to set up a Word Error Rate (WER) measurement pipeline so you are making evidence-based decisions rather than guessing.

Let's go through each layer in turn.

Audio Quality Factors That Affect Accuracy

Speech recognition models are sensitive to a specific set of audio degradation patterns. Understanding which patterns are present in your workload helps you target interventions precisely.

Background noise is the most common issue. Broadband noise (HVAC, traffic, crowd noise) raises the noise floor and reduces the signal-to-noise ratio (SNR). Most modern neural STT models are trained with noise augmentation and handle moderate SNR degradation (15-25 dB SNR) reasonably well. Below 10 dB SNR, accuracy drops sharply across all providers. The main variables are the participant's microphone quality and their physical environment.
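To gauge where a recording sits relative to those SNR thresholds, a rough estimate only needs a speech segment and a noise-only segment from the same recording (for example, the silence before anyone speaks). A minimal sketch, assuming you can isolate both:

```python
import numpy as np

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Rough SNR estimate in dB from a speech segment and a
    noise-only segment of the same recording."""
    p_speech = float(np.mean(speech.astype(np.float64) ** 2))
    p_noise = float(np.mean(noise.astype(np.float64) ** 2))
    if p_noise == 0.0:
        return float("inf")
    # Power ratio in decibels; the speech segment also contains noise,
    # so this slightly overestimates true SNR at low ratios.
    return 10.0 * np.log10(p_speech / p_noise)
```

A reading persistently below about 10 dB on production audio is a signal to prioritize provider choice and pre-processing rather than model tweaks.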

Overlapping speech is harder to handle than background noise. When two voices are simultaneously active on the same audio channel, the model has to decide which sequence of phonemes to decode. Models trained on conversational data handle brief overlaps; extended simultaneous speech causes significant WER increases. The MeetStream per-speaker stream approach via live_audio_required sidesteps this completely because each participant's audio arrives on a separate channel.

[Image: Using OpenAI Whisper for speech-to-text transcription. Source: Datatas.]

Far-field microphones (laptop built-ins, room speakerphones) introduce reverberation and increased background pickup. The audio from a MacBook microphone in a conference room sounds fundamentally different from close-field headset audio. If most of your users join from laptops, your accuracy benchmarks need to reflect that.

VoIP codec compression also matters. Zoom and Teams apply audio compression that can introduce artifacts, especially at lower bitrates. These artifacts are familiar to the models because the training data typically includes compressed audio, but they still contribute to accuracy degradation compared to uncompressed recordings.

Provider Selection Guide for Noisy Environments

Different providers have different noise robustness profiles. This is not a case where one provider is universally better; the right choice depends on your specific audio distribution.

| Provider | Model | Noise Robustness | Best For | Weaknesses |
| --- | --- | --- | --- | --- |
| Deepgram | nova-3 | High | Noisy environments, phone calls, mixed-quality audio | Specialized vocabulary, non-English |
| AssemblyAI | universal-2 | Medium-High | General meetings, good audio quality, multi-speaker | Very noisy environments |
| JigsawStack | auto language | Medium | Multilingual meetings, language detection needed | Noisy audio, heavy accents |
| Meeting Captions | Platform native | Varies | Real-time display, no additional latency | No export, variable accuracy by platform |

Deepgram nova-3 was specifically trained for telephony and noisy real-world audio. It is the recommended starting point for applications where you expect significant background noise or inconsistent microphone quality across your user base. AssemblyAI universal-2 produces excellent results on clean audio and handles multi-speaker well through its own internal diarization, but shows higher WER under heavy noise conditions.

For technical domains (developer tools, engineering meetings, medical, legal), consider that no out-of-the-box model will have excellent coverage of specialized vocabulary. Custom vocabulary or keyword boosting features available on both Deepgram and AssemblyAI can help significantly.
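As a sketch of what keyword boosting looks like, the fragments below pass a shared list of domain terms to each provider. The parameter names keywords (Deepgram) and word_boost (AssemblyAI) come from the provider APIs; whether they can be passed through inside recording_config.transcript exactly like this is an assumption, so check the MeetStream docs for the supported shape:

```python
# Hypothetical domain terms for a developer-tools product
domain_terms = ["Kubernetes", "MeetStream", "nova-3", "Terraform"]

# Deepgram: keyword boosting via its keywords parameter
deepgram_vocab_config = {
    "provider": "deepgram",
    "model": "nova-3",
    "keywords": domain_terms,
}

# AssemblyAI: keyword boosting via its word_boost parameter
assemblyai_vocab_config = {
    "provider": "assemblyai",
    "speech_models": ["universal-2"],
    "word_boost": domain_terms,
}
```

Keep the boost list short and specific; boosting hundreds of common words tends to hurt accuracy on everything else.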

Configuring Recording Settings for Accuracy

The recording_config.transcript field in the MeetStream bot creation request is where you select and configure your transcription provider. Here are configurations optimized for different scenarios:

import requests

# Configuration for noisy, mixed-quality meetings
noisy_config = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Transcriber",
    "recording_config": {
        "transcript": {
            "provider": "deepgram",
            "model": "nova-3",
            "diarize": True
        }
    }
}

# Configuration for clean audio, high accuracy
clean_config = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Transcriber",
    "recording_config": {
        "transcript": {
            "provider": "assemblyai",
            "speech_models": ["universal-2"],
            "speaker_labels": True
        }
    }
}

# Configuration for multilingual meetings with language detection
multilingual_config = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Transcriber",
    "recording_config": {
        "transcript": {
            "provider": "jigsawstack",
            "language": "auto",
            "by_speaker": True
        }
    }
}

response = requests.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    json=noisy_config,
    headers={"Authorization": "Token YOUR_API_KEY"}
)
[Image: Speech-to-text transcription types and applications. Source: BotPenguin.]

Pre-Processing: Noise Reduction Before Transcription

If you have access to audio before it goes to the transcription provider, pre-processing can meaningfully improve accuracy. The most impactful interventions are noise gate, normalization, and high-pass filtering.

A noise gate attenuates audio below a threshold amplitude. This suppresses continuous low-level background noise without affecting speech. The risk is cutting the beginning of soft-spoken words if the threshold is too aggressive.

import numpy as np
from scipy.io import wavfile

def apply_noise_gate(
    audio: np.ndarray,
    threshold_db: float = -40.0,
    sample_rate: int = 48000
) -> np.ndarray:
    """
    Apply a simple noise gate. Attenuates frames below threshold_db.
    audio: float32 array, range [-1, 1]
    """
    frame_size = int(sample_rate * 0.02)  # 20ms frames
    output = audio.copy()

    for i in range(0, len(audio) - frame_size, frame_size):
        frame = audio[i:i + frame_size]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > 0:
            db = 20 * np.log10(rms)
        else:
            db = -96.0

        if db < threshold_db:
            output[i:i + frame_size] *= 0.01  # attenuate to near silence

    return output

def normalize_audio(audio: np.ndarray, target_db: float = -18.0) -> np.ndarray:
    """Normalize peak amplitude to target_db."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio
    target_linear = 10 ** (target_db / 20.0)
    return audio * (target_linear / peak)

High-pass filtering removes low-frequency rumble (HVAC, traffic vibration) below 100 Hz that carries no speech information but degrades model performance. A fourth-order Butterworth filter with an 80 Hz cutoff is a low-cost, high-value operation.

from scipy.signal import butter, sosfilt

def highpass_filter(audio: np.ndarray, cutoff_hz: float = 80.0, sample_rate: int = 48000) -> np.ndarray:
    """
    Apply a high-pass filter to remove low-frequency noise.
    """
    sos = butter(4, cutoff_hz, btype='high', fs=sample_rate, output='sos')
    return sosfilt(sos, audio)
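One practical detail: scipy.io.wavfile.read returns integer PCM for most WAV files, while the functions above expect float32 in [-1, 1]. A small conversion helper (the scaling factors are standard PCM conventions):

```python
import numpy as np

def pcm_to_float32(audio: np.ndarray) -> np.ndarray:
    """Convert integer PCM samples (as returned by scipy.io.wavfile.read)
    to float32 in [-1, 1]. Float input is passed through unchanged."""
    if audio.dtype == np.int16:
        return audio.astype(np.float32) / 32768.0
    if audio.dtype == np.int32:
        return audio.astype(np.float32) / 2147483648.0
    if audio.dtype == np.uint8:  # 8-bit WAV is unsigned
        return (audio.astype(np.float32) - 128.0) / 128.0
    return audio.astype(np.float32)
```

A reasonable processing order is high-pass first (so rumble does not inflate per-frame energy), then the noise gate, then normalization last.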

Measuring Word Error Rate on Your Workload

The most important practice for improving transcription accuracy is measurement. WER is defined as (substitutions + deletions + insertions) / total reference words. A WER of 0.05 means 5 percent of words were wrong: for example, 3 substitutions, 1 deletion, and 1 insertion against a 100-word reference.

To measure WER on your specific workload, you need a ground truth dataset: a set of meeting recordings with human-verified transcripts. Creating 50 to 100 representative examples is enough to get statistically meaningful results.
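Before comparing strings, normalize both the reference and the hypothesis so WER reflects word errors rather than formatting differences (providers disagree on casing, punctuation, and number formatting). A minimal sketch; real pipelines often also normalize numbers and abbreviations:

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase and strip punctuation so provider formatting
    differences do not count as word errors."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)      # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Run both sides of every comparison through the same normalization, or the resulting WER numbers are not comparable across providers.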

[Image: Deepgram API playground for testing transcription accuracy. Source: Deepgram.]

def compute_wer(reference: str, hypothesis: str) -> float:
    """
    Compute Word Error Rate between reference and hypothesis strings.
    Uses dynamic programming (edit distance on word sequences).
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    r = len(ref_words)
    h = len(hyp_words)

    # Initialize edit distance matrix
    d = [[0] * (h + 1) for _ in range(r + 1)]

    for i in range(r + 1):
        d[i][0] = i
    for j in range(h + 1):
        d[0][j] = j

    for i in range(1, r + 1):
        for j in range(1, h + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j],      # deletion
                    d[i][j - 1],      # insertion
                    d[i - 1][j - 1]   # substitution
                )

    if r == 0:
        return 0.0

    return d[r][h] / r

def benchmark_provider(test_cases: list, provider_name: str) -> dict:
    """
    test_cases: list of {reference: str, hypothesis: str}
    """
    wer_scores = [compute_wer(tc["reference"], tc["hypothesis"]) for tc in test_cases]
    return {
        "provider": provider_name,
        "mean_wer": sum(wer_scores) / len(wer_scores),
        "min_wer": min(wer_scores),
        "max_wer": max(wer_scores),
        "sample_count": len(wer_scores)
    }

Segment Your Test Data by Audio Quality

Average WER across all meetings is a misleading metric if your user base has heterogeneous audio quality. Segment your test data by audio quality tier: clean (headset, quiet environment), medium (laptop mic, mild background noise), and degraded (phone call, noisy environment). Report WER separately for each tier.
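A sketch of that per-tier report, assuming each test case has already been scored and tagged with a hypothetical tier field:

```python
from collections import defaultdict

def wer_by_tier(results: list) -> dict:
    """Average WER per audio-quality tier.
    results: list of {"tier": str, "wer": float} records."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["tier"]].append(record["wer"])
    return {tier: sum(scores) / len(scores) for tier, scores in buckets.items()}
```

With per-tier numbers in hand, you can weight each tier by its share of production traffic instead of optimizing a single blended average.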

This segmentation usually reveals that providers that perform similarly on clean audio diverge significantly on the degraded tier. For most SaaS products targeting sales teams and remote workers, the degraded tier represents a disproportionate share of actual usage. Optimize for the tier that matters most to your users, not the one that looks best in a benchmark.

Once you have baseline WER measurements per provider per tier, apply pre-processing and measure again. The combination of provider selection and pre-processing typically reduces WER by 15 to 30 percent on degraded audio compared to using the default provider with no pre-processing.

FAQ

Which transcription provider is best for noise reduction transcription?

Deepgram nova-3 consistently outperforms other providers on noisy, mixed-quality audio. It was trained on telephony data including call center recordings with significant background noise. If your workload includes phone calls, noisy offices, or inconsistent microphone quality across users, start with Deepgram nova-3 and measure WER before trying alternatives. See meeting transcription accuracy for a broader provider comparison.

What is a good Word Error Rate for meeting transcription quality?

For clean audio with a single speaker, state-of-the-art models achieve 3 to 8 percent WER. For multi-speaker meetings with moderate background noise, 10 to 20 percent WER is typical with good providers. Above 25 percent WER, downstream NLP tasks like summarization and action item extraction start producing noticeably poor results. Set your quality threshold based on what downstream tasks need, not on benchmark numbers alone.

Do transcription accuracy tips like custom vocabulary make a real difference?

Yes, significantly for domain-specific vocabulary. Proper nouns, product names, and technical terms have high substitution rates in generic models because they appear infrequently in training data. Both Deepgram (keywords parameter) and AssemblyAI (word_boost parameter) let you provide a list of terms to boost. For a B2B SaaS product where participants frequently reference competitor names, integration names, and internal product terminology, custom vocabulary can reduce domain-specific WER by 40 to 60 percent on those terms.

Can the MeetStream API help with meeting transcription quality by providing better audio?

Yes, indirectly. Using live_audio_required to receive per-speaker streams means each transcription request processes a single clean voice rather than a mixed signal. This alone significantly improves accuracy for multi-speaker meetings because the model handles single-speaker audio much better than mixed audio. For post-call transcription, the MeetStream recording pipeline captures the meeting audio directly from the platform's mix, which is higher quality than screen-recording approaches.