Speech-to-Text API Integration for Meeting Bots

You've got audio out of the meeting. Now you need words. The gap between those two things (raw PCM audio on one side, accurate speaker-attributed text on the other) is where most meeting intelligence products win or lose on quality. Pick the wrong provider for your use case, configure it incorrectly, or skip speaker diarization, and your downstream features (summaries, action items, coaching signals) are operating on garbage input.

The speech-to-text API landscape for meeting bots is crowded, and marketing claims are not reliable guides to actual performance on meeting audio. Meeting recordings are acoustically hostile: multiple speakers, background noise, domain-specific vocabulary (product names, technical jargon, acronyms), variable audio quality from different microphones, and participants who talk over each other. A provider that scores 95% word accuracy on clean podcast audio may drop to 78% on a noisy sales call with two people talking at once.

The configuration choices matter as much as the provider. Streaming transcription gives you words in near real-time (under 2 seconds of latency), but post-call transcription is typically 15-25% more accurate because the provider has the full audio context, can do better language model rescoring, and isn't racing against a latency budget. Most meeting products want both: streaming for live features (real-time suggestions, live captions) and post-call for the authoritative transcript.

In this guide, we'll cover the major STT provider options, streaming versus post-call tradeoffs, accuracy factors specific to meeting audio, how MeetStream's recording_config.transcript.provider abstracts provider selection, and working code examples for both approaches. Let's get into it.

STT Provider Comparison for Meeting Audio

MeetStream supports four meeting transcription API providers, each with different accuracy, latency, and cost characteristics:

| Provider | Mode | Diarization | Best For |
| --- | --- | --- | --- |
| AssemblyAI (nova-2) | Post-call + streaming | Yes, speaker labels | High accuracy, English-heavy content |
| Deepgram (nova-3) | Post-call + streaming | Yes, diarize param | Low latency, multi-language |
| JigsawStack | Post-call | Auto-language detect | Multilingual, auto language |
| meeting_captions | Post-call only | Platform-native | Quick setup, no provider config |

meeting_captions uses the platform's built-in captioning (Google Meet's auto-captions, Zoom's closed captions) and requires no STT provider credentials. Accuracy is lower than dedicated STT providers, but there is zero setup, which makes it useful for prototyping.
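Given the provider blocks shown later in this guide, a meeting_captions configuration would presumably reduce to just the provider name. Treat this fragment as a sketch based on that pattern, not a confirmed example:

```json
"recording_config": {
  "transcript": {
    "provider": {
      "name": "meeting_captions"
    }
  }
}
```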

For production meeting intelligence applications: Deepgram nova-3 for streaming (lowest latency, good accuracy for live features) and AssemblyAI for post-call (best accuracy for the authoritative record). These two in combination cover most use cases.

Deepgram API playground for integrating speech-to-text in meeting bots. Source: Deepgram.

Post-Call Transcription Configuration

Configure the STT provider in the recording_config.transcript.provider field when creating the bot:

POST https://api.meetstream.ai/api/v1/bots/create_bot
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "meeting_link": "https://meet.google.com/abc-defg-hij",
  "bot_name": "Transcriber",
  "callback_url": "https://yourapp.com/webhooks/meetstream",
  "recording_config": {
    "transcript": {
      "provider": {
        "name": "assemblyai",
        "api_key": "YOUR_ASSEMBLYAI_KEY",
        "speaker_labels": true,
        "language_code": "en_us"
      }
    },
    "retention": {
      "type": "timed",
      "hours": 168
    }
  }
}

For Deepgram nova-3 post-call:

"provider": {
  "name": "deepgram",
  "api_key": "YOUR_DEEPGRAM_KEY",
  "model": "nova-3",
  "diarize": true,
  "smart_format": true
}

When the transcript is ready, MeetStream fires a transcription.processed webhook. Fetch the result from the transcript_url in the payload:

import urllib.request, json, os

def fetch_transcript(transcript_url: str) -> dict:
    req = urllib.request.Request(
        transcript_url,
        headers={"Authorization": f"Token {os.environ['MEETSTREAM_API_KEY']}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Called from your transcription.processed webhook handler
def handle_transcript_ready(event: dict):
    transcript = fetch_transcript(event['transcript_url'])
    # transcript['utterances'] contains speaker-attributed segments
    for utterance in transcript.get('utterances', []):
        speaker = utterance['speaker']
        text = utterance['text']
        start_ms = utterance['start']
        print(f"[{start_ms}ms] {speaker}: {text}")

Streaming Transcription Configuration

Streaming transcription delivers partial and final transcript segments in near real-time via webhook as the meeting progresses. Configure it with live_transcription_required:

POST https://api.meetstream.ai/api/v1/bots/create_bot
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "meeting_link": "https://meet.google.com/abc-defg-hij",
  "bot_name": "LiveTranscriber",
  "live_transcription_required": {
    "webhook_url": "https://yourapp.com/webhooks/live-transcript",
    "provider": {
      "name": "deepgram_streaming",
      "api_key": "YOUR_DEEPGRAM_KEY",
      "model": "nova-3",
      "diarize": true,
      "language": "en-US",
      "interim_results": true
    }
  }
}

Using OpenAI Whisper API for post-call transcription. Source: Datatas.

Your webhook handler receives two types of segments: interim (partial, may change) and final (confirmed). Handle them differently: show interim results in the UI, but only persist final results to your database:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhooks/live-transcript', methods=['POST'])
def handle_live_transcript():
    event = request.json
    segment = event.get('segment', {})

    if segment.get('is_final'):
        # Persist to database
        save_transcript_segment(
            bot_id=event['bot_id'],
            speaker_id=segment.get('speaker_id'),
            speaker_name=segment.get('speaker_name'),
            text=segment['text'],
            start_ms=segment['start_ms'],
            end_ms=segment['end_ms']
        )
    else:
        # Push to live UI via WebSocket/SSE
        push_to_live_display(
            bot_id=event['bot_id'],
            text=segment['text'],
            speaker_name=segment.get('speaker_name')
        )

    return jsonify({'received': True}), 200
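The handler above assumes two helper functions. A minimal sketch of what they might look like, with an in-memory dict standing in for a real database and a print call standing in for a WebSocket/SSE broadcast (both function names come from the example above, not from MeetStream's API):

```python
# In-memory stand-in for a real database table of final transcript segments.
TRANSCRIPT_STORE: dict[str, list[dict]] = {}

def save_transcript_segment(bot_id, speaker_id, speaker_name, text, start_ms, end_ms):
    """Append a final segment for this bot, keeping segments ordered by start time."""
    segments = TRANSCRIPT_STORE.setdefault(bot_id, [])
    segments.append({
        "speaker_id": speaker_id,
        "speaker_name": speaker_name,
        "text": text,
        "start_ms": start_ms,
        "end_ms": end_ms,
    })
    segments.sort(key=lambda s: s["start_ms"])

def push_to_live_display(bot_id, text, speaker_name):
    """Stand-in for pushing an interim segment to connected clients via WebSocket/SSE."""
    print(f"[live:{bot_id}] {speaker_name or 'Unknown'}: {text}")
```

Because interim segments are never persisted, your database only ever contains confirmed text, and out-of-order webhook delivery is handled by the sort on start time.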

Accuracy Factors for Meeting Audio

Several factors significantly affect STT integration accuracy on meeting recordings, beyond what provider benchmarks measure:

Speaker diarization quality. Diarization (distinguishing which speaker said what) is harder than transcription. It fails most often on speakers with similar voice characteristics, overlapping speech (both speakers talking simultaneously), short utterances under 2 seconds, and the first 5-10 seconds of a meeting before the model has enough audio to build speaker profiles. Enable diarization, but expect 5-10% speaker attribution errors even from the best providers.
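If your platform integration exposes per-participant speaking intervals, you can map anonymous diarization labels (SPEAKER_0, SPEAKER_1) to names by voting on time overlap. A sketch under that assumption (the speaking_events shape here is hypothetical, not a MeetStream payload):

```python
from collections import defaultdict

def map_speaker_labels(utterances, speaking_events):
    """Vote on a label -> name mapping by time overlap.

    utterances: diarized segments, e.g. {"speaker": "SPEAKER_0", "start": ms, "end": ms}
    speaking_events: hypothetical platform intervals, e.g. {"name": "Alice", "start": ms, "end": ms}
    """
    votes = defaultdict(lambda: defaultdict(int))
    for u in utterances:
        for ev in speaking_events:
            # Overlap in ms between the utterance and the platform speaking interval
            overlap = min(u["end"], ev["end"]) - max(u["start"], ev["start"])
            if overlap > 0:
                votes[u["speaker"]][ev["name"]] += overlap
    # Pick the name with the most overlapping milliseconds for each label
    return {label: max(names, key=names.get) for label, names in votes.items()}
```

Overlap voting is robust to small timestamp drift between the STT provider's clock and the platform's, which is why it beats matching on exact boundaries.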

Domain-specific vocabulary. Meeting audio often contains product names, company names, acronyms, and technical terms that aren't in standard language models. Both AssemblyAI and Deepgram support custom vocabulary lists. Provide a wordlist of your users' domain terms and watch accuracy on those terms improve significantly:

# AssemblyAI custom vocabulary
"provider": {
  "name": "assemblyai",
  "api_key": "...",
  "word_boost": [
    "MeetStream", "HubSpot", "LangChain", "WebRTC",
    "Deepgram", "AssemblyAI", "diarization", "webhook"
  ],
  "boost_param": "high"
}
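Deepgram offers an analogous mechanism: keyword boosting on earlier models, key term prompting on nova-3. Assuming MeetStream passes provider parameters through unchanged, as the other examples suggest, a sketch might look like this; verify the exact parameter name against Deepgram's current documentation:

```json
"provider": {
  "name": "deepgram",
  "api_key": "YOUR_DEEPGRAM_KEY",
  "model": "nova-3",
  "diarize": true,
  "keyterm": ["MeetStream", "HubSpot", "LangChain", "WebRTC"]
}
```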

Audio quality. Compressed audio (Opus at low bitrate) loses high-frequency phoneme information. Meeting platform audio at default quality settings is Opus at 32-64 kbps: acceptable for speech, but not as good as the 128 kbps typical of podcast recordings. You can't change platform compression, but you can enable noise suppression in your capture pipeline if you're processing audio before forwarding.

STT API provider comparison for real-time meeting transcription. Source: Gladia.

Language detection. For multilingual teams or international calls, JigsawStack's automatic language detection removes the need to specify the language upfront. For single-language deployments, always specify the language explicitly; automatic detection adds a few hundred milliseconds of latency and occasionally misidentifies the language for short segments.
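Following the pattern of the other provider blocks in this guide, a JigsawStack configuration would presumably need no language field at all. The fragment below is an assumption based on that pattern, not a documented example:

```json
"provider": {
  "name": "jigsawstack",
  "api_key": "YOUR_JIGSAWSTACK_KEY"
}
```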

Streaming vs Post-Call: When to Use Each

| Use Case | Recommended Mode | Reason |
| --- | --- | --- |
| Live captions display | Streaming | Sub-2s latency required |
| Real-time coaching prompts | Streaming | Must fire during the conversation |
| Meeting summary generation | Post-call | Full context improves accuracy |
| CRM note creation | Post-call | Accuracy matters more than speed |
| Compliance recording | Post-call | Authoritative, timestamped record |
| Action item extraction | Post-call (+ streaming for preview) | Both useful at different latencies |
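Products that use both modes typically treat live segments as provisional and swap in the post-call transcript once it arrives. A minimal reconciliation sketch (the record shapes mirror the examples earlier in this guide but are illustrative, not part of the MeetStream API):

```python
def reconcile_transcripts(live_segments, postcall_utterances):
    """Prefer the post-call transcript as the authoritative record;
    fall back to accumulated live segments if post-call processing failed."""
    if postcall_utterances:
        # Normalize post-call utterances to a common segment shape
        return [
            {"speaker": u["speaker"], "text": u["text"], "start_ms": u["start"]}
            for u in postcall_utterances
        ]
    # Post-call transcript unavailable: keep the streamed final segments
    return [
        {"speaker": s.get("speaker_name") or s.get("speaker_id"),
         "text": s["text"], "start_ms": s["start_ms"]}
        for s in live_segments
    ]
```

Keeping the streamed segments as a fallback means a post-call processing failure degrades transcript quality rather than losing the meeting entirely.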

How MeetStream Fits

MeetStream's recording_config.transcript.provider field abstracts provider selection entirely. Switch between AssemblyAI, Deepgram nova-3, and JigsawStack by changing a single field: no provider SDK to maintain, no audio forwarding to implement. Both streaming (deepgram_streaming, assemblyai_streaming) and post-call modes are supported. See the transcript configuration reference for all provider options and parameters.

Conclusion

Choosing the right speech-to-text API configuration for meeting bots means understanding three variables: provider accuracy on meeting-specific audio (not clean benchmarks), streaming versus post-call tradeoffs (latency vs accuracy), and the accuracy factors you can control (custom vocabulary, diarization settings, explicit language specification). Deepgram nova-3 for streaming and AssemblyAI for post-call is a solid default combination for most meeting intelligence products. MeetStream's provider abstraction lets you switch between them without changing your capture or processing code. Get started free at meetstream.ai.

Frequently Asked Questions

What is the difference between streaming and post-call speech-to-text for meetings?

Streaming transcription delivers partial and final transcript segments in near real-time (typically under 2 seconds of latency) as the meeting progresses, enabling live features like captions and real-time coaching. Post-call transcription processes the complete audio after the meeting ends, producing a more accurate transcript because the model has full context and no latency constraints; on the same audio, its word error rate is typically 15-25% lower than streaming's.

How does speaker diarization work in meeting transcription?

Speaker diarization assigns speaker labels to transcript segments by analyzing voice characteristics (pitch, timbre, speaking cadence) to distinguish different speakers. It doesn't identify speakers by name automatically; you map speaker labels (e.g., SPEAKER_0, SPEAKER_1) to participant names from the meeting platform's participant list. Accuracy degrades when speakers have similar voices, when multiple people speak simultaneously, or on short utterances.

Which speech-to-text provider is most accurate for sales call transcription?

AssemblyAI typically leads for English-language sales calls where accuracy is the priority, especially with custom vocabulary configured for product names and technical terms. Deepgram nova-3 is competitive and significantly faster for streaming use cases. The most reliable approach is to A/B test both on a sample of your actual meeting recordings; provider benchmarks on academic datasets don't reliably predict accuracy on domain-specific business conversations.

How do I handle multilingual meetings in transcription?

JigsawStack's automatic language detection handles multilingual content without requiring you to specify the language upfront. For meetings in a single known language, specify it explicitly in your provider config; automatic detection adds latency and can misidentify short segments. If you need per-utterance language detection (for code-switching within a meeting), check your provider's documentation for word-level language identification features.

What is the speech-to-text Python integration pattern for meeting bots?

The standard speech-to-text Python integration with MeetStream uses the recording_config.transcript.provider API field to configure the provider at bot creation time, then fetches the transcript from the transcript_url in the transcription.processed webhook event using an authenticated GET request. You never call the STT provider API directly, MeetStream handles the provider integration and returns normalized transcript JSON.