Real-Time vs Post-Call Meeting Transcription: Which to Build?
The first question every developer asks when building a meeting transcription product is: do I need results during the meeting or after? The answer determines your entire architecture. These are not two implementations of the same thing. They use different API parameters, different webhook flows, different accuracy characteristics, and different cost models.
Real-time transcription gives you text as speech happens. You can display live captions, trigger in-meeting agents, coach sales reps mid-call, or detect keywords that change the meeting flow. End-to-end latency from speech to finalized text is typically 1 to 3 seconds.
Post-call transcription runs after the meeting ends. You get higher accuracy because the model processes the complete audio with full context, no streaming constraints, and a deterministic result. You also pay for one transcription job per meeting rather than running inference throughout the entire call.
Neither approach is universally better. The right choice depends on what your application needs to do and when it needs to do it. Most production systems end up using both: real-time for in-meeting features, post-call for the authoritative record that downstream systems rely on.
In this guide we cover the technical implementation of both approaches using the MeetStream API, the accuracy and latency tradeoffs in concrete terms, cost implications, and a decision matrix for common use cases. Let's get into it.
How Real-Time Transcription Works
MeetStream's real-time transcription uses the live_transcription_required parameter. When you include this in a bot creation request, the MeetStream API streams transcription results to your webhook URL as the meeting progresses. Each payload includes incremental text, a word_is_final flag, and an end_of_turn flag that signals utterance boundaries.
Two streaming providers are available. Deepgram Streaming uses the nova-2 model and delivers results with approximately 300 to 500ms latency from speech end to webhook delivery. AssemblyAI Streaming uses the universal-streaming-english model with similar latency characteristics. Both send interim results (partial words as they are being spoken) followed by final results when the word is confirmed.

```python
import requests

# Real-time transcription with Deepgram Streaming
bot_payload = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Live Transcriber",
    "live_transcription_required": {
        "webhook_url": "https://your-server.com/live-transcript",
        "provider": "deepgram_streaming",
        "model": "nova-2"
    }
}

response = requests.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    json=bot_payload,
    headers={"Authorization": "Token YOUR_API_KEY"}
)
bot_id = response.json()["bot_id"]
print(f"Bot created: {bot_id}")
```
The webhook handler receives payloads throughout the meeting. The end_of_turn field is the key signal for when an utterance is complete:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Buffer for building utterances from streaming words, keyed by bot_id
utterance_buffer = {}

@app.route("/live-transcript", methods=["POST"])
def handle_live_transcript():
    payload = request.json
    bot_id = payload.get("bot_id")
    speaker = payload.get("speakerName", "Unknown")
    word_is_final = payload.get("word_is_final", False)
    end_of_turn = payload.get("end_of_turn", False)
    transcript = payload.get("transcript", "")
    new_text = payload.get("new_text", "")

    if end_of_turn and word_is_final:
        # Complete utterance available
        print(f"{speaker}: {transcript}")
        # Trigger downstream logic here:
        # - CRM activity logging
        # - Keyword detection for sales coaching
        # - Real-time summarization
        utterance_buffer[bot_id] = []
    elif word_is_final and not end_of_turn:
        # Word confirmed but utterance still in progress:
        # accumulate it and update the partial display
        utterance_buffer.setdefault(bot_id, []).append(new_text)
    else:
        # Interim partial word, update live caption display
        pass

    return jsonify({"status": "ok"})
```
Real-Time Latency Deep Dive
Understanding the latency components helps you set realistic expectations and debug when latency is higher than expected. The chain is: speech occurs in meeting, audio is captured by MeetStream bot, audio is streamed to transcription provider, transcription model runs, result is posted to your webhook, your server processes it.
Each hop adds latency. In practice, the dominant latency source is the end-of-utterance detection: the streaming model needs to detect that a speaker has stopped speaking before it can finalize the transcript for that utterance. This detection typically waits 200 to 400ms of silence after the last word before triggering. Combined with network hops, total end-to-end latency from speech to your webhook receiving a finalized utterance is typically 1 to 3 seconds under normal conditions.
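Writing the chain down with rough numbers makes the budget concrete. The individual hop values below are illustrative assumptions for planning, not measured MeetStream figures; only the 200 to 400ms endpointing window and the 1 to 3 second total come from the behavior described above:

```python
# Illustrative speech-to-webhook latency budget (milliseconds).
# Hop values are planning assumptions, not measurements.
latency_budget_ms = {
    "bot_audio_capture": 50,
    "stream_to_provider": 100,
    "model_inference": 300,
    "end_of_utterance_silence": 400,  # dominant term: endpointing wait
    "webhook_delivery": 150,
}

total_ms = sum(latency_budget_ms.values())
print(f"Budgeted speech-to-webhook latency: {total_ms} ms")
```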
For applications that need lower latency, work with word_is_final: true events rather than waiting for end_of_turn. You can display finalized words as they arrive and update your UI incrementally rather than waiting for the complete utterance.
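The two flags compose into a simple buffering scheme: commit words on `word_is_final`, overwrite the interim tail otherwise, and flush on `end_of_turn`. A minimal sketch — the payload field names follow the handler above, but the class itself is illustrative, not part of any MeetStream SDK:

```python
class CaptionBuffer:
    """Incremental caption state for one bot's live transcript stream."""

    def __init__(self):
        self.committed = []  # words confirmed by word_is_final
        self.pending = ""    # latest interim text, may still change

    def apply(self, payload: dict):
        """Update state from one webhook payload.

        Returns the completed utterance on end_of_turn, else None.
        """
        text = payload.get("new_text", "")
        if payload.get("word_is_final"):
            if text:
                self.committed.append(text)
            self.pending = ""
        else:
            self.pending = text  # interim word: overwrite, don't append
        if payload.get("end_of_turn"):
            utterance = " ".join(self.committed)
            self.committed = []
            return utterance
        return None

    def display(self) -> str:
        """Current caption line: confirmed words plus interim tail."""
        tail = [self.pending] if self.pending else []
        return " ".join(self.committed + tail)
```

Render `display()` on every payload for the lowest-latency caption, and treat the return value of `apply()` as the utterance-complete event.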
How Post-Call Transcription Works
Post-call transcription uses recording_config.transcript. The bot records the meeting, the recording is stored, and transcription runs after the meeting ends. You receive a transcription.processed webhook when the transcript is ready. Then you fetch it from the transcript endpoint.

```python
import requests

# Post-call transcription with AssemblyAI
bot_payload = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Meeting Recorder",
    "recording_config": {
        "transcript": {
            "provider": "assemblyai",
            "speech_models": ["universal-2"],
            "speaker_labels": True
        }
    }
}

response = requests.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    json=bot_payload,
    headers={"Authorization": "Token YOUR_API_KEY"}
)
bot_id = response.json()["bot_id"]

# Webhook handler for post-call flow
@app.route("/webhook", methods=["POST"])
def handle_webhook():
    payload = request.json
    event = payload.get("event")
    if event == "transcription.processed":
        transcript_id = payload.get("transcript_id")
        bot_id = payload.get("bot_id")
        # Fetch the complete transcript
        resp = requests.get(
            f"https://api.meetstream.ai/api/v1/transcript/{transcript_id}/get_transcript",
            headers={"Authorization": "Token YOUR_API_KEY"}
        )
        transcript_data = resp.json()
        # Process the complete, accurate transcript
        process_meeting_record(bot_id, transcript_data)
    return jsonify({"status": "ok"})

def process_meeting_record(bot_id: str, transcript_data: dict) -> None:
    words = transcript_data.get("words", [])
    utterances = transcript_data.get("utterances", [])
    print(f"Meeting {bot_id}: {len(words)} words, {len(utterances)} utterances")
```
Accuracy Comparison
Post-call transcription is consistently more accurate than real-time streaming transcription. The accuracy gap exists for several reasons. Streaming models must process audio in short windows without future context. Post-call models can run bidirectional attention over the full audio. Streaming models trade accuracy for speed by using smaller, faster architectures. Post-call models run their full-size production models with no latency constraint.
In practice, for clean meeting audio, the WER difference between streaming and post-call is typically 2 to 8 percentage points. For noisy or technically complex audio, the gap can be 15 or more percentage points. If the accuracy of the stored transcript matters for downstream tasks like search, CRM population, or compliance, use post-call transcription for the archival record even if you also run streaming for real-time features.
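To measure the gap on your own audio, word error rate is straightforward to compute against a reference transcript. A minimal word-level implementation with no external dependencies (a library like jiwer adds text normalization on top of this, which matters for fair comparisons):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the reference word count, via Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Run it with a human-corrected transcript as the reference and each provider's output as the hypothesis, and the streaming-vs-batch gap for your actual audio becomes a number rather than a guess.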
Cost Implications
Streaming transcription runs the entire duration of the meeting. If your meeting is 60 minutes, you are billed for 60 minutes of streaming transcription. Post-call transcription also runs on the full audio, but as a single batch job. The billing rates may differ by provider, but the audio processed is the same either way.
Where cost differences appear in practice: if you run both streaming and post-call transcription simultaneously, you pay roughly double the transcription cost. For products where post-call accuracy is sufficient, running only post-call transcription is the lower-cost option. For products that need both, ensure the real-time features justify the additional cost.
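A back-of-envelope calculator makes the comparison concrete. The per-minute rates below are hypothetical placeholders, not MeetStream or provider pricing — substitute your negotiated rates:

```python
# Hypothetical per-minute rates for illustration only.
STREAMING_RATE = 0.0100  # $/min of streaming transcription
BATCH_RATE = 0.0080      # $/min of post-call batch transcription

def monthly_cost(minutes: float, streaming: bool, post_call: bool) -> float:
    """Transcription spend for a month of meeting audio."""
    cost = 0.0
    if streaming:
        cost += minutes * STREAMING_RATE
    if post_call:
        cost += minutes * BATCH_RATE
    return round(cost, 2)

# 1,000 hours of meetings per month = 60,000 minutes
minutes = 60_000
print(monthly_cost(minutes, streaming=False, post_call=True))  # post-call only
print(monthly_cost(minutes, streaming=True, post_call=True))   # both in parallel
```

At these placeholder rates, running both is a bit over twice the post-call-only spend, which is the question to put to the real-time features: do they earn that delta?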
A common optimization: run streaming transcription during the meeting for live features, but do not store the streaming results as the archival record. Wait for the post-call transcript and use that as the canonical version. This gives you in-meeting functionality without sacrificing archival accuracy.
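A sketch of that "provisional then canonical" pattern, with an in-memory dict standing in for your datastore — the `source` field and helper names here are illustrative, not part of the MeetStream API:

```python
store: dict = {}  # bot_id -> meeting record; a DB table in production

def save_live_utterance(bot_id: str, speaker: str, text: str) -> None:
    """Append a streaming utterance as a provisional record."""
    rec = store.setdefault(bot_id, {"source": "live", "utterances": []})
    if rec["source"] == "live":  # never append live text over a final record
        rec["utterances"].append({"speaker": speaker, "text": text})

def promote_post_call(bot_id: str, transcript_data: dict) -> None:
    """Replace the provisional live record with the canonical
    post-call transcript once transcription.processed fires."""
    store[bot_id] = {
        "source": "post_call",
        "utterances": transcript_data.get("utterances", []),
    }
```

The guard in `save_live_utterance` matters because live webhooks can still be in flight when the post-call transcript lands; a late streaming payload must not corrupt the canonical record.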

Decision Matrix
| Use Case | Real-Time | Post-Call | Both |
|---|---|---|---|
| Live captions for participants | Required | | |
| Real-time sales coaching | Required | | |
| In-meeting keyword alerts | Required | | |
| Post-meeting summary generation | | Preferred | |
| CRM note population | | Preferred | |
| Compliance archiving | | Required | |
| Action item extraction | | Preferred | |
| Full meeting intelligence product | | | Yes |
| Sales coaching platform | | | Yes |
Running Both in Parallel
Configure the bot with both live_transcription_required and recording_config.transcript to run both simultaneously. The live stream powers in-meeting features; the post-call transcript becomes the canonical record.
```python
# Running real-time and post-call transcription simultaneously
bot_payload = {
    "meeting_link": "https://meet.google.com/abc-defg-hij",
    "bot_name": "Full Intelligence Bot",
    "live_transcription_required": {
        "webhook_url": "https://your-server.com/live-transcript",
        "provider": "deepgram_streaming",
        "model": "nova-2"
    },
    "recording_config": {
        "transcript": {
            "provider": "assemblyai",
            "speech_models": ["universal-2"],
            "speaker_labels": True
        }
    }
}

response = requests.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    json=bot_payload,
    headers={"Authorization": "Token YOUR_API_KEY"}
)
```
FAQ
What is the latency of real time transcription with the MeetStream API?
Using Deepgram nova-2 streaming or AssemblyAI universal-streaming-english, end-to-end latency from speech to finalized webhook delivery is typically 1 to 3 seconds under normal network conditions. Interim partial words arrive faster (200 to 500ms) but are not yet finalized. The word_is_final and end_of_turn flags in the webhook payload tell you which state you are in.
Is real time vs post call transcription accuracy difference large enough to matter?
For clean audio, the difference is 2 to 8 WER percentage points, which is small enough that either approach works for most NLP tasks. For noisy or technical audio, the gap can reach 15 or more percentage points, which is significant. If your downstream tasks (summarization, action item extraction, search indexing) are sensitive to transcript quality, use post-call for the stored record. See improving transcription accuracy for WER measurement approaches.
Does streaming transcription api cost more than post-call?
It depends on the provider's billing model. For some providers, streaming is billed at the same rate as batch but runs for the full meeting duration, so total cost is equivalent. For others, streaming carries a premium rate. If cost is a concern, compare the provider's streaming vs batch pricing and factor in whether you need in-meeting functionality at all. Running only post-call transcription is always the lower-cost option when real-time features are not required.
Can I combine live meeting transcription with post-call for the same bot?
Yes, and this is a common production pattern. Set both live_transcription_required and recording_config.transcript in the same bot creation request. The two systems operate independently: the live stream fires webhooks throughout the meeting; the post-call transcript becomes available after transcription.processed fires. Use the live stream for in-meeting features and discard it after the meeting; use the post-call transcript as the authoritative stored record.
What is the end_of_turn field used for in live meeting transcription?
The end_of_turn field signals that the current speaker has finished their utterance. It fires after a silence threshold (typically 200 to 400ms of no speech) following the last word. Use this event to trigger utterance-level logic: CRM logging, keyword detection, or displaying a completed caption bubble. Do not process words individually for these tasks; wait for end_of_turn: true with word_is_final: true to get the complete, confirmed utterance text.
How long does post-call transcription take after the meeting ends?
Post-call transcription processes the full audio file after the call ends and typically completes in 0.1x to 0.5x the meeting duration, so a 60-minute meeting takes roughly 6 to 30 minutes to transcribe.
