Zoom Meeting Recording API: Capture Audio and Video
Getting audio and video out of a Zoom meeting is surprisingly non-trivial. Zoom's own APIs require elevated permissions, Server-to-Server OAuth credentials, and a complex consent model that assumes you're the account owner. If you want per-participant audio streams, separate video tracks, or real-time audio access during the meeting, Zoom's native developer tools don't offer a clean path.
The practical alternative is a meeting bot that joins as a participant and captures streams from inside the call. This approach works regardless of who hosts the meeting, requires no elevated Zoom account permissions, and gives you access to both post-call artifacts and real-time audio streams.
This guide covers both modes: post-call recording (video_required plus recording_config) for complete meeting capture, and real-time audio streaming (live_audio_required with a WebSocket) for applications that need to process audio as the meeting happens. Both use the same MeetStream recording API and the same bot deployment model.
We'll walk through post-call video and audio capture, the real-time WebSocket stream and its binary frame format, a Python decoder for the PCM stream, and how to choose between the two approaches for different use cases. Let's get into it.
Post-call video and audio capture
Post-call recording captures the full meeting and makes audio and video artifacts available via API endpoints after the meeting ends. This is the right approach when you need the complete recording for storage, compliance, summarization, or transcription.
Set video_required: true to capture both audio and video. The recording_config block controls transcription settings and retention:
curl -X POST https://api.meetstream.ai/api/v1/bots/create_bot \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "meeting_link": "https://zoom.us/j/123456789",
    "bot_name": "Meeting Recorder",
    "callback_url": "https://yourapp.com/webhooks/meetstream",
    "video_required": true,
    "recording_config": {
      "transcript": {
        "provider": {
          "name": "deepgram"
        }
      },
      "retention": {
        "type": "timed",
        "hours": 72
      }
    }
  }'

When the meeting ends, three asynchronous events fire in sequence: audio.processed, video.processed, and transcription.processed. The audio and video events fire before transcription, since transcription depends on the audio artifact.

Once video.processed fires, the MP4 file is downloadable:
curl -X GET https://api.meetstream.ai/api/v1/bots/BOT_ID/video \
  -H "Authorization: Token YOUR_API_KEY" \
  --output meeting_recording.mp4

Audio is available separately as an MP3 file at GET /bots/{bot_id}/audio. In many workflows, you only need audio plus transcript, not the full video. Setting video_required: false skips video capture and reduces both processing time and storage requirements. The video.processed event won't fire, but audio.processed and transcription.processed proceed normally.
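Retrieval from either endpoint is a plain authenticated GET, so a small download helper covers both artifacts. The sketch below uses only the Python standard library; the function names are our own, not part of the API:

```python
import shutil
import urllib.request

API_BASE = "https://api.meetstream.ai/api/v1"

def artifact_url(bot_id: str, kind: str) -> str:
    # kind is "audio" (MP3) or "video" (MP4), matching the endpoints above
    return f"{API_BASE}/bots/{bot_id}/{kind}"

def download_artifact(bot_id: str, kind: str, api_key: str, dest: str) -> None:
    """Stream one post-call artifact to disk (sketch; no retries or error handling)."""
    req = urllib.request.Request(
        artifact_url(bot_id, kind),
        headers={"Authorization": f"Token {api_key}"},
    )
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks rather than buffering in memory

# Example, after the video.processed event has fired:
# download_artifact("bot_abc123", "video", "YOUR_API_KEY", "meeting_recording.mp4")
```

Only call this after the corresponding processed webhook has fired; before that, the artifact may not be finalized.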
The video.processed webhook payload
Your callback URL receives this payload when video processing completes:

{
  "event": "video.processed",
  "bot_id": "bot_abc123",
  "transcript_id": "txn_xyz789",
  "timestamp": "2026-04-02T15:08:30Z"
}

After receiving this event, the video endpoint returns the MP4 file. The video is the composite recording from the bot's perspective inside the meeting: the active speaker view with audio mixed down to a single track. It's equivalent to what a human participant would see and hear.
Real-time audio streaming via WebSocket
For applications that need to process audio as the meeting progresses, live audio streaming via WebSocket is the path. This is the foundation for real-time transcription, live meeting intelligence, voice agent interactions, and anything that needs to act on what's being said before the meeting ends.
Enable it with the live_audio_required parameter, passing your WebSocket server URL:
curl -X POST https://api.meetstream.ai/api/v1/bots/create_bot \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "meeting_link": "https://zoom.us/j/123456789",
    "bot_name": "Live Monitor",
    "live_audio_required": {
      "websocket_url": "wss://yourapp.com/ws/audio"
    }
  }'

Once the bot is in the meeting, it connects to your WebSocket URL and begins streaming binary frames. Each frame contains a header with speaker metadata followed by raw PCM audio data.
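The same request expressed in Python, for applications that create bots programmatically. A standard-library sketch mirroring the curl body above, with no retry or error handling:

```python
import json
import urllib.request

def live_audio_payload(meeting_link: str, websocket_url: str) -> dict:
    # Mirrors the create_bot body shown above
    return {
        "meeting_link": meeting_link,
        "bot_name": "Live Monitor",
        "live_audio_required": {"websocket_url": websocket_url},
    }

def create_bot(api_key: str, payload: dict) -> dict:
    """POST the payload to create_bot and return the parsed JSON response."""
    req = urllib.request.Request(
        "https://api.meetstream.ai/api/v1/bots/create_bot",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Make sure your WebSocket server is up before creating the bot, since the bot connects out to your URL as soon as it joins the meeting.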

Binary frame format: decoding the PCM stream
The audio stream uses a binary frame format. Each frame has a fixed-size header containing metadata about the audio, followed by the raw audio payload.
The audio specification is: PCM16 little-endian, 48kHz sample rate, mono channel. Each frame header contains msg_type (frame type identifier), speaker_id (unique participant identifier), and speaker_name (display name of the speaking participant).
Here's a Python WebSocket server that receives and decodes the audio stream:
import asyncio
import struct
import wave
from websockets.server import serve
from collections import defaultdict

# Audio spec constants
SAMPLE_RATE = 48000
SAMPLE_WIDTH = 2  # 16-bit PCM
CHANNELS = 1      # mono

# Per-speaker audio buffers
speaker_buffers: dict[str, bytearray] = defaultdict(bytearray)

def parse_frame_header(data: bytes) -> tuple[int, str, str, bytes]:
    """
    Parse the binary frame header.
    Returns: (msg_type, speaker_id, speaker_name, audio_payload)
    Header layout:
    - 1 byte: msg_type
    - 4 bytes: speaker_id length (uint32, little-endian)
    - N bytes: speaker_id (UTF-8)
    - 4 bytes: speaker_name length (uint32, little-endian)
    - M bytes: speaker_name (UTF-8)
    - Remaining: PCM audio data
    """
    offset = 0
    msg_type = data[offset]
    offset += 1
    speaker_id_len = struct.unpack_from('<I', data, offset)[0]
    offset += 4
    speaker_id = data[offset:offset + speaker_id_len].decode('utf-8')
    offset += speaker_id_len
    speaker_name_len = struct.unpack_from('<I', data, offset)[0]
    offset += 4
    speaker_name = data[offset:offset + speaker_name_len].decode('utf-8')
    offset += speaker_name_len
    return msg_type, speaker_id, speaker_name, data[offset:]

async def handle_audio(websocket):
    async for message in websocket:
        if not isinstance(message, bytes):
            continue  # ignore any non-binary messages
        msg_type, speaker_id, speaker_name, pcm = parse_frame_header(message)
        speaker_buffers[speaker_id].extend(pcm)

def save_speaker_wav(speaker_id: str, path: str) -> None:
    """Write one speaker's buffered PCM to a playable WAV file."""
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(speaker_buffers[speaker_id]))

async def main():
    async with serve(handle_audio, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())

The key point is that audio is separated by speaker. Each frame identifies which participant is speaking, so you can buffer or process audio per speaker rather than working with a mixed stream. This matters for applications like per-speaker volume normalization, individual speaker models, or routing different speakers to different processing pipelines.
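To exercise a decoder locally before pointing a real bot at it, you can synthesize frames in the same layout. The builder below is our own test helper, not part of the API; it packs the header described above (1-byte msg_type, two length-prefixed UTF-8 strings, then raw PCM):

```python
import struct

def build_frame(msg_type: int, speaker_id: str, speaker_name: str, pcm: bytes) -> bytes:
    """Pack a frame in the header layout described above (test helper)."""
    sid = speaker_id.encode("utf-8")
    name = speaker_name.encode("utf-8")
    return (
        bytes([msg_type])
        + struct.pack("<I", len(sid)) + sid
        + struct.pack("<I", len(name)) + name
        + pcm
    )

# Spot-check the layout against the spec: offsets are
# 0: msg_type, 1-4: id length, then id bytes, name length, name bytes, PCM.
frame = build_frame(1, "p42", "Alice", b"\x00\x01\x02\x03")
assert frame[0] == 1
assert struct.unpack_from("<I", frame, 1)[0] == 3   # speaker_id length
assert frame[5:8] == b"p42"                          # speaker_id bytes
assert struct.unpack_from("<I", frame, 8)[0] == 5    # speaker_name length
assert frame[12:17] == b"Alice"                      # speaker_name bytes
assert frame[17:] == b"\x00\x01\x02\x03"             # PCM payload
```

Feeding synthetic frames like these through your WebSocket handler is a cheap way to validate the parser and per-speaker buffering without joining a live meeting.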
Combining real-time audio with post-call recording
You can use both live_audio_required and recording_config in the same create_bot request. The real-time WebSocket stream delivers audio as the meeting happens. The post-call recording produces a complete MP4 and transcript after the meeting ends. This combination is useful when you want live features (real-time coaching, in-meeting alerts) and also a permanent archive.
{
  "meeting_link": "https://zoom.us/j/123456789",
  "bot_name": "Full Capture Bot",
  "callback_url": "https://yourapp.com/webhooks/meetstream",
  "video_required": true,
  "live_audio_required": {
    "websocket_url": "wss://yourapp.com/ws/audio"
  },
  "recording_config": {
    "transcript": {
      "provider": { "name": "assemblyai" }
    }
  }
}

Tradeoffs: post-call vs. real-time
Post-call recording is simpler to implement, produces higher quality artifacts (the transcription provider processes the full audio at once), and requires no persistent WebSocket server infrastructure on your end. The tradeoff is latency: you can't act on meeting content until after the meeting ends and processing completes, which can be 5 to 15 minutes after the last participant leaves.

Real-time streaming has lower latency (sub-second to a few seconds depending on your processing pipeline) but is more complex: you need a WebSocket server that stays connected for the duration of every meeting, and streaming transcription models have lower accuracy than post-call models due to shorter context windows. For most recording and archival use cases, post-call is the right choice. For real-time coaching, voice agents, or live transcript displays, real-time streaming is necessary.
How MeetStream fits in
MeetStream provides both recording modes through the same bot API. The bot handles Zoom-side authentication and stream capture. You provide a callback URL for webhooks (post-call) or a WebSocket URL (real-time), and MeetStream handles the delivery. The same API works for Google Meet and Teams, so if your application needs to capture audio and video across multiple platforms, you're not managing separate integrations per platform.
Conclusion
Capturing Zoom audio and video via API has two viable paths: post-call recording for complete artifacts with higher transcription quality, and real-time WebSocket streaming for applications that need to process audio during the meeting. The post-call path uses video_required and recording_config in the create_bot request, with artifacts retrieved via bot ID endpoints after processing webhooks fire. The real-time path uses live_audio_required with a WebSocket URL, delivering binary PCM frames (PCM16 LE, 48kHz, mono) with per-speaker headers. For most meeting recording use cases, post-call is simpler and produces better results. For live features, real-time streaming is the right tool.
Get started free at meetstream.ai or see the full API reference at docs.meetstream.ai.
Frequently Asked Questions
What audio format does the Zoom meeting recording API return?
Post-call audio artifacts are returned as MP3 files via the GET /bots/{bot_id}/audio endpoint. Real-time audio via the live_audio_required WebSocket stream is delivered as binary frames containing raw PCM16 little-endian data at 48kHz sample rate, mono channel. Each binary frame includes a header with speaker_id and speaker_name before the PCM payload, allowing per-speaker audio separation.
How do I capture separate audio streams per speaker in Zoom?
The real-time WebSocket audio stream delivers per-speaker frames. Each binary frame contains a header identifying which participant is speaking (by speaker_id and speaker_name) followed by their audio data. Buffer audio by speaker_id on your WebSocket server to maintain separate streams per participant. Post-call recordings are mixed down to a single track; speaker separation at the post-call level comes from transcription diarization, not separate audio files.
Can I do real-time transcription and post-call recording simultaneously?
Yes. Use live_transcription_required with a webhook_url for real-time transcript segments, combined with recording_config for post-call transcription. You can also combine live_audio_required (for your own WebSocket-based processing) with recording_config (for post-call archive). All three can be active in the same bot session. The real-time stream has lower latency but lower accuracy; post-call transcription is higher accuracy but available only after the meeting ends.
What is the video format for Zoom recording downloads?
Video artifacts are MP4 files, capturing the composite meeting view from the bot's perspective inside the meeting. The video includes the active speaker or shared screen with audio mixed in. Retrieve it via GET /bots/{bot_id}/video after receiving the video.processed webhook event. If you only need audio and transcript, set video_required to false to skip video capture and reduce processing overhead.
How do I know when the Zoom recording is ready to download?
Two webhook events signal readiness: audio.processed fires when the audio file is available, and video.processed fires when the video file is available. Both fire after the bot.stopped event. Do not attempt to download artifacts before receiving the corresponding processed event, as the files may not be finalized yet. Transcription is a separate step signaled by transcription.processed, which fires after audio.processed completes.
