Capturing Audio Streams from Meeting Bots via WebSocket
Most meeting bot APIs give you a transcript. MeetStream gives you the raw audio. That distinction matters for a specific class of applications: real-time voice agents that need to respond mid-call, custom speech recognition models that outperform generic APIs on your domain, speaker diarization pipelines that require access to the raw signal, and audio analytics systems that detect emotion, energy, or speaking rate. None of these are possible from a transcript alone.
Meeting bot audio streaming via WebSocket is the mechanism that makes this possible. When you create a bot with live_audio_required set to a WebSocket URL, the bot connects to your server and begins streaming audio frames as binary WebSocket messages. Your server receives them, parses them, and does whatever it needs to do with the raw audio. The bot is the client. Your server is the receiver.
The frame format is a binary protocol, not JSON or a media container. There are no audio headers, no codec wrappers, no timestamps embedded in the audio data itself. What you receive is raw PCM int16 little-endian mono audio at 48kHz, preceded by a small header that identifies the speaker. Parsing it correctly requires understanding the struct layout at the byte level, especially if you have not worked with raw PCM before.
In this guide, you will get the complete specification of MeetStream's live audio WebSocket: the connection lifecycle, the byte-level frame format, the audio properties, and working server implementations in Python (asyncio + websockets) and Node.js. Let's get into it.
Connection Lifecycle: Bot Connects to You
The key architectural point is that the MeetStream bot acts as the WebSocket client. Your server must be running and accessible before the bot enters the meeting. The bot connects to the URL you specify in the live_audio_required field when creating the bot.
```python
import httpx

API_KEY = "YOUR_API_KEY"

response = httpx.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    headers={"Authorization": f"Token {API_KEY}"},
    json={
        "meeting_link": "https://meet.google.com/abc-defg-hij",
        "bot_name": "MeetStream Bot",
        "live_audio_required": {
            "websocket_url": "wss://your-server.example.com/audio"
        }
    },
)
bot = response.json()
```

The lifecycle follows the bot's state machine. When the bot joins the meeting (bot.inmeeting webhook event), it begins the WebSocket handshake with your server. Audio frames start flowing within a few seconds. When the meeting ends or the bot is removed, the bot closes the WebSocket connection. The bot.stopped webhook fires after the WebSocket closes.
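A sketch of how a server might correlate those webhook events with WebSocket activity. The tracker class and its state names are hypothetical; only the event names (bot.inmeeting, bot.stopped) come from MeetStream's webhooks as described above:

```python
# Hypothetical lifecycle tracker: correlates MeetStream webhook events
# with WebSocket connection state. Only the event names are from the
# API; the class itself is an illustrative sketch.
class BotLifecycle:
    def __init__(self):
        self.active = {}  # bot_id -> state string

    def on_webhook(self, event: str, bot_id: str) -> None:
        if event == "bot.inmeeting":
            # The WebSocket handshake should follow within seconds
            self.active[bot_id] = "awaiting_audio"
        elif event == "bot.stopped":
            # Fires after the WebSocket has already closed
            self.active.pop(bot_id, None)

    def on_ws_connect(self, bot_id: str) -> None:
        self.active[bot_id] = "streaming"

tracker = BotLifecycle()
tracker.on_webhook("bot.inmeeting", "bot_123")
tracker.on_ws_connect("bot_123")
print(tracker.active["bot_123"])  # streaming
tracker.on_webhook("bot.stopped", "bot_123")
print("bot_123" in tracker.active)  # False
```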
Your server must be reachable from MeetStream's infrastructure over the public internet. Local development requires a tunnel like ngrok or a deployed server. The WebSocket must use TLS (wss://) in production.
Binary Frame Format: Byte-Level Specification
Each binary WebSocket message is a single audio frame with this layout:
| Field | Size | Type | Description |
|---|---|---|---|
| msg_type | 1 byte | uint8 | Message type identifier |
| sid_len | 2 bytes | uint16 LE | Length of speaker_id in bytes |
| speaker_id | sid_len bytes | UTF-8 string | Unique identifier for the speaker |
| sname_len | 2 bytes | uint16 LE | Length of speaker_name in bytes |
| speaker_name | sname_len bytes | UTF-8 string | Display name of the speaker |
| audio | variable | PCM int16 LE | Raw audio samples, 48kHz mono |
All multi-byte integers are little-endian. Speaker IDs are unique per participant across the meeting session. Speaker names are the display names shown in the meeting platform. The audio data begins immediately after the speaker name bytes with no padding or alignment.
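The layout above can be exercised with a small encoder, which is handy for unit-testing a parser without a live bot. This builder is not part of the MeetStream API; it just reproduces the byte layout from the table:

```python
import struct

def build_frame(msg_type: int, speaker_id: str, speaker_name: str,
                audio: bytes) -> bytes:
    """Serialize a frame per the table: uint8 msg_type, uint16 LE
    length-prefixed speaker_id and speaker_name, then raw PCM."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (struct.pack("<BH", msg_type, len(sid)) + sid
            + struct.pack("<H", len(sname)) + sname + audio)

# Two int16 LE samples: 1000 and -1000
audio = struct.pack("<2h", 1000, -1000)
frame = build_frame(1, "spk_01", "Ada", audio)
# Header: 1 + 2 + 6 + 2 + 3 = 14 bytes, then 4 audio bytes
print(len(frame))  # 18
```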
Audio properties: sample rate 48,000 Hz, format int16 (signed 16-bit), channel layout mono (single channel), byte order little-endian, no container or codec. Each sample is 2 bytes. Frame duration varies but is typically 10-20ms, meaning 480 to 960 samples (960 to 1920 bytes) per frame.
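That arithmetic is worth encoding once. A small helper (the function name is illustrative) converts a frame's audio byte count to a duration:

```python
SAMPLE_RATE = 48_000
BYTES_PER_SAMPLE = 2  # int16

def frame_duration_ms(audio_byte_count: int) -> float:
    """Duration of a frame's audio payload in milliseconds."""
    samples = audio_byte_count // BYTES_PER_SAMPLE
    return samples / SAMPLE_RATE * 1000

print(frame_duration_ms(960))   # 10.0  (480 samples)
print(frame_duration_ms(1920))  # 20.0  (960 samples)
```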
Python WebSocket Server with asyncio
The Python implementation uses the websockets library with asyncio. The frame parser reads each WebSocket message as bytes and unpacks the header fields sequentially with struct.

```python
import asyncio
import struct
from dataclasses import dataclass
from typing import Optional

import numpy as np
from websockets.asyncio.server import serve


@dataclass
class AudioFrame:
    msg_type: int
    speaker_id: str
    speaker_name: str
    samples: np.ndarray  # int16, 48kHz, mono


def parse_frame(data: bytes) -> Optional[AudioFrame]:
    # Minimum frame: 1-byte msg_type + two 2-byte length fields
    if len(data) < 5:
        return None
    pos = 0
    msg_type = data[pos]
    pos += 1
    sid_len = struct.unpack_from("<H", data, pos)[0]
    pos += 2
    if pos + sid_len > len(data):
        return None
    speaker_id = data[pos:pos + sid_len].decode("utf-8", errors="replace")
    pos += sid_len
    if pos + 2 > len(data):
        return None
    sname_len = struct.unpack_from("<H", data, pos)[0]
    pos += 2
    if pos + sname_len > len(data):
        return None
    speaker_name = data[pos:pos + sname_len].decode("utf-8", errors="replace")
    pos += sname_len
    audio_bytes = data[pos:]
    if len(audio_bytes) % 2 != 0:
        audio_bytes = audio_bytes[:-1]  # drop incomplete trailing sample
    samples = np.frombuffer(audio_bytes, dtype=np.int16)
    return AudioFrame(msg_type, speaker_id, speaker_name, samples)


async def audio_handler(websocket):
    remote = websocket.remote_address
    print(f"Bot connected from {remote}")
    try:
        async for message in websocket:
            if not isinstance(message, bytes):
                continue
            frame = parse_frame(message)
            if frame is None:
                continue
            # Process frame here
            print(f"  [{frame.speaker_name}] {len(frame.samples)} samples")
    except Exception as e:
        print(f"Connection error: {e}")
    finally:
        print(f"Bot disconnected from {remote}")


async def main():
    async with serve(audio_handler, "0.0.0.0", 8765):
        print("Listening on ws://0.0.0.0:8765")
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```
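Once frames are parsed, a common next step is accumulating audio per speaker and wrapping it in a WAV container for storage or offline ASR. A minimal sketch using the standard-library wave module; the buffering scheme and function names here are illustrative, not part of the MeetStream API:

```python
import wave

SAMPLE_RATE = 48_000

def write_wav(path: str, pcm: bytes) -> None:
    """Wrap raw int16 mono 48kHz PCM in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # int16 = 2 bytes per sample
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)

# Accumulate audio per speaker inside the handler, flush at disconnect
buffers: dict[str, bytearray] = {}

def on_frame(speaker_id: str, audio_bytes: bytes) -> None:
    buffers.setdefault(speaker_id, bytearray()).extend(audio_bytes)

on_frame("spk_01", b"\x00\x01" * 480)  # one 10ms frame (960 bytes)
write_wav("spk_01.wav", bytes(buffers["spk_01"]))
```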
Node.js WebSocket Server
The Node.js implementation uses the ws library. Buffer operations replace Python's struct module. The frame parsing logic is equivalent but uses Buffer.readUInt8, Buffer.readUInt16LE, and Buffer.slice.
```javascript
const WebSocket = require('ws');

function parseFrame(buffer) {
  // Minimum frame: 1-byte msg_type + two 2-byte length fields
  if (buffer.length < 5) return null;
  let pos = 0;
  const msgType = buffer.readUInt8(pos);
  pos += 1;
  const sidLen = buffer.readUInt16LE(pos);
  pos += 2;
  if (pos + sidLen > buffer.length) return null;
  const speakerId = buffer.slice(pos, pos + sidLen).toString('utf8');
  pos += sidLen;
  if (pos + 2 > buffer.length) return null;
  const snameLen = buffer.readUInt16LE(pos);
  pos += 2;
  if (pos + snameLen > buffer.length) return null;
  const speakerName = buffer.slice(pos, pos + snameLen).toString('utf8');
  pos += snameLen;
  // Audio: raw PCM int16 LE; each sample is 2 bytes
  const audioBuffer = buffer.slice(pos);
  const sampleCount = Math.floor(audioBuffer.length / 2);
  // Copy into a fresh ArrayBuffer: the header is variable-length, so
  // audioBuffer.byteOffset may be odd, and an Int16Array view requires
  // 2-byte alignment.
  const aligned = new ArrayBuffer(sampleCount * 2);
  new Uint8Array(aligned).set(audioBuffer.subarray(0, sampleCount * 2));
  const samples = new Int16Array(aligned);
  return { msgType, speakerId, speakerName, samples };
}

const wss = new WebSocket.Server({ port: 8765 });

wss.on('connection', (ws, req) => {
  console.log(`Bot connected from ${req.socket.remoteAddress}`);
  ws.on('message', (data, isBinary) => {
    // ws delivers both text and binary messages as Buffers;
    // only binary frames carry audio
    if (!isBinary) return;
    const frame = parseFrame(data);
    if (!frame) return;
    console.log(`  [${frame.speakerName}] ${frame.samples.length} samples`);
    // Process frame here
  });
  ws.on('close', () => {
    console.log('Bot disconnected');
  });
  ws.on('error', (err) => {
    console.error('WebSocket error:', err.message);
  });
});

console.log('Listening on ws://0.0.0.0:8765');
```
Handling Multiple Concurrent Bots
A single WebSocket server handles multiple simultaneous bot connections. Each connection is independent, and the speaker ID within each connection is scoped to that meeting session. You need to associate each WebSocket connection with a bot ID to route audio to the correct processing pipeline.
```python
# Python: associate a connection with a bot_id via query parameter
import urllib.parse

async def audio_handler(websocket):
    # Extract bot_id from the WebSocket URL: wss://server/audio?bot_id=xxx
    parsed = urllib.parse.urlparse(websocket.request.path)
    query = urllib.parse.parse_qs(parsed.query)
    bot_id = query.get("bot_id", ["unknown"])[0]
    print(f"Bot {bot_id} connected")
    # Now route frames to a per-bot pipeline
    async for message in websocket:
        if isinstance(message, bytes):
            frame = parse_frame(message)
            # route_to_pipeline(bot_id, frame)
```
Include the bot ID as a query parameter in the WebSocket URL when creating the bot: wss://server/audio?bot_id={bot_id}. This gives your server the context it needs to associate incoming audio with the correct meeting session from the first frame.
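Building that URL is a one-liner; the helper name here is illustrative. If the bot ID is only assigned after creation, substitute a correlation ID you generate before the create_bot call:

```python
from urllib.parse import urlencode

def audio_url(base: str, bot_id: str) -> str:
    """Append a bot_id query parameter to the WebSocket base URL."""
    return f"{base}?{urlencode({'bot_id': bot_id})}"

print(audio_url("wss://your-server.example.com/audio", "bot_123"))
# wss://your-server.example.com/audio?bot_id=bot_123
```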

Conclusion
MeetStream's audio streaming WebSocket gives your application raw PCM access to every speaker in a meeting, in real time, with minimal latency. The binary frame protocol is compact and consistent, and the implementations in Python and Node.js are straightforward once you understand the byte layout. This is the foundation for real-time transcription, voice agents, custom speech models, and audio analytics. Start with the MeetStream documentation and deploy your first audio-receiving bot from app.meetstream.ai.
How does MeetStream's live audio WebSocket work?
When you create a bot with live_audio_required set to a WebSocket URL, MeetStream's bot acts as the WebSocket client and connects to your server when it enters the meeting. Your server receives binary frames containing raw PCM audio prefixed with speaker identity headers. The connection stays open for the duration of the meeting and closes when the bot leaves. Your server must be publicly accessible over TLS (wss://) in production.
What audio format does MeetStream stream over WebSocket?
MeetStream streams raw PCM int16 little-endian mono audio at 48,000 Hz. There are no container headers, no codec wrappers, and no metadata embedded in the audio bytes themselves. Each frame contains a variable number of samples depending on the frame duration (typically 10-20ms, which is 480 to 960 samples). The audio bytes begin immediately after the speaker name field in the binary frame without any padding.
How do I parse the speaker identity from a MeetStream audio frame?
The speaker identity is encoded in a header before the audio bytes. After the one-byte message type, read two bytes as a little-endian uint16 to get the speaker ID length. Read that many bytes as UTF-8 to get the speaker ID. Then read two more bytes as uint16 for the speaker name length, and read that many bytes as UTF-8 for the speaker name. All remaining bytes are PCM audio. In Python, use the struct module with format codes "B" for uint8 and "<H" for little-endian uint16.
Can I run a WebSocket audio server locally during development?
You need a tunnel to expose your local server to MeetStream's bot infrastructure. ngrok is the most common option: run ngrok http 8765 and use the generated hostname with a wss:// scheme in your bot creation request (ngrok's HTTPS endpoint carries WebSocket traffic). The connection is real-time, so tunnel latency flows directly into your audio processing pipeline. For latency-sensitive workloads, deploy to a cloud server close to MeetStream's infrastructure rather than tunneling from a local machine.
What is the difference between live_audio_required and live_transcription_required in MeetStream?
live_audio_required delivers raw PCM audio frames over WebSocket to your server, giving you access to the unprocessed audio signal. live_transcription_required sends transcription results as JSON webhook payloads after each speaker turn ends. Use live_audio_required when you need the audio itself, such as for a custom speech model, real-time voice agent, or audio analytics. Use live_transcription_required when you only need text output, such as for CRM note creation, action item extraction, or meeting summaries. Both can be enabled simultaneously on the same bot.
Frequently Asked Questions
What WebSocket format does MeetStream use to deliver audio streams?
MeetStream delivers audio as binary WebSocket messages containing raw PCM int16 little-endian mono at 48kHz. Each frame carries a short header with the speaker ID and display name rather than a timestamp, so downstream consumers correlate audio segments to speakers using the per-frame header together with arrival order.
How do I handle audio dropout or packet loss in a WebSocket stream?
WebSocket runs over TCP, so frames arrive complete and in order; audio gaps come from connection stalls or drops rather than per-packet loss. Buffer incoming frames and detect gaps by comparing wall-clock arrival time against the audio duration received. When a gap exceeds roughly 200ms, insert silence padding before passing the buffer to your transcription engine to avoid splice artifacts that cause word-error-rate spikes.
Can I receive separate audio streams per participant?
Yes, logically. Each bot maintains a single WebSocket connection, but every frame is tagged with a speaker ID and name in its header, so your server can demultiplex the stream into per-speaker buffers keyed by speaker ID. This gives you clean per-speaker audio for diarization without the mixing artifacts common in composite audio streams.
What sample rate should I use for speech transcription?
Most speech-to-text APIs, including Whisper and Google STT, expect 16kHz mono. MeetStream streams at 48kHz, and since 48,000 is an integer multiple of 16,000, a clean 3:1 downsample with librosa, scipy, or ffmpeg gets you there. If your downstream system requires 8kHz (telephony-grade), you can downsample further without significant quality loss for standard speech content.
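The 3:1 ratio can be sketched in pure Python. A real pipeline should low-pass filter before decimating (scipy's resample_poly or ffmpeg handle this) to avoid aliasing, so treat this as an illustration of the ratio, not production resampling:

```python
import struct

def downsample_48k_to_16k(pcm: bytes) -> bytes:
    """Naive 3:1 decimation (48kHz -> 16kHz): keep every third int16
    sample. No anti-aliasing filter is applied -- illustration only."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    kept = samples[::3]
    return struct.pack(f"<{len(kept)}h", *kept)

frame = struct.pack("<6h", 1, 2, 3, 4, 5, 6)
print(len(downsample_48k_to_16k(frame)))  # 4  (2 samples kept)
```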
