Capturing Audio Streams from Meeting Bots via WebSocket
Most meeting bot APIs give you a transcript. MeetStream gives you the raw audio. That distinction matters for a specific class of applications: real-time voice agents that need to respond mid-call, custom speech recognition models that outperform generic APIs on your domain, speaker diarization pipelines that require access to the raw signal, and audio analytics systems that detect emotion, energy, or speaking rate. None of these are possible from a transcript alone.
Meeting bot audio streaming via WebSocket is the mechanism that makes this possible. When you create a bot with live_audio_required set to a WebSocket URL, the bot connects to your server and begins streaming audio frames as binary WebSocket messages. Your server receives them, parses them, and does whatever it needs to do with the raw audio. The bot is the client. Your server is the receiver.
The frame format is a binary protocol, not JSON or a media container. There are no audio headers, no codec wrappers, no timestamps embedded in the audio data itself. What you receive is raw PCM int16 little-endian mono audio at 48kHz, preceded by a small header that identifies the speaker. Parsing it correctly requires understanding the struct layout at the byte level, especially if you have not worked with raw PCM before.
In this guide, you will get the complete specification of MeetStream's live audio WebSocket: the connection lifecycle, the byte-level frame format, the audio properties, and working server implementations in Python (asyncio + websockets) and Node.js. Let's get into it.
Connection Lifecycle: Bot Connects to You
The key architectural point is that the MeetStream bot acts as the WebSocket client. Your server must be running and accessible before the bot enters the meeting. The bot connects to the URL you specify in the live_audio_required field when creating the bot.
```python
import httpx

API_KEY = "YOUR_API_KEY"

response = httpx.post(
    "https://api.meetstream.ai/api/v1/bots/create_bot",
    headers={"Authorization": f"Token {API_KEY}"},
    json={
        "meeting_link": "https://meet.google.com/abc-defg-hij",
        "bot_name": "MeetStream Bot",
        "live_audio_required": {
            "websocket_url": "wss://your-server.example.com/audio"
        }
    },
)
bot = response.json()
```

The lifecycle follows the bot's state machine. When the bot joins the meeting (bot.inmeeting webhook event), it begins the WebSocket handshake with your server. Audio frames start flowing within a few seconds. When the meeting ends or the bot is removed, the bot closes the WebSocket connection. The bot.stopped webhook fires after the WebSocket closes.
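A sketch of how a server might correlate those webhook events with WebSocket activity. The tracker class and its state names are hypothetical; only the event names (bot.inmeeting, bot.stopped) come from MeetStream's webhooks as described above:

```python
# Hypothetical lifecycle tracker: correlates MeetStream webhook events
# with WebSocket connection state. Only the event names are from the
# API; the class itself is an illustrative sketch.
class BotLifecycle:
    def __init__(self):
        self.active = {}  # bot_id -> state string

    def on_webhook(self, event: str, bot_id: str) -> None:
        if event == "bot.inmeeting":
            # The WebSocket handshake should follow within seconds
            self.active[bot_id] = "awaiting_audio"
        elif event == "bot.stopped":
            # Fires after the WebSocket has already closed
            self.active.pop(bot_id, None)

    def on_ws_connect(self, bot_id: str) -> None:
        self.active[bot_id] = "streaming"

tracker = BotLifecycle()
tracker.on_webhook("bot.inmeeting", "bot_123")
tracker.on_ws_connect("bot_123")
print(tracker.active["bot_123"])  # streaming
tracker.on_webhook("bot.stopped", "bot_123")
print("bot_123" in tracker.active)  # False
```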
Your server must be reachable from MeetStream's infrastructure over the public internet. Local development requires a tunnel like ngrok or a deployed server. The WebSocket must use TLS (wss://) in production.
Binary Frame Format: Byte-Level Specification
Each binary WebSocket message is a single audio frame with this layout:
| Field | Size | Type | Description |
|---|---|---|---|
| msg_type | 1 byte | uint8 | Message type identifier |
| sid_len | 2 bytes | uint16 LE | Length of speaker_id in bytes |
| speaker_id | sid_len bytes | UTF-8 string | Unique identifier for the speaker |
| sname_len | 2 bytes | uint16 LE | Length of speaker_name in bytes |
| speaker_name | sname_len bytes | UTF-8 string | Display name of the speaker |
| audio | variable | PCM int16 LE | Raw audio samples, 48kHz mono |
All multi-byte integers are little-endian. Speaker IDs are unique per participant across the meeting session. Speaker names are the display names shown in the meeting platform. The audio data begins immediately after the speaker name bytes with no padding or alignment.
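The layout above can be exercised with a small encoder, which is handy for unit-testing a parser without a live bot. This builder is not part of the MeetStream API; it just reproduces the byte layout from the table:

```python
import struct

def build_frame(msg_type: int, speaker_id: str, speaker_name: str,
                audio: bytes) -> bytes:
    """Serialize a frame per the table: uint8 msg_type, uint16 LE
    length-prefixed speaker_id and speaker_name, then raw PCM."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (struct.pack("<BH", msg_type, len(sid)) + sid
            + struct.pack("<H", len(sname)) + sname + audio)

# Two int16 LE samples: 1000 and -1000
audio = struct.pack("<2h", 1000, -1000)
frame = build_frame(1, "spk_01", "Ada", audio)
# Header: 1 + 2 + 6 + 2 + 3 = 14 bytes, then 4 audio bytes
print(len(frame))  # 18
```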
Audio properties: sample rate 48,000 Hz, format int16 (signed 16-bit), channel layout mono (single channel), byte order little-endian, no container or codec. Each sample is 2 bytes. Frame duration varies but is typically 10-20ms, meaning 480 to 960 samples (960 to 1920 bytes) per frame.
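That arithmetic is worth encoding once. A small helper (the function name is illustrative) converts a frame's audio byte count to a duration:

```python
SAMPLE_RATE = 48_000
BYTES_PER_SAMPLE = 2  # int16

def frame_duration_ms(audio_byte_count: int) -> float:
    """Duration of a frame's audio payload in milliseconds."""
    samples = audio_byte_count // BYTES_PER_SAMPLE
    return samples / SAMPLE_RATE * 1000

print(frame_duration_ms(960))   # 10.0  (480 samples)
print(frame_duration_ms(1920))  # 20.0  (960 samples)
```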
Python WebSocket Server with asyncio
The Python implementation uses the websockets library with asyncio. The frame parser reads each WebSocket message as bytes and unpacks the header fields sequentially with struct.

```python
import asyncio
import struct
from dataclasses import dataclass
from typing import Optional

import numpy as np
from websockets.asyncio.server import serve


@dataclass
class AudioFrame:
    msg_type: int
    speaker_id: str
    speaker_name: str
    samples: np.ndarray  # int16, 48kHz, mono


def parse_frame(data: bytes) -> Optional[AudioFrame]:
    # Minimum frame: 1-byte msg_type + two 2-byte length fields
    if len(data) < 5:
        return None
    pos = 0
    msg_type = data[pos]
    pos += 1
    sid_len = struct.unpack_from("<H", data, pos)[0]
    pos += 2
    if pos + sid_len > len(data):
        return None
    speaker_id = data[pos:pos + sid_len].decode("utf-8", errors="replace")
    pos += sid_len
    if pos + 2 > len(data):
        return None
    sname_len = struct.unpack_from("<H", data, pos)[0]
    pos += 2
    if pos + sname_len > len(data):
        return None
    speaker_name = data[pos:pos + sname_len].decode("utf-8", errors="replace")
    pos += sname_len
    audio_bytes = data[pos:]
    if len(audio_bytes) % 2 != 0:
        audio_bytes = audio_bytes[:-1]  # drop incomplete trailing sample
    samples = np.frombuffer(audio_bytes, dtype=np.int16)
    return AudioFrame(msg_type, speaker_id, speaker_name, samples)


async def audio_handler(websocket):
    remote = websocket.remote_address
    print(f"Bot connected from {remote}")
    try:
        async for message in websocket:
            if not isinstance(message, bytes):
                continue
            frame = parse_frame(message)
            if frame is None:
                continue
            # Process frame here
            print(f"  [{frame.speaker_name}] {len(frame.samples)} samples")
    except Exception as e:
        print(f"Connection error: {e}")
    finally:
        print(f"Bot disconnected from {remote}")


async def main():
    async with serve(audio_handler, "0.0.0.0", 8765):
        print("Listening on ws://0.0.0.0:8765")
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```
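Once frames are parsed, a common next step is accumulating audio per speaker and wrapping it in a WAV container for storage or offline ASR. A minimal sketch using the standard-library wave module; the buffering scheme and function names here are illustrative, not part of the MeetStream API:

```python
import wave

SAMPLE_RATE = 48_000

def write_wav(path: str, pcm: bytes) -> None:
    """Wrap raw int16 mono 48kHz PCM in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # int16 = 2 bytes per sample
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)

# Accumulate audio per speaker inside the handler, flush at disconnect
buffers: dict[str, bytearray] = {}

def on_frame(speaker_id: str, audio_bytes: bytes) -> None:
    buffers.setdefault(speaker_id, bytearray()).extend(audio_bytes)

on_frame("spk_01", b"\x00\x01" * 480)  # one 10ms frame (960 bytes)
write_wav("spk_01.wav", bytes(buffers["spk_01"]))
```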
Node.js WebSocket Server
The Node.js implementation uses the ws library. Buffer operations replace Python's struct module. The frame parsing logic is equivalent but uses Buffer.readUInt8, Buffer.readUInt16LE, and Buffer.slice.
```javascript
const WebSocket = require('ws');

function parseFrame(buffer) {
  // Minimum frame: 1-byte msg_type + two 2-byte length fields
  if (buffer.length < 5) return null;
  let pos = 0;
  const msgType = buffer.readUInt8(pos);
  pos += 1;
  const sidLen = buffer.readUInt16LE(pos);
  pos += 2;
  if (pos + sidLen > buffer.length) return null;
  const speakerId = buffer.slice(pos, pos + sidLen).toString('utf8');
  pos += sidLen;
  if (pos + 2 > buffer.length) return null;
  const snameLen = buffer.readUInt16LE(pos);
  pos += 2;
  if (pos + snameLen > buffer.length) return null;
  const speakerName = buffer.slice(pos, pos + snameLen).toString('utf8');
  pos += snameLen;
  // Audio: raw PCM int16 LE; each sample is 2 bytes
  const audioBuffer = buffer.slice(pos);
  const sampleCount = Math.floor(audioBuffer.length / 2);
  // Copy into a fresh ArrayBuffer: the header is variable-length, so
  // audioBuffer.byteOffset may be odd, and an Int16Array view requires
  // 2-byte alignment.
  const aligned = new ArrayBuffer(sampleCount * 2);
  new Uint8Array(aligned).set(audioBuffer.subarray(0, sampleCount * 2));
  const samples = new Int16Array(aligned);
  return { msgType, speakerId, speakerName, samples };
}

const wss = new WebSocket.Server({ port: 8765 });

wss.on('connection', (ws, req) => {
  console.log(`Bot connected from ${req.socket.remoteAddress}`);
  ws.on('message', (data, isBinary) => {
    // ws delivers both text and binary messages as Buffers;
    // only binary frames carry audio
    if (!isBinary) return;
    const frame = parseFrame(data);
    if (!frame) return;
    console.log(`  [${frame.speakerName}] ${frame.samples.length} samples`);
    // Process frame here
  });
  ws.on('close', () => {
    console.log('Bot disconnected');
  });
  ws.on('error', (err) => {
    console.error('WebSocket error:', err.message);
  });
});

console.log('Listening on ws://0.0.0.0:8765');
```
Handling Multiple Concurrent Bots
A single WebSocket server handles multiple simultaneous bot connections. Each connection is independent, and the speaker ID within each connection is scoped to that meeting session. You need to associate each WebSocket connection with a bot ID to route audio to the correct processing pipeline.
```python
# Python: associate a connection with a bot_id via query parameter
import urllib.parse

async def audio_handler(websocket):
    # Extract bot_id from the WebSocket URL: wss://server/audio?bot_id=xxx
    parsed = urllib.parse.urlparse(websocket.request.path)
    query = urllib.parse.parse_qs(parsed.query)
    bot_id = query.get("bot_id", ["unknown"])[0]
    print(f"Bot {bot_id} connected")
    # Now route frames to a per-bot pipeline
    async for message in websocket:
        if isinstance(message, bytes):
            frame = parse_frame(message)
            # route_to_pipeline(bot_id, frame)
```
Include the bot ID as a query parameter in the WebSocket URL when creating the bot: wss://server/audio?bot_id={bot_id}. This gives your server the context it needs to associate incoming audio with the correct meeting session from the first frame.
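Building that URL is a one-liner; the helper name here is illustrative. If the bot ID is only assigned after creation, substitute a correlation ID you generate before the create_bot call:

```python
from urllib.parse import urlencode

def audio_url(base: str, bot_id: str) -> str:
    """Append a bot_id query parameter to the WebSocket base URL."""
    return f"{base}?{urlencode({'bot_id': bot_id})}"

print(audio_url("wss://your-server.example.com/audio", "bot_123"))
# wss://your-server.example.com/audio?bot_id=bot_123
```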

Conclusion
MeetStream's audio streaming WebSocket gives your application raw PCM access to every speaker in a meeting, in real time, with minimal latency. The binary frame protocol is compact and consistent, and the implementations in Python and Node.js are straightforward once you understand the byte layout. This is the foundation for real-time transcription, voice agents, custom speech models, and audio analytics. Start with the MeetStream documentation and deploy your first audio-receiving bot from app.meetstream.ai.
How does MeetStream's live audio WebSocket work?
When you create a bot with live_audio_required set to a WebSocket URL, MeetStream's bot acts as the WebSocket client and connects to your server when it enters the meeting. Your server receives binary frames containing raw PCM audio prefixed with speaker identity headers. The connection stays open for the duration of the meeting and closes when the bot leaves. Your server must be publicly accessible over TLS (wss://) in production.
What audio format does MeetStream stream over WebSocket?
MeetStream streams raw PCM int16 little-endian mono audio at 48,000 Hz. There are no container headers, no codec wrappers, and no metadata embedded in the audio bytes themselves. Each frame contains a variable number of samples depending on the frame duration (typically 10-20ms, which is 480 to 960 samples). The audio bytes begin immediately after the speaker name field in the binary frame without any padding.
How do I parse the speaker identity from a MeetStream audio frame?
The speaker identity is encoded in a header before the audio bytes. After the one-byte message type, read two bytes as a little-endian uint16 to get the speaker ID length. Read that many bytes as UTF-8 to get the speaker ID. Then read two more bytes as uint16 for the speaker name length, and read that many bytes as UTF-8 for the speaker name. All remaining bytes are PCM audio. In Python, use the struct module with format codes "B" for uint8 and "<H" for little-endian uint16.
Can I run a WebSocket audio server locally during development?
You need a tunnel to expose your local server to MeetStream's bot infrastructure. ngrok is the most common option: run ngrok http 8765 and use the generated hostname with a wss:// scheme in your bot creation request (ngrok's HTTPS endpoint carries WebSocket traffic). The connection is real-time, so tunnel latency flows directly into your audio processing pipeline. For latency-sensitive workloads, deploy to a cloud server close to MeetStream's infrastructure rather than tunneling from a local machine.
What is the difference between live_audio_required and live_transcription_required in MeetStream?
live_audio_required delivers raw PCM audio frames over WebSocket to your server, giving you access to the unprocessed audio signal. live_transcription_required sends transcription results as JSON webhook payloads after each speaker turn ends. Use live_audio_required when you need the audio itself, such as for a custom speech model, real-time voice agent, or audio analytics. Use live_transcription_required when you only need text output, such as for CRM note creation, action item extraction, or meeting summaries. Both can be enabled simultaneously on the same bot.
Frequently Asked Questions
What WebSocket format does MeetStream use to deliver audio streams?
MeetStream delivers audio as binary WebSocket messages containing raw PCM int16 little-endian mono at 48kHz. Each frame carries a short header with the speaker ID and display name rather than a timestamp, so downstream consumers correlate audio segments to speakers using the per-frame header together with arrival order.
How do I handle audio dropout or packet loss in a WebSocket stream?
WebSocket runs over TCP, so frames arrive complete and in order; audio gaps come from connection stalls or drops rather than per-packet loss. Buffer incoming frames and detect gaps by comparing wall-clock arrival time against the audio duration received. When a gap exceeds roughly 200ms, insert silence padding before passing the buffer to your transcription engine to avoid splice artifacts that cause word-error-rate spikes.
Can I receive separate audio streams per participant?
Yes, logically. Each bot maintains a single WebSocket connection, but every frame is tagged with a speaker ID and name in its header, so your server can demultiplex the stream into per-speaker buffers keyed by speaker ID. This gives you clean per-speaker audio for diarization without the mixing artifacts common in composite audio streams.
What sample rate should I use for speech transcription?
Most speech-to-text APIs, including Whisper and Google STT, expect 16kHz mono. MeetStream streams at 48kHz, and since 48,000 is an integer multiple of 16,000, a clean 3:1 downsample with librosa, scipy, or ffmpeg gets you there. If your downstream system requires 8kHz (telephony-grade), you can downsample further without significant quality loss for standard speech content.
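The 3:1 ratio can be sketched in pure Python. A real pipeline should low-pass filter before decimating (scipy's resample_poly or ffmpeg handle this) to avoid aliasing, so treat this as an illustration of the ratio, not production resampling:

```python
import struct

def downsample_48k_to_16k(pcm: bytes) -> bytes:
    """Naive 3:1 decimation (48kHz -> 16kHz): keep every third int16
    sample. No anti-aliasing filter is applied -- illustration only."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    kept = samples[::3]
    return struct.pack(f"<{len(kept)}h", *kept)

frame = struct.pack("<6h", 1, 2, 3, 4, 5, 6)
print(len(downsample_48k_to_16k(frame)))  # 4  (2 samples kept)
```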
