Scaling Meeting Bots: Architecture Guide for High Concurrency

At 5 concurrent bots, your single-server setup works fine. At 50, it starts groaning. At 500, it falls over, and not in a way that's easy to debug: the failure modes are spread across browser processes, WebSocket connection limits, media processing CPU, and whatever database you're using to track bot state. The first time you hit a Monday morning rush of 200 simultaneous standups, you want to have solved this already.

The core problem is that meeting bots are not cheap stateless workers. Each one runs a headless browser (400–800 MB RAM), maintains persistent WebSocket connections to the meeting platform (which penalizes reconnects with admission delays or outright rejection), and processes continuous audio/video streams that are CPU-intensive per instance. When you multiply that by hundreds of concurrent sessions, the resource math becomes unforgiving fast.

Most teams discover this the hard way, after a customer demo where 40 users start their Monday syncs simultaneously and your bot fleet goes dark. The architecture for meeting bot scaling isn't complicated, but it requires getting several independent pieces right at the same time: compute provisioning, queue-based dispatch, session state isolation, and webhook processing that stays fast under load.

In this guide, we'll cover what actually makes bots hard to scale, horizontal scaling patterns, queue-based dispatch with SQS and Redis Streams, per-session state management, and how MeetStream's managed infrastructure handles the hard parts so you don't have to. Let's get into it.

What Makes Concurrent Meeting Bots Hard

Three distinct resource bottlenecks constrain a self-hosted bot fleet:

Browser instance memory. Chromium requires roughly 500–800 MB of RAM per instance under active media capture. That's not a leak, it's baseline. On a machine with 32 GB RAM, you can run roughly 35–50 bots before you're in swap territory. Audio and video processing push this further: decoding a 720p video stream for capture adds another 100–200 MB. Budget 1.5 GB per bot as a conservative planning number for capacity calculations.
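The capacity arithmetic is simple enough to encode directly. This sketch uses the 1.5 GB planning number from above; the 4 GB OS/overhead reserve is an assumption you should tune for your own hosts:

```python
def max_bots(total_ram_gb: float, per_bot_gb: float = 1.5,
             os_reserve_gb: float = 4.0) -> int:
    """Conservative per-host bot capacity: usable RAM / per-bot budget."""
    usable = total_ram_gb - os_reserve_gb
    return max(0, int(usable // per_bot_gb))

# A 32 GB host with a 4 GB reserve fits ~18 bots at 1.5 GB each.
print(max_bots(32))
```

Note this is deliberately more conservative than the 35–50 figure above, which assumes the lighter 500–800 MB baseline rather than the full media-processing budget.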

WebSocket connection concurrency. Each bot maintains at least one persistent WebSocket connection to the meeting platform's signaling server plus one for media. Most Linux kernels default to 65,535 file descriptors per process. That sounds like a lot until you realize each WebSocket occupies one FD, your application sockets occupy more, and you're also doing DNS, TLS, and logging. At 200+ concurrent bots on a single process, file descriptor exhaustion is a real failure mode. Increase ulimit -n to 1,000,000 on your worker hosts and set it at the container level.
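Raising the limit host-wide is a systemd/container concern, but the worker process can also verify and bump its own soft limit at startup with the stdlib resource module (Unix only; the soft limit can never exceed the hard limit the host grants you):

```python
import resource

def raise_fd_limit(target: int = 1_000_000) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return max(new_soft, soft)

print(raise_fd_limit())  # effective soft limit after the bump
```

Call this before spawning any Chromium processes; children inherit the limit.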

Media processing CPU. Decoding, resampling, and forwarding audio/video streams is CPU-intensive. Resampling from 48kHz stereo to 16kHz mono for an STT provider requires per-sample arithmetic on continuous streams. At 100 concurrent bots, this alone can saturate a 4-core machine if you're not careful about vectorization and buffering strategy.
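To make the per-sample cost concrete, here is a deliberately naive 48 kHz stereo to 16 kHz mono conversion in pure Python: average the two channels, keep every third frame. Real pipelines use a proper resampler with anti-alias filtering (libsamplerate, ffmpeg, or SIMD-vectorized code), but the arithmetic-per-sample shape is representative of the work your CPUs do continuously:

```python
import array

def downmix_and_decimate(pcm: array.array, factor: int = 3) -> array.array:
    """48 kHz stereo int16 -> 16 kHz mono: average L/R, keep every 3rd frame.

    Naive decimation with no low-pass filter -- illustration only.
    """
    out = array.array('h')
    for i in range(0, len(pcm) - 1, 2 * factor):  # 2 samples per stereo frame
        out.append((pcm[i] + pcm[i + 1]) // 2)    # L/R average
    return out

stereo = array.array('h', [100, 200] * 48)  # 48 stereo frames (~1 ms at 48 kHz)
mono = downmix_and_decimate(stereo)
print(len(mono))  # 16 mono samples out for 48 frames in
```

At 48,000 frames per second per bot, this loop body runs 16,000 times a second per stream; multiply by 100 bots and the need for vectorization is obvious.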

Microservices orchestration with Kubernetes for horizontal scaling. Source: Medium/FAUN.

Horizontal Scaling Architecture

The right architecture for scaling concurrent meeting bots separates four concerns: orchestration, dispatch, execution, and state.

┌─────────────────┐     ┌──────────────┐     ┌───────────────────┐
│  Orchestration  │────▶│  Job Queue   │────▶│  Bot Worker Pool  │
│  (API Gateway + │     │  (SQS/Redis) │     │  (EC2/ECS fleet)  │
│   Lambda/API)   │     └──────────────┘     └─────────┬─────────┘
└─────────────────┘                                    │
                                                       ▼
                         ┌──────────────┐     ┌───────────────────┐
                         │  State Store │◀────│  Webhook Handler  │
                         │  (DynamoDB)  │     │  (API Gateway +   │
                         └──────────────┘     │   Lambda)         │
                                              └───────────────────┘

The orchestration layer accepts bot creation requests and writes jobs to the queue. It never blocks waiting for a bot to join or finish. Worker VMs consume from the queue, spawn bot processes, and write state updates to the external store. The webhook handler is a separate, stateless service that processes events from the meeting platform and updates state independently.
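The orchestration layer's whole job is to validate, enqueue, and return. A minimal sketch of the enqueue path (the payload fields and the injected `send` callable are illustrative; in production `send` would wrap `sqs.send_message` or a Redis `XADD`):

```python
import json
import uuid
from datetime import datetime, timezone

def build_bot_job(meeting_url: str, user_id: str) -> dict:
    """Build one queue message; field names here are an assumed schema."""
    return {
        'bot_id': str(uuid.uuid4()),
        'meeting_url': meeting_url,
        'user_id': user_id,
        'requested_at': datetime.now(timezone.utc).isoformat(),
    }

def enqueue(send, meeting_url: str, user_id: str) -> str:
    """Enqueue and return bot_id immediately -- never wait for the join."""
    job = build_bot_job(meeting_url, user_id)
    send(json.dumps(job))
    return job['bot_id']

sent = []
bot_id = enqueue(sent.append, 'https://meet.example.com/abc', 'user-1')
# The caller gets bot_id back at once; the worker fleet does the rest.
```

Because the API response never depends on bot startup, P99 latency on the create endpoint stays flat no matter how deep the queue gets.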

Queue-Based Dispatch

The job queue decouples bot creation requests from actual bot execution. Without it, your orchestration layer is directly responsible for bot processes and a spike in demand means either overloading a single host or building complex load balancing logic.

With SQS, each bot creation request becomes a message. Worker hosts long-poll the queue and claim messages when they have capacity. This naturally handles demand spikes: messages queue up and workers process them as fast as they can, without dropping requests or overloading any single host.

import boto3, json, os, time

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = os.environ['BOT_QUEUE_URL']
MAX_CONCURRENT_BOTS = int(os.environ.get('MAX_CONCURRENT_BOTS', '8'))

class BotWorker:
    def __init__(self):
        self.active_bots = {}  # bot_id -> subprocess handle

    def run(self):
        while True:
            self._reap_finished()
            available_slots = MAX_CONCURRENT_BOTS - len(self.active_bots)
            if available_slots <= 0:
                time.sleep(1)
                continue

            messages = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=min(available_slots, 10),
                WaitTimeSeconds=20
            ).get('Messages', [])

            for msg in messages:
                body = json.loads(msg['Body'])
                bot_id, proc = self._spawn_bot(body)
                self.active_bots[bot_id] = proc
                # Delete only after a successful spawn; if _spawn_bot raises,
                # the message reappears after the visibility timeout and
                # another worker picks it up.
                sqs.delete_message(
                    QueueUrl=QUEUE_URL,
                    ReceiptHandle=msg['ReceiptHandle']
                )

    def _reap_finished(self):
        # Free slots for bots whose subprocess has exited
        done = [b for b, p in self.active_bots.items() if p.poll() is not None]
        for bot_id in done:
            del self.active_bots[bot_id]

    def _spawn_bot(self, config: dict):
        # Launch the bot subprocess, return (bot_id, process handle).
        # In practice: subprocess.Popen or asyncio.create_subprocess_exec
        ...

For Redis Streams (preferred if you're already running Redis), the pattern is similar but with consumer groups, which give you persistent acknowledgment and dead-letter semantics without managing SQS visibility timeouts manually.

import redis, json, os

r = redis.Redis(host=os.environ['REDIS_HOST'], decode_responses=True)
STREAM = 'bot:jobs'
GROUP = 'bot-workers'
CONSUMER = f'worker-{os.getpid()}'

# Create group if not exists
try:
    r.xgroup_create(STREAM, GROUP, id='0', mkstream=True)
except redis.exceptions.ResponseError:
    pass  # group already exists

def poll_and_dispatch():
    while True:
        entries = r.xreadgroup(
            GROUP, CONSUMER, {STREAM: '>'}, count=5, block=5000
        )
        if not entries:
            continue
        for stream, messages in entries:
            for msg_id, data in messages:
                config = json.loads(data['payload'])
                spawn_bot(config)
                # Ack only after a successful spawn; entries left pending by a
                # crashed worker can be reclaimed later with XAUTOCLAIM.
                r.xack(STREAM, GROUP, msg_id)

Kubernetes architecture diagram for scaling bot worker fleets. Source: Simform.

Per-Session State Management

When you're running dozens of workers each handling multiple bots, state cannot live in process memory. Any worker can receive a webhook event for any bot, so state must be accessible from all workers simultaneously.

DynamoDB is the standard choice for bot session state. It's fast, scales with your workload, and the access pattern, read/write by bot_id, maps directly to a partition key.

import boto3
import os
from datetime import datetime, timezone

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('meetstream-bot-sessions')

def create_session(bot_id: str, meeting_url: str, user_id: str):
    table.put_item(Item={
        'bot_id': bot_id,
        'meeting_url': meeting_url,
        'user_id': user_id,
        'status': 'Joining',
        'created_at': datetime.now(timezone.utc).isoformat(),
        'worker_id': os.environ['WORKER_ID'],
        'transcript_received': False,
        'audio_received': False
    })

def update_status(bot_id: str, status: str):
    table.update_item(
        Key={'bot_id': bot_id},
        UpdateExpression='SET #s = :s, updated_at = :t',
        ExpressionAttributeNames={'#s': 'status'},
        ExpressionAttributeValues={
            ':s': status,
            ':t': datetime.now(timezone.utc).isoformat()
        }
    )

Set a TTL on session records; 7 days is a reasonable default. This keeps your table from growing unbounded as you process thousands of bot sessions over time.
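DynamoDB TTL works off an epoch-seconds attribute you add to each item. A small helper, assuming TTL is enabled on the table with `expires_at` as the TTL attribute (the attribute name is your choice when enabling the feature):

```python
import time

SESSION_TTL_DAYS = 7

def with_ttl(item: dict, ttl_days: int = SESSION_TTL_DAYS) -> dict:
    """Stamp an `expires_at` epoch-seconds attribute onto a session item.

    Requires TTL to be enabled on the DynamoDB table with this attribute
    name; expired items are deleted in the background, typically within
    48 hours of expiry.
    """
    item['expires_at'] = int(time.time()) + ttl_days * 86400
    return item

item = with_ttl({'bot_id': 'bot-123', 'status': 'Completed'})
```

Wrap your `put_item` calls with this so every session record carries its own expiry.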

Scaling the Webhook Handler

Under high concurrency, your webhook endpoint receives a high volume of events: bot.stopping, audio.processed, and transcription.processed arrive for every bot session. A synchronous handler that does database writes inline will start queuing under load.

The fast pattern: accept the webhook, validate the signature, and write the raw payload to a queue (SQS, Kinesis, or Redis Streams). Return HTTP 200 immediately. A separate consumer process reads from the queue and does the actual state updates and downstream processing. This keeps your webhook handler's P99 response time under 50 ms regardless of load, which is important because some meeting platforms will time out or retry if your webhook takes too long.
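The hot path can be this small. The sketch below assumes an HMAC-SHA256 hex signature over the raw body; the actual signing scheme, header name, and secret distribution depend on your platform's webhook documentation:

```python
import hashlib
import hmac
import json

SECRET = b'webhook-signing-secret'  # illustrative; load from your secret store

def handle_webhook(raw_body: bytes, signature: str, enqueue) -> int:
    """Validate, enqueue, return -- no database writes on this path.

    `enqueue` would wrap sqs.send_message / XADD in production.
    """
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401
    enqueue(raw_body)  # raw payload; the consumer parses and updates state
    return 200

queued = []
body = json.dumps({'event': 'bot.stopping', 'bot_id': 'bot-123'}).encode()
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
print(handle_webhook(body, sig, queued.append))  # 200
```

Everything expensive (parsing, DynamoDB writes, downstream fan-out) happens in the queue consumer, not here.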

Note: MeetStream webhooks are not retried on non-2xx responses. Your webhook endpoint must return 200 reliably and quickly; use the queue-then-process pattern to guarantee this. See the webhook documentation for the full event schema.

Auto-Scaling Worker Fleets

For variable workloads, configure auto-scaling based on queue depth. In AWS, a CloudWatch alarm on the ApproximateNumberOfMessagesVisible SQS metric can trigger an ECS service to scale out when the queue depth exceeds a threshold.
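The scaling decision itself is a pure function of demand, which makes it easy to test before wiring it into an alarm. A sketch, where the 8-bots-per-worker figure and the fleet bounds are illustrative parameters (in AWS this logic maps onto a step-scaling policy keyed on ApproximateNumberOfMessagesVisible rather than code you run yourself):

```python
import math

BOTS_PER_WORKER = 8  # assumed per-host capacity; see the memory budget above

def desired_workers(queue_depth: int, active_bots: int,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Target fleet size: enough workers for queued plus running bots."""
    needed = math.ceil((queue_depth + active_bots) / BOTS_PER_WORKER)
    return max(min_workers, min(needed, max_workers))

print(desired_workers(queue_depth=40, active_bots=24))  # (40+24)/8 -> 8
```

Sizing on queued plus active bots, rather than queue depth alone, prevents the fleet from scaling in while meetings are still running.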

Kubernetes Event-Driven Autoscaler (KEDA) for scaling meeting bot workers. Source: Medium.

Practical scaling parameters based on observed bot workloads:

Metric                  Scale Out Threshold    Scale In Threshold
Queue depth             > 10 messages          < 2 messages
CPU utilization         > 70%                  < 30%
Memory utilization      > 75%                  < 40%
Active bots per host    > 8 of max             < 3 of max

Add a cooldown period of at least 3 minutes on scale-in to avoid thrashing: Chromium startup time means a bot that gets scheduled to a terminating instance will fail to join.

How MeetStream Fits

MeetStream's scale meeting bots infrastructure handles all of the above (queue dispatch, browser process management, auto-scaling compute, and state tracking) as a managed service. You POST to create a bot and receive webhook events; the concurrency handling happens on MeetStream's side. The API supports hundreds of concurrent meeting bots without any queue configuration or fleet management on your end. See the full API reference for rate limits and concurrency tiers.

Conclusion

Scaling meeting bots requires treating them as long-running, resource-intensive processes rather than typical web requests. The key architectural decisions are: queue-based dispatch to decouple demand from execution, external state storage (DynamoDB) so any worker can handle any webhook, stateless webhook handlers that queue events before processing them, and auto-scaling rules based on queue depth rather than CPU alone. Chromium's memory footprint is the binding constraint: plan 1.5 GB per concurrent bot and scale accordingly. If you'd rather ship product than tune fleet parameters, get started free at meetstream.ai.

Frequently Asked Questions

How many concurrent meeting bots can a single server handle?

On a typical c5.4xlarge (16 vCPU, 32 GB RAM), you can safely run 15–20 concurrent Chromium-based bots under active media capture. Budget 1.5 GB RAM and 1 vCPU per bot as a planning baseline, and reduce the cap if you're doing heavy media processing like real-time transcription or video analysis alongside capture.

Why use SQS for bot dispatch instead of direct API calls to workers?

SQS decouples demand spikes from worker capacity, provides natural backpressure, handles worker failures with visibility timeouts and dead-letter queues, and gives you queue depth metrics for auto-scaling. Direct worker calls require you to implement all of that yourself: load balancing, retry logic, health checks, and backpressure.

What happens to bot state if a worker host crashes mid-session?

If state is in external DynamoDB, the session record persists. Your recovery logic should monitor for sessions stuck in Joining or InMeeting status without a recent update timestamp, then decide whether to re-dispatch or mark them as errored. The meeting platform will typically detect the dropped bot connection and trigger a bot.stopped webhook with status Error.
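The staleness check described above is a small pure function over the session record. A sketch, assuming the ISO-8601 `updated_at` timestamps written by the state-store helpers earlier; the 10-minute threshold is a starting point, not a rule:

```python
from datetime import datetime, timedelta, timezone

STUCK_STATUSES = {'Joining', 'InMeeting'}

def is_stale(session: dict, max_silence_minutes: int = 10) -> bool:
    """True for sessions stuck in an active status with no recent update."""
    if session['status'] not in STUCK_STATUSES:
        return False
    updated = datetime.fromisoformat(session['updated_at'])
    age = datetime.now(timezone.utc) - updated
    return age > timedelta(minutes=max_silence_minutes)

old = (datetime.now(timezone.utc) - timedelta(minutes=30)).isoformat()
stale = is_stale({'status': 'Joining', 'updated_at': old})
```

A periodic sweeper (cron, or a scheduled Lambda scanning a status-keyed GSI) runs this over active sessions and re-dispatches or errors out the hits.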

How should I handle webhook events at high volume without dropping them?

Accept the webhook, validate the signature, write the raw payload to a queue, and return HTTP 200 immediately, ideally within 50 ms. Process the queued events asynchronously. This pattern keeps your webhook endpoint fast and reliable regardless of processing load, and prevents timeouts that would cause you to miss events.

What is the right auto-scaling strategy for a bot worker fleet?

Scale out on queue depth (SQS ApproximateNumberOfMessagesVisible > 10) and scale in conservatively with a 3-minute cooldown. Avoid scaling based on CPU alone: a bot in a silent meeting is CPU-idle but still consuming RAM and a WebSocket slot. Use memory utilization as a secondary scale-out signal to avoid OOM conditions.