
Slack's WebSocket Gateway: Serving 1M Concurrent Connections

How Slack scaled real-time messaging to millions of simultaneous users

Key outcome: 1M+ concurrent connections at peak
WebSockets · Event Loop · Concurrency · Real-Time · Architecture

The Architecture Challenge

Every Slack client — desktop app, mobile app, browser tab — maintains a persistent WebSocket connection to Slack's servers. This connection carries real-time events: new messages, typing indicators, reaction updates, presence changes.

At Slack's scale in 2022, that means over 1 million simultaneous open connections at peak. Sustaining that many persistent connections on server infrastructure is a fundamentally different problem than handling HTTP requests.

HTTP requests are short-lived: connect, send, respond, disconnect. A web server handling 10,000 requests per second at ~1ms per request has only ~10 connections open at any given moment (by Little's law: 10,000/s × 0.001s). WebSocket connections are long-lived: a single connection can stay open for hours while the user has Slack in the foreground.


Why the Traditional Thread-Per-Connection Model Fails

The classical server design: one thread per connection. A thread blocks on socket I/O while waiting for data. When data arrives, the thread processes it and sends a response.

This works well at modest scale — hundreds or a few thousand connections. At 1M+ connections:

  • Thread memory: Each thread uses ~1MB of stack space. 1M threads = 1TB of memory just for stacks. Impossible.
  • Context switching: The OS scheduler switches between threads. With 1M threads, the overhead of scheduling overwhelms the CPU — more time switching than working.
  • Lock contention: Any shared resource (caches, routing tables) becomes a bottleneck under thousands of concurrent threads.

The solution that emerged in the 2000s: the event loop model.


The Event Loop Model

Instead of one thread per connection, an event loop server uses a small number of threads (typically one per CPU core) and handles many connections per thread using non-blocking I/O:

Single Thread (Event Loop)
│
├─ Poll OS for I/O events (epoll/kqueue)
│   │
│   ├─ Connection 1: data ready → process message → send reply
│   ├─ Connection 847: connection established → register handler
│   ├─ Connection 12,000: ping received → send pong
│   └─ Connection 999,999: data ready → process message
│
└─ Loop back to poll

The key: when there's no I/O activity on a connection, the thread isn't blocked waiting. It's handling other connections. The OS notifies the thread via epoll (Linux) or kqueue (BSD/Mac) when I/O is ready.

This is conceptually similar to async/await in application code — but at the OS level.
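The same mechanics are visible in Python's asyncio, whose default event loop sits on top of epoll/kqueue. A minimal sketch (a toy echo server, not Slack's code): one thread services 100 concurrent connections, switching between them whenever a coroutine awaits I/O.

```python
import asyncio

async def handle(reader, writer):
    # One coroutine per connection, all multiplexed on a single
    # event-loop thread: while this coroutine awaits I/O, the
    # loop services the other connections.
    data = await reader.readline()
    writer.write(b"echo:" + data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main(n_clients: int = 100):
    # Port 0 lets the OS pick a free port for the demo
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i):
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(f"hello {i}\n".encode())
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply

    # asyncio's selector loop uses epoll (Linux) or kqueue (BSD/Mac)
    # to learn which of these sockets are ready.
    replies = await asyncio.gather(*(client(i) for i in range(n_clients)))
    server.close()
    await server.wait_closed()
    return replies

replies = asyncio.run(main())
```

All 100 connections are open simultaneously, yet no thread ever blocks on a single socket.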


Slack's Gateway Architecture

Slack built a dedicated gateway tier — a layer of servers whose sole job is maintaining WebSocket connections and routing messages:

Slack Clients (1M+ connections)
        │
        ▼
┌─────────────────────────────────────────────┐
│           WebSocket Gateway Tier            │
│   (event-loop servers, small thread pools)  │
│                                             │
│  Gateway-1   Gateway-2   Gateway-3  ...     │
│  200k conns  200k conns  200k conns         │
└──────────────────┬──────────────────────────┘
                   │ Internal pub/sub
                   ▼
┌─────────────────────────────────────────────┐
│         Application Backend Services        │
│   (message storage, search, presence, etc.) │
└─────────────────────────────────────────────┘

The gateway servers are not responsible for business logic — they don't know what a "Slack message" means. They:

  1. Accept WebSocket connections from clients
  2. Authenticate the connection
  3. Subscribe to relevant channels in an internal pub/sub system
  4. Forward events from pub/sub to the connected client
  5. Forward client messages to backend services

This separation allows gateway servers to be optimised for connection management, while backend services handle business logic.
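The five responsibilities above can be sketched as a toy connection handler. Everything here (`verify_token`, the `subscriptions` dict, the queues standing in for WebSockets and backend traffic) is illustrative, not Slack's API:

```python
import asyncio

def verify_token(token: str) -> bool:
    # 2. Authenticate: a placeholder check, not Slack's real verification
    return token.startswith("xoxp-")

async def handle_connection(subscriptions: dict, backend: asyncio.Queue,
                            user_id: str, token: str,
                            client_out: asyncio.Queue) -> bool:
    if not verify_token(token):
        return False  # reject before creating any routing state
    # 3. Subscribe: register this connection under the user's topic so that
    # 4. events published there can be forwarded down `client_out`,
    #    which stands in for the client's WebSocket.
    subscriptions.setdefault(f"user:{user_id}", []).append(client_out)
    # 5. Client -> backend traffic would be pumped into `backend` here.
    await backend.put({"user": user_id, "type": "connected"})
    return True
```

Note what is absent: the handler never parses message contents. It only moves bytes between the client and the pub/sub/backend boundary.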


The Internal Pub/Sub System

When a message is sent in a Slack channel, the gateway servers need to push it to every online member. The routing problem:

  • User A and User B are both in #engineering
  • User A sends a message
  • The backend processes it
  • The backend needs to notify every online member — but their connections are spread across hundreds of gateway servers

The solution: a pub/sub system where:

  1. Each gateway server subscribes to topics for every user it has connected
  2. When a message arrives, the backend publishes to the relevant user topics
  3. Each gateway server receives the events for its connected users and pushes them over WebSocket

Backend publishes:
  topic: "user:456" event: { type: "message", channel: "C123", text: "Hello" }

Gateway-7 (has user 456 connected):
  - Receives event on "user:456" topic
  - Finds the WebSocket for user 456
  - Sends the event JSON down the WebSocket

At Slack's scale, this pub/sub system is itself a distributed service — Slack uses Kafka for durable event delivery, with a real-time fan-out layer on top.
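The trace above can be reproduced with a minimal in-process stand-in for the fan-out layer (a dict of topics and callbacks; the class and names are illustrative, and the real system is a distributed service):

```python
import asyncio
import json

class FanOut:
    """Toy in-process pub/sub: topic -> list of async subscriber callbacks."""
    def __init__(self):
        self.topics = {}

    def subscribe(self, topic, callback):
        self.topics.setdefault(topic, []).append(callback)

    async def publish(self, topic, event):
        # Deliver the event to every subscriber of this topic
        for cb in self.topics.get(topic, []):
            await cb(event)

# Gateway-7's side: it has user 456 connected, so it subscribes
# to that user's topic and forwards events down the socket.
sent = []

async def push_to_ws(event):
    # Real code would be: await ws.send(json.dumps(event))
    sent.append(json.dumps(event))

bus = FanOut()
bus.subscribe("user:456", push_to_ws)

# Backend's side: publish the message event to the user topic
asyncio.run(bus.publish(
    "user:456",
    {"type": "message", "channel": "C123", "text": "Hello"},
))
```

The backend never needs to know which gateway holds user 456's socket; the topic indirection resolves that.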


Connection Management Details

Authentication

WebSocket connections begin with an HTTP upgrade request:

HTTP
GET /api/rtm.connect HTTP/1.1
Host: wss-primary.slack.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Authorization: Bearer xoxp-token...

The gateway verifies the token before upgrading the connection; an invalid token gets a plain HTTP 401 response and the connection is rejected without ever becoming a WebSocket.
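That pre-upgrade check can be sketched as a pure function over the request headers (the function name and token-prefix rule here are ours, not Slack's):

```python
def check_upgrade_auth(headers: dict):
    """Return an HTTP status to reject with, or None to allow the upgrade."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme != "Bearer" or not token:
        return 401  # missing/malformed header: never upgrade
    if not token.startswith("xoxp-"):
        return 401  # token failed verification (placeholder rule)
    return None     # proceed with the WebSocket handshake
```

Returning a status instead of raising makes it easy to slot this in as a pre-handshake hook in whatever WebSocket server library the gateway uses.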

Heartbeating

WebSocket connections can go silent — the client's network dropped, but the TCP connection is still in ESTABLISHED state. The gateway would hold the connection open forever without a heartbeat.

Slack sends ping frames every 30 seconds. If a pong doesn't arrive within a timeout window, the connection is terminated and cleaned up:

Python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def connection_heartbeat_loop(conn_id: str, ws):
    """Ping every 30s; close the connection if no pong arrives within 10s."""
    while True:
        await asyncio.sleep(30)
        try:
            # With the `websockets` library, ws.ping() returns a waiter
            # that resolves when the matching pong frame arrives.
            pong_waiter = await ws.ping()
            await asyncio.wait_for(pong_waiter, timeout=10.0)
        except asyncio.TimeoutError:
            logger.info(f"Connection {conn_id} timed out — closing")
            await ws.close()
            break

Graceful Disconnection

When a gateway server needs to restart (for a deployment or scaling event), it can't abruptly drop 200,000 connections. Clients would immediately reconnect — creating a reconnection storm that could overwhelm the backend.

The solution: graceful drain. The gateway stops accepting new connections and sends a CLOSE frame to existing connections with a short delay between each — staggering reconnections:

Python
import asyncio

async def drain_connections(active_connections: dict, delay_ms: int = 5):
    """Close every connection in turn, staggering client reconnects."""
    for conn_id, ws in list(active_connections.items()):
        await ws.close(code=1001, reason="server_restart")  # 1001 = going away
        await asyncio.sleep(delay_ms / 1000)  # stagger reconnects

With 200,000 connections and a 5ms delay between each close, draining takes 200,000 × 5ms = 1,000 seconds, roughly 17 minutes. The deployment orchestrator waits for the drain to complete before terminating the process.


Scaling the Connection Pool

Horizontal Scaling

Adding more gateway servers scales connection capacity linearly. The load balancer uses consistent hashing to route reconnecting clients back to the same gateway server where possible, preserving subscriptions and avoiding a thundering herd.

But consistent hashing alone isn't sufficient when a gateway server goes down: all of its clients reconnect to other servers simultaneously. To spread that load, clients add jitter to their reconnection delays, reconnecting between 0 and 5 seconds after a disconnect.
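A minimal consistent-hash ring with virtual nodes, plus client-side jitter, might look like this (an illustrative sketch, not Slack's load balancer):

```python
import hashlib
import random
from bisect import bisect

class HashRing:
    """Consistent-hash ring; virtual nodes smooth the per-server load."""
    def __init__(self, servers, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Any well-mixed hash works; md5 is used here only for determinism
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, client_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the client's hash
        i = bisect(self._keys, self._hash(client_id)) % len(self._keys)
        return self.ring[i][1]

def reconnect_delay(spread: float = 5.0) -> float:
    # Client-side jitter: reconnect a random 0-5s after a disconnect
    return random.uniform(0, spread)
```

The defining property: removing a server only remaps the clients that were on it; everyone else stays where they are, so their gateway-side subscriptions survive.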

Connection Multiplexing

Slack uses connection multiplexing for workspace members who are in many channels. Instead of subscribing to each channel's events separately, the gateway maintains a single internal subscription per user and the backend fans out relevant events:

Without multiplexing:
  User in 500 channels → 500 topic subscriptions on pub/sub

With multiplexing:
  User in 500 channels → 1 topic subscription ("user:456")
  Backend filters to relevant events before publishing

At 1M users each in ~100 channels, this reduces pub/sub subscription count from 100M to 1M — a 100x reduction in subscription overhead.
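The backend-side filtering step can be sketched as a pure function that expands one channel event into per-user-topic publishes (names and data shapes are illustrative):

```python
def fan_out(event: dict, channel_members: dict, online: set):
    """Expand one channel event into (topic, event) publishes,
    one per online member of the channel."""
    members = channel_members.get(event["channel"], [])
    return [(f"user:{uid}", event) for uid in members if uid in online]
```

Offline members get nothing pushed; they catch up from message history when they reconnect, which is exactly what keeps the subscription count at one per user rather than one per user-channel pair.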


The Metrics That Matter

For a WebSocket gateway, traditional HTTP metrics aren't sufficient:

| Metric | What It Measures |
|--------|------------------|
| Active connections | Current connection count — tracks scaling needs |
| Connection churn rate | Connects/disconnects per second — high churn indicates client issues |
| Message delivery latency | Time from backend publish to client receive |
| Missed messages | Events delivered after the client disconnected and reconnected |
| Heartbeat failure rate | Indicator of network quality issues |
| Connection age distribution | Long-lived connections (healthy) vs. constant reconnects (problematic) |


The Pattern to Take Away

WebSocket gateways are a class of infrastructure, not a product feature. The patterns that make them work at scale:

  1. Event loop, not thread-per-connection — handle tens of thousands of connections per thread
  2. Stateless gateway + stateful pub/sub — gateway knows about connections; business logic knows about users and events
  3. Heartbeating — detect dead connections before they accumulate
  4. Graceful drain — deploy without reconnection storms
  5. Jitter on reconnect — prevent thundering herd after outages
  6. Connection multiplexing — reduce fan-out by routing through user-level topics

These patterns apply whether you're building a chat system, a live collaboration tool, a financial data feed, or any other real-time system with persistent connections.

