Systems · Advanced · 10 min read

Discord

Discord's Read States: Rewriting a Go Service in Rust

Eliminating latency spikes by moving from Go's GC to Rust's ownership model

Key outcome: 5x lower tail latency
Rust · Go · Performance · Memory Management · Backend

The Problem

Discord's Read States service tracks one thing: which messages has each user already read? It sounds simple, but at Discord's scale, it's one of the most write-heavy services in the entire infrastructure — every time any user opens a channel, reads a message, or receives a notification, Read States updates.

At the time of the rewrite, Discord had 150 million users. The service processed millions of writes per second.

The team had already invested heavily in Go. The service worked. But periodically — roughly every few minutes — every user would experience a sudden latency spike of ~2 seconds. The culprit: Go's garbage collector.


Why Go's GC Caused Latency Spikes

Go uses a concurrent, tri-color mark-and-sweep garbage collector. Most of the time, GC runs in the background with low impact. But the Read States service had a specific workload that made the GC expensive:

  • Large LRU cache — millions of entries kept in memory for fast reads
  • Frequent small allocations — every state update allocated a new struct
  • The GC had to scan the entire cache on every collection cycle

Every few minutes, the GC would scan millions of live objects. During this scan, the runtime had to "stop the world" briefly for certain bookkeeping steps. At scale, this produced consistent 2-second p99 latency spikes — not a theoretical problem, but one users felt.

The team profiled extensively and confirmed the GC was the root cause. Options:

  1. Tune GC parameters (reduces frequency, doesn't eliminate spikes)
  2. Reduce cache size (hurts hit rate, increases DB load)
  3. Rewrite in a language without a GC

They chose option 3.


Why Rust?

Rust doesn't have a garbage collector. Memory is managed through ownership and borrowing — a compile-time system that guarantees memory safety without runtime overhead:

  • No GC pauses — memory is freed deterministically when values go out of scope
  • No dangling pointers — the borrow checker prevents use-after-free at compile time
  • No data races — ownership rules prevent two threads from mutating the same data simultaneously
  • Zero-cost abstractions — high-level code compiles to the same assembly as the manual equivalent
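
A minimal sketch of what this looks like in practice (the Session type below is hypothetical, not from Discord's code):

Rust
// Hypothetical example illustrating ownership and deterministic cleanup.
struct Session {
    buffer: Vec<u8>, // heap allocation owned by this struct
}

fn process(session: &Session) {
    // An immutable borrow: the compiler verifies `session` outlives this call.
    println!("buffer length: {}", session.buffer.len());
}

fn main() {
    let session = Session { buffer: vec![0u8; 1024] };
    process(&session);
    // `session` goes out of scope here: its buffer is returned to the
    // allocator immediately and deterministically, with no GC involved.
}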

The trade-off: Rust has a steep learning curve. The borrow checker rejects code patterns that feel natural in other languages.

Discord already had some Rust experience from other services, and the team judged the rewrite worth the investment.


The Rewrite

The Rust service was a near-direct port of the Go service's logic. Key decisions:

Data Structures

Rust
// A minimal, compilable version of the types described here; the article
// doesn't name the specific LRU crate, so the `lru` crate is assumed.
use std::collections::HashMap;
use lru::LruCache;

// The per-channel read state: last read message ID + mention count
struct ReadState {
    last_message_id: u64,
    mention_count: u32,
}

// LRU cache: user_id -> HashMap<channel_id, ReadState>
// Using an off-the-shelf LRU crate instead of implementing from scratch
type UserStateCache = LruCache<u64, HashMap<u64, ReadState>>;
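
As a rough usage sketch building on the definitions above (not Discord's code; the put/get calls assume the lru crate's API, and the IDs are invented):

Rust
// Hypothetical usage of the cache; the IDs are made up for illustration.
fn record_read(cache: &mut UserStateCache) {
    // Record that user 42 has read up to message 9001 in channel 7.
    let mut channels = HashMap::new();
    channels.insert(7u64, ReadState { last_message_id: 9001, mention_count: 0 });
    cache.put(42u64, channels);

    // A hit returns a reference into the cache; a miss falls back to the database.
    if let Some(channels) = cache.get(&42u64) {
        if let Some(state) = channels.get(&7u64) {
            println!("last read in channel 7: {}", state.last_message_id);
        }
    }
}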

Async Runtime: Tokio

The Go service used goroutines for concurrency. The Rust service used Tokio — Rust's async runtime:

Rust
use std::sync::{Arc, Mutex};
use tokio::net::TcpListener;

// Illustrative capacity; the real cache size isn't given in the article.
// Depending on the LRU crate version, this may need to be a NonZeroUsize.
const CACHE_SIZE: usize = 1_000_000;

#[tokio::main]
async fn main() {
    let cache = Arc::new(Mutex::new(UserStateCache::new(CACHE_SIZE)));

    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();

    loop {
        let (socket, _) = listener.accept().await.unwrap();
        let cache = Arc::clone(&cache);
        tokio::spawn(async move {
            handle_connection(socket, cache).await;
        });
    }
}

Arc<T> is Rust's thread-safe, reference-counted smart pointer; wrapping the cache in Arc<Mutex<UserStateCache>> lets every spawned task share one cache and lock it before touching it. In Go, the runtime's garbage collector keeps the shared value alive for you (though you still guard it with a mutex); in Rust, that shared ownership is spelled out in the type.
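
The article doesn't show handle_connection; the sketch below is one guess at its shape, assuming the shared Arc<Mutex<UserStateCache>> from above and an invented request format:

Rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::net::TcpStream;

// Hypothetical handler; Discord's real wire protocol and parsing aren't described here.
async fn handle_connection(_socket: TcpStream, cache: Arc<Mutex<UserStateCache>>) {
    // ... read and parse a request from the socket (elided) ...
    let (user_id, channel_id, message_id) = (42u64, 7u64, 9001u64);

    {
        // Hold the lock only for the in-memory update, never across an .await point.
        let mut cache = cache.lock().unwrap();
        match cache.get_mut(&user_id) {
            Some(channels) => {
                channels.insert(channel_id, ReadState { last_message_id: message_id, mention_count: 0 });
            }
            None => {
                let mut channels = HashMap::new();
                channels.insert(channel_id, ReadState { last_message_id: message_id, mention_count: 0 });
                cache.put(user_id, channels);
            }
        }
    } // lock released here, before any further I/O

    // ... write a response back to the socket (elided) ...
}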

Memory Management

In Go, the LRU cache entries are heap-allocated objects managed by the GC. In Rust, the LruCache knows the exact layout and lifetime of every entry. When an entry is evicted, its memory is freed immediately — no GC needed.

Rust
// When an entry is evicted, ownership moves to `entry`; dropping it frees
// the memory immediately. No GC pause, no deferred cleanup.
fn evict_oldest(cache: &mut UserStateCache) {
    if let Some((_key, entry)) = cache.pop_lru() {
        // The explicit drop is only for emphasis; `entry` would be dropped at
        // the end of this block anyway, returning its memory to the allocator.
        drop(entry);
    }
}

The Results

| Metric | Go Service | Rust Service |
|--------|-----------|--------------|
| Average latency | ~500µs | ~500µs |
| p99 latency (normal) | ~10ms | ~1ms |
| p99 latency (GC spike) | ~2,000ms | ~1ms |
| Memory usage | Higher (GC overhead) | Lower |
| CPU usage | Higher | Lower |

The headline numbers: 5x improvement in p99 latency and elimination of the periodic 2-second spikes. Average latency was similar — Go was never slow on average.


What Discord Learned

1. The Problem Was Predictable Memory Pressure

The GC spikes weren't random. They were triggered by the specific combination of a large live cache and frequent small allocations. Any language with a tracing GC (Java, Go, Python, C#) would have exhibited similar behavior under this workload.

2. Rust's Learning Curve Is Real, But Bounded

The engineers who did the rewrite report that the first ~2 weeks of fighting the borrow checker were frustrating. After that, the patterns became intuitive. The key insight: the borrow checker rejects code that would be silently wrong in other languages.
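
A small hypothetical example of that kind of rejection: holding a reference into a collection while mutating it is accepted in C++ or Go, where it can go silently wrong, but it fails to compile in Rust:

Rust
// Hypothetical illustration; not related to Discord's code.
fn main() {
    let mut message_ids: Vec<u64> = vec![100, 101, 102];

    // Borrow a reference into the vector's buffer.
    let first = &message_ids[0];

    // A push may reallocate the buffer and move every element. In C++ the
    // equivalent pointer could dangle; in Go it would silently point at a
    // stale copy of the old backing array. In Rust the line below is a
    // compile-time error (E0502: `message_ids` is still borrowed by `first`).
    // message_ids.push(103);

    println!("first message id: {}", first);
}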

3. The Right Tool for the Right Job

Discord didn't rewrite everything in Rust. They chose Rust specifically for services with:

  • Predictable, latency-sensitive workloads
  • Large in-memory caches or data structures
  • High write throughput that stresses the GC

Most of Discord's services stay in Elixir or Go. Read States was an outlier.

4. Elixir Was the Alternative

Discord's primary backend language is Elixir (on the BEAM VM). They briefly considered porting Read States to Elixir, but the BEAM's actor model is optimised for distributed message passing, not large shared in-memory caches with constant mutation.


The Key Trade-Off Table

| Dimension | Go | Rust |
|-----------|-----|------|
| GC pauses | Yes (periodic) | No |
| Memory safety | Runtime checks + GC | Compile-time only |
| Learning curve | Low | High (borrow checker) |
| Ecosystem | Large, mature | Growing, strong for systems |
| Best for | General-purpose backend | Latency-critical, GC-sensitive |


The Pattern to Take Away

Predictable latency requires predictable memory management.

When a service has strict p99/p999 latency requirements AND holds a large live heap, a GC becomes a liability. The GC's occasional full-scan pauses, however short in absolute time, become visible to users at a high enough percentile. As a rough illustration: a 2-second stall every couple of minutes is on the order of 1-2% of wall-clock time, which by itself consumes the 1% slow-request budget that a p99 target implies.

The solution space:

  1. Reduce heap pressure (smaller caches, explicit pooling)
  2. Tune the GC (GOGC in Go; heap-size and collector flags such as -Xmx on the JVM)
  3. Move to a GC-free language (Rust, C, C++)
  4. Accept the trade-off (if p99 requirements aren't strict)

Discord's workload made options 1 and 2 insufficient. Option 3 was the right call.


Further Reading

  • Discord Engineering Blog: "Why Discord is switching from Go to Rust" (2020)
  • Rust tokio async runtime documentation
  • Course: Systems Programming concepts
