Sliding Window for Token Processing

Why Sliding Window Matters for AI Engineers

The sliding window pattern is central to how LLMs and RAG pipelines handle text. Chunking a document into overlapping segments is a sliding window. Finding the longest passage where all sentences are relevant is a variable-size window. Processing token streams without buffering the full input is a sliding window.

If you can implement a sliding window cleanly under interview pressure, you demonstrate both algorithmic fluency and domain understanding.

The Pattern in One Mental Model

A sliding window maintains a subset of a sequence — a "window" — and moves it forward. The key insight: instead of recomputing from scratch each time, you maintain state by:

Adding the new element entering from the right
Removing the element leaving from the left
Checking or updating your answer with the current window state

Array:  [a, b, c, d, e, f, g]
Window size 3:
  Step 0: [a, b, c]          → check/update answer
  Step 1:    [b, c, d]       → remove a, add d
  Step 2:       [c, d, e]    → remove b, add e
  ...

Fixed-Size Sliding Window

Classic Problem: Maximum Sum Subarray of Size k

Python

def max_sum_subarray_k(nums: list[int], k: int) -> int:
    """
    Find the maximum sum of any k consecutive elements.
    
    Time: O(n)   — one pass through the array
    Space: O(1)  — only track current and max sum
    
    Interview signal: immediately see this as sliding window,
    not nested loops (O(n*k)).
    """
    if not nums or k <= 0 or k > len(nums):
        return 0
    
    # Initialize first window
    window_sum = sum(nums[:k])
    max_sum = window_sum
    
    # Slide window one step at a time
    for i in range(k, len(nums)):
        # Add incoming element, remove outgoing element
        window_sum += nums[i] - nums[i - k]
        max_sum = max(max_sum, window_sum)
    
    return max_sum

assert max_sum_subarray_k([1, 4, 2, 9, 7, 3, 8], 3) == 19  # [7, 3, 8] = 19? No: max is [2,9,7]=18 or [9,7,3]=19
assert max_sum_subarray_k([2, 1, 5, 1, 3, 2], 3) == 9       # [5, 1, 3] — wait: [2,1,5]=8, [1,5,1]=7, [5,1,3]=9 ✓
assert max_sum_subarray_k([], 3) == 0
assert max_sum_subarray_k([1, 2], 5) == 0  # k > len

AI Application: Chunking Text with Overlap

In RAG systems, documents are split into chunks before embedding. A naive split at fixed character/token boundaries loses context — a chunk might start mid-sentence. Overlapping windows preserve context at boundaries.

Python

def chunk_text_fixed_size(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[str]:
    """
    Split text into overlapping chunks of chunk_size characters.
    
    The overlap ensures that information at chunk boundaries
    appears in two adjacent chunks — important for RAG because
    a relevant sentence might span a boundary.
    
    Time: O(n) where n = len(text)
    Space: O(n) for the output chunks
    
    Parameters:
        chunk_size: size of each chunk in characters
        overlap: number of characters repeated between consecutive chunks
    """
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    
    chunks = []
    start = 0
    stride = chunk_size - overlap  # how far to advance each step
    
    if stride <= 0:
        raise ValueError(f"overlap ({overlap}) must be less than chunk_size ({chunk_size})")
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        
        # If we've reached the end, stop
        if end >= len(text):
            break
        
        start += stride  # slide forward by stride = chunk_size - overlap
    
    return chunks


def chunk_tokens_fixed_size(
    tokens: list[str],
    chunk_size: int = 100,
    overlap: int = 20,
) -> list[list[str]]:
    """
    Token-level chunking — more precise than character-level
    since LLM context limits are measured in tokens.
    
    Time: O(n)
    Space: O(n * overlap / stride) — overlap causes duplication
    """
    if not tokens:
        return []
    if len(tokens) <= chunk_size:
        return [tokens[:]]  # return a copy
    
    chunks = []
    stride = chunk_size - overlap
    
    if stride <= 0:
        raise ValueError(f"overlap ({overlap}) must be less than chunk_size ({chunk_size})")
    
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start += stride
    
    return chunks


# Test chunking
sample_text = "The quick brown fox jumps over the lazy dog. " * 20
chunks = chunk_text_fixed_size(sample_text, chunk_size=100, overlap=20)
print(f"Input length: {len(sample_text)} chars")
print(f"Number of chunks: {len(chunks)}")
print(f"Chunk 0 ends with: ...{repr(chunks[0][-30:])}")
print(f"Chunk 1 starts with: {repr(chunks[1][:30])}...")
# The overlap should be visible: end of chunk 0 == start of chunk 1
assert chunks[0][-20:] == chunks[1][:20], "Overlap verification failed"
print("Overlap verified: last 20 chars of chunk 0 == first 20 chars of chunk 1")

Variable-Size Sliding Window

Variable-size windows expand and shrink based on a condition. The pattern uses two pointers: left and right.

Python

def longest_substring_without_repeating(s: str) -> int:
    """
    Classic variable-size window: longest substring with all unique chars.
    
    Expand right to add chars, shrink left when we find a duplicate.
    
    Time: O(n) — each character enters and leaves the window once
    Space: O(min(n, alphabet_size))
    """
    seen: dict[str, int] = {}  # char → last seen index
    left = 0
    max_length = 0
    
    for right, char in enumerate(s):
        # If char was seen and is inside our current window
        if char in seen and seen[char] >= left:
            # Move left past the previous occurrence
            left = seen[char] + 1
        
        seen[char] = right
        max_length = max(max_length, right - left + 1)
    
    return max_length

assert longest_substring_without_repeating("abcabcbb") == 3  # "abc"
assert longest_substring_without_repeating("bbbbb") == 1     # "b"
assert longest_substring_without_repeating("pwwkew") == 3    # "wke"
assert longest_substring_without_repeating("") == 0

AI Application: Context Window Management

When a user has a multi-turn conversation with an LLM, you must manage the context window. If the history is too long, you need to trim it while preserving as much recent context as possible.

Python

from dataclasses import dataclass, field

@dataclass
class Message:
    role: str   # "user" | "assistant" | "system"
    content: str
    token_count: int = 0

    def __post_init__(self):
        if self.token_count == 0:
            # Rough approximation: 1 token ≈ 4 characters
            self.token_count = max(1, len(self.content) // 4)


def trim_conversation_to_context_limit(
    messages: list[Message],
    max_tokens: int = 4096,
    system_message: Message | None = None,
) -> list[Message]:
    """
    Given a conversation history, return the largest suffix of messages
    that fits within max_tokens. Always preserves the system message.
    
    This is a variable-size sliding window from the right:
    - Start with all messages
    - Remove oldest messages (from the left) until total tokens fit
    
    Time: O(n)
    Space: O(n)
    """
    reserved_tokens = system_message.token_count if system_message else 0
    available_tokens = max_tokens - reserved_tokens
    
    if available_tokens <= 0:
        return []
    
    # Start from the right (most recent) and work backward
    total_tokens = 0
    cutoff_index = len(messages)
    
    for i in range(len(messages) - 1, -1, -1):
        if total_tokens + messages[i].token_count > available_tokens:
            cutoff_index = i + 1
            break
        total_tokens += messages[i].token_count
    
    trimmed = messages[cutoff_index:]
    
    if system_message:
        return [system_message] + trimmed
    return trimmed


# Test
history = [
    Message("user", "Tell me about transformers.", 7),
    Message("assistant", "Transformers are neural network architectures...", 100),
    Message("user", "What is attention?", 5),
    Message("assistant", "Attention allows the model to focus on relevant parts...", 150),
    Message("user", "How does self-attention work?", 7),
    Message("assistant", "In self-attention, each token attends to all other tokens...", 200),
    Message("user", "What is the complexity?", 6),
]

system_msg = Message("system", "You are a helpful AI tutor.", 8)
trimmed = trim_conversation_to_context_limit(history, max_tokens=300, system_message=system_msg)
print(f"Original messages: {len(history)}")
print(f"Trimmed to: {len(trimmed)} messages (incl system)")
for msg in trimmed:
    print(f"  [{msg.role}] ({msg.token_count} tokens): {msg.content[:50]}...")

Implementing a Sliding Window Chunker with Overlap

This is a complete implementation that handles edge cases — exactly the kind interviewers expect.

Python

from typing import Iterator

def sliding_window_chunker(
    text: str,
    chunk_size: int,
    overlap: int,
    strip_whitespace: bool = True,
) -> Iterator[dict]:
    """
    Production-quality sliding window text chunker.
    
    Yields dicts with:
      - text: the chunk content
      - start: character offset in original text
      - end: character offset in original text
      - chunk_index: sequential chunk number
    
    Edge cases handled:
      - Empty text → yields nothing
      - Text shorter than chunk_size → yields the full text as one chunk
      - overlap >= chunk_size → raises ValueError
      - Last chunk may be shorter than chunk_size
    
    Time: O(n)
    Space: O(chunk_size) — generator, doesn't hold all chunks in memory
    """
    if overlap >= chunk_size:
        raise ValueError(
            f"overlap ({overlap}) must be strictly less than chunk_size ({chunk_size})"
        )
    
    if not text:
        return
    
    if strip_whitespace:
        text = text.strip()
    
    n = len(text)
    stride = chunk_size - overlap
    chunk_index = 0
    start = 0
    
    while start < n:
        end = min(start + chunk_size, n)
        chunk_text = text[start:end]
        
        if strip_whitespace:
            chunk_text = chunk_text.strip()
        
        if chunk_text:  # skip empty chunks after stripping
            yield {
                "text": chunk_text,
                "start": start,
                "end": end,
                "chunk_index": chunk_index,
                "is_last": end >= n,
            }
            chunk_index += 1
        
        if end >= n:
            break
        
        start += stride


# Test cases
print("=== Test 1: Normal chunking ===")
text = "ABCDEFGHIJ"
for chunk in sliding_window_chunker(text, chunk_size=4, overlap=1):
    print(chunk)
# Expected: ABC D, BCD E, CDE F... no wait: stride = 4-1 = 3
# start=0: ABCD, start=3: DEFG, start=6: GHIJ

print("\n=== Test 2: Empty text ===")
chunks = list(sliding_window_chunker("", chunk_size=10, overlap=2))
assert chunks == [], f"Expected [], got {chunks}"
print("Empty text test passed")

print("\n=== Test 3: Text shorter than chunk_size ===")
chunks = list(sliding_window_chunker("short", chunk_size=100, overlap=10))
assert len(chunks) == 1
assert chunks[0]["text"] == "short"
print("Short text test passed")

print("\n=== Test 4: Overlap verification ===")
text = "ABCDEFGHIJKLMNOP"
chunk_size, overlap = 6, 2
chunks = list(sliding_window_chunker(text, chunk_size=chunk_size, overlap=overlap))
for i in range(len(chunks) - 1):
    current_end = chunks[i]["text"][-overlap:]
    next_start = chunks[i+1]["text"][:overlap]
    # Note: overlap at text level, not always exact due to stripping
    print(f"Chunk {i} end: {repr(current_end)}, Chunk {i+1} start: {repr(next_start)}")

print("\n=== Test 5: ValueError on bad input ===")
try:
    list(sliding_window_chunker("test", chunk_size=5, overlap=5))
    assert False, "Should have raised ValueError"
except ValueError as e:
    print(f"Correctly raised ValueError: {e}")

Sentence-Aware Chunking

A smarter chunker respects sentence boundaries. It's still a sliding window, but the window unit is sentences rather than characters.

Python

import re
from typing import Generator

def sentence_tokenize(text: str) -> list[str]:
    """
    Simple sentence splitter using regex.
    For production use nltk.sent_tokenize() or spacy.
    """
    # Split on period/exclamation/question mark followed by space and capital letter
    pattern = r'(?<=[.!?])\s+(?=[A-Z])'
    sentences = re.split(pattern, text.strip())
    return [s.strip() for s in sentences if s.strip()]


def sentence_aware_chunker(
    text: str,
    max_tokens_per_chunk: int = 200,
    overlap_sentences: int = 1,
    tokens_per_word_estimate: float = 1.3,
) -> list[dict]:
    """
    Chunk text by sentences so no sentence is split mid-way.
    
    The window slides over sentences, not characters.
    Overlap is measured in sentences, not characters.
    
    Time: O(n) — linear in number of sentences
    Space: O(n) for output
    """
    sentences = sentence_tokenize(text)
    if not sentences:
        return []
    
    # Estimate token count per sentence (rough: words * 1.3)
    def estimate_tokens(sentence: str) -> int:
        return max(1, int(len(sentence.split()) * tokens_per_word_estimate))
    
    token_counts = [estimate_tokens(s) for s in sentences]
    
    chunks = []
    chunk_index = 0
    i = 0
    
    while i < len(sentences):
        # Build a chunk starting at sentence i
        chunk_sentences = []
        chunk_tokens = 0
        j = i
        
        while j < len(sentences):
            if chunk_tokens + token_counts[j] > max_tokens_per_chunk and chunk_sentences:
                # This sentence would exceed the limit — stop here
                break
            chunk_sentences.append(sentences[j])
            chunk_tokens += token_counts[j]
            j += 1
        
        if not chunk_sentences:
            # Single sentence exceeds limit — include it anyway to avoid infinite loop
            chunk_sentences = [sentences[i]]
            j = i + 1
        
        chunks.append({
            "text": " ".join(chunk_sentences),
            "sentence_start": i,
            "sentence_end": j,
            "chunk_index": chunk_index,
            "estimated_tokens": chunk_tokens,
        })
        chunk_index += 1
        
        # Advance, keeping overlap_sentences from the end of this chunk
        advance = max(1, (j - i) - overlap_sentences)
        i += advance
    
    return chunks


# Test
doc = (
    "The transformer architecture was introduced in 2017. "
    "It uses self-attention mechanisms to process sequences. "
    "Unlike RNNs, transformers process tokens in parallel. "
    "This makes them much faster to train on modern hardware. "
    "The key innovation is the attention mechanism. "
    "Attention allows each token to directly interact with any other token. "
    "This removes the bottleneck of sequential processing. "
    "BERT and GPT are both based on the transformer architecture. "
    "BERT uses bidirectional attention while GPT uses causal attention. "
    "Modern LLMs like GPT-4 are scaled-up transformers."
)

chunks = sentence_aware_chunker(doc, max_tokens_per_chunk=30, overlap_sentences=1)
print(f"Document sentences: {len(sentence_tokenize(doc))}")
print(f"Chunks produced: {len(chunks)}")
for c in chunks:
    print(f"\nChunk {c['chunk_index']} (sentences {c['sentence_start']}-{c['sentence_end']}, ~{c['estimated_tokens']} tokens):")
    print(f"  {c['text'][:100]}...")

Interview Talking Points

When you see a chunking or windowing problem, say this before coding:

"This looks like a sliding window. I'll identify whether it's fixed-size or variable-size. Fixed-size windows move by a constant stride; variable-size windows use two pointers and expand/shrink based on a condition. The key invariant I'll maintain is [state in the window], and I'll update it in O(1) per step to keep the overall algorithm O(n)."

Then ask: "Should I handle overlapping chunks? What should happen at the last chunk if the text doesn't divide evenly?" These questions show you've thought about the problem, not just the happy path.

Sliding Window for Token Processing

Why Sliding Window Matters for AI Engineers

The Pattern in One Mental Model

Fixed-Size Sliding Window

Classic Problem: Maximum Sum Subarray of Size k

AI Application: Chunking Text with Overlap

Variable-Size Sliding Window

AI Application: Context Window Management

Implementing a Sliding Window Chunker with Overlap

Sentence-Aware Chunking

Interview Talking Points

Enjoyed this article?

Leave a comment