Deep Learning for AI Interviews · Lesson 9 of 56

Matrix Operations Behind Every Layer

The Core Operation: Matrix Multiplication

The forward pass of a fully connected network is, at its core, a sequence of matrix multiplications, each followed by a bias add and a non-linear activation:

Layer forward pass:
  Z = X @ W.T + b

Where:
  X: input batch,    shape (batch_size, n_in)
  W: weight matrix,  shape (n_out, n_in)
  b: bias vector,    shape (n_out,)
  Z: pre-activation, shape (batch_size, n_out)

Matrix multiply X @ W.T:
  (batch_size, n_in) @ (n_in, n_out) → (batch_size, n_out)
  
  Row i of Z holds the dot products of input example i with each neuron's weight vector.
  All examples in the batch are computed in a single matrix multiply.
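
As a sanity check on the shapes above, here is a minimal sketch that computes Z one example and one neuron at a time with explicit loops, then confirms the result matches the single matrix expression. The sizes and the random seed are arbitrary illustrative choices, not values from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
batch_size, n_in, n_out = 4, 3, 2       # illustrative sizes

X = rng.standard_normal((batch_size, n_in))
W = rng.standard_normal((n_out, n_in))
b = rng.standard_normal(n_out)

# Explicit version: one dot product per (example, neuron) pair
Z_loop = np.empty((batch_size, n_out))
for i in range(batch_size):
    for j in range(n_out):
        Z_loop[i, j] = np.dot(X[i], W[j]) + b[j]

# Vectorised version: one matrix multiply plus a broadcast bias add
Z = X @ W.T + b

print(Z.shape)                    # (4, 2)
print(np.allclose(Z, Z_loop))     # True
```

Both versions do the same arithmetic; the matrix form simply hands the whole computation to an optimised routine in one call.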

Why Matrix Operations Matter

For a batch of 256 examples with 512 inputs and 1024 neurons:
  X: (256, 512)
  W: (1024, 512)
  
  Operations: 256 × 512 × 1024 = 134M multiply-adds
  
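  Sequential (pure-Python for loop): seconds to minutes on CPU
  Matrix multiply on GPU: roughly 0.1 ms, depending on hardware
  
  The GPU does not run all 134M operations at the same instant; it spreads them across thousands of cores that work in parallel.
  This is why GPUs are so widely used for deep learning and why batch processing is efficient.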
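
The gap can be felt even without a GPU. The sketch below is a rough CPU-only comparison under stated assumptions: it times one NumPy matrix multiply at the full size, then times pure-Python loops on a small 4 × 8 slice of the output and extrapolates to the full matrix. The slice size is an arbitrary choice, and the printed timings will vary by machine, so no expected values are shown for them.

```python
import time
import numpy as np

batch_size, n_in, n_out = 256, 512, 1024
macs = batch_size * n_in * n_out
print(f"{macs:,} multiply-adds")        # 134,217,728 multiply-adds

rng = np.random.default_rng(0)          # arbitrary seed
X = rng.standard_normal((batch_size, n_in))
W = rng.standard_normal((n_out, n_in))

# Optimised path: a single matrix multiply (typically BLAS-backed in NumPy)
t0 = time.perf_counter()
Z = X @ W.T
t_matmul = time.perf_counter() - t0

# Naive path: Python loops over a small slice of the output, then extrapolate
rows, cols = 4, 8
t0 = time.perf_counter()
Z_small = np.empty((rows, cols))
for i in range(rows):
    for j in range(cols):
        acc = 0.0
        for k in range(n_in):
            acc += X[i, k] * W[j, k]
        Z_small[i, j] = acc
t_loop_est = (time.perf_counter() - t0) * (batch_size * n_out) / (rows * cols)

print(f"matmul: {t_matmul * 1e3:.2f} ms")
print(f"pure-Python loop (extrapolated): {t_loop_est:.1f} s")
```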

NumPy Matrix Operations

Python

import numpy as np

# Matrix multiplication
A = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
B = np.array([[1, 0], [0, 1]])            # shape (2, 2)

C = A @ B                                  # (3, 2) — matrix multiply
# or: np.matmul(A, B)
print(f"Shape: {C.shape}")   # (3, 2)

# Transpose
W = np.random.randn(4, 3)   # weight matrix: 4 neurons, 3 inputs each
print(f"W shape: {W.shape}")       # (4, 3)
print(f"W.T shape: {W.T.shape}")   # (3, 4)

# Forward pass for a batch
batch_size = 8
X = np.random.randn(batch_size, 3)   # 8 examples, 3 features
b = np.zeros(4)

Z = X @ W.T + b   # (8, 3) @ (3, 4) + (4,) → (8, 4)
print(f"Pre-activation Z shape: {Z.shape}")   # (8, 4)

# Broadcasting: b is (4,) but Z is (8, 4)
# NumPy broadcasts b across the batch dimension automatically
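
To make the broadcast concrete, this small sketch compares the implicit bias add with an explicit version where the bias is physically copied to every row. The array values are made up for readability.

```python
import numpy as np

Z_lin = np.arange(8 * 4, dtype=float).reshape(8, 4)   # stand-in for X @ W.T
b = np.array([10.0, 20.0, 30.0, 40.0])                # one bias per neuron

implicit = Z_lin + b                      # NumPy broadcasts (4,) against (8, 4)
explicit = Z_lin + np.tile(b, (8, 1))     # bias copied out to shape (8, 4)

print(np.array_equal(implicit, explicit))   # True
print(implicit[0])                          # [10. 21. 32. 43.]
```

Broadcasting gives the same result without materialising the tiled copy, which saves memory on large batches.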

PyTorch Tensor Operations

Python

import torch
import torch.nn as nn

# Basic tensor operations
x = torch.randn(3, 4)    # shape (3, 4)
y = torch.randn(4, 5)    # shape (4, 5)

z = x @ y                # matrix multiply: (3, 4) @ (4, 5) → (3, 5)
# or: torch.matmul(x, y)
# or: torch.mm(x, y)    — 2D only, no batch dimension

# Batched matrix multiply
A = torch.randn(32, 10, 64)   # batch=32, seq_len=10, d=64
B = torch.randn(32, 64, 20)   # batch=32, d=64, out=20

C = torch.bmm(A, B)           # (32, 10, 20) — batch matmul
# or: A @ B                    — same result (broadcasting-aware)

# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b_vec = torch.tensor([4.0, 5.0, 6.0])
print(a * b_vec)   # [4, 10, 18] — element-wise product (Hadamard)
print(a + b_vec)   # [5, 7, 9]   — element-wise addition

# Dot product (inner product)
dot = (a * b_vec).sum()   # or torch.dot(a, b_vec) for 1D
print(f"Dot product: {dot}")  # 32.0

# Outer product
outer = torch.outer(a, b_vec)   # (3, 3) matrix
print(f"Outer product:\n{outer}")
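
These operations connect directly to the layer formula from the start of the lesson. As a short sketch (layer sizes and seed are arbitrary), `nn.Linear` stores its weight with shape (n_out, n_in), and its output matches `x @ W.T + b` computed by hand:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # arbitrary seed, for reproducibility only

layer = nn.Linear(in_features=3, out_features=4)
print(layer.weight.shape)            # torch.Size([4, 3]), i.e. (n_out, n_in)
print(layer.bias.shape)              # torch.Size([4])

x = torch.randn(8, 3)                # batch of 8 examples, 3 features each
manual = x @ layer.weight.T + layer.bias

print(torch.allclose(layer(x), manual, atol=1e-6))   # True
```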

Attention Is a Matrix Operation

The scaled dot-product attention that powers Transformers is pure matrix ops:

Python

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    Q: torch.Tensor,    # queries:  (batch, n_heads, seq_q, d_k)
    K: torch.Tensor,    # keys:     (batch, n_heads, seq_k, d_k)
    V: torch.Tensor,    # values:   (batch, n_heads, seq_k, d_v)
    mask: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    d_k = Q.size(-1)
    
    # Step 1: Q @ K.T → attention scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # shape: (batch, n_heads, seq_q, seq_k)
    
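    # Step 2: Apply mask (optional)
    # Convention here: mask is 1 where attention is allowed, 0 where it is blocked.
    # Blocked positions get -inf so softmax assigns them zero weight.
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))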
    
    # Step 3: Softmax over seq_k dimension → attention weights
    weights = F.softmax(scores, dim=-1)
    # shape: (batch, n_heads, seq_q, seq_k)
    
    # Step 4: Weighted sum of values
    output = weights @ V
    # shape: (batch, n_heads, seq_q, d_v)
    
    return output, weights

# Attention is: output = softmax(Q @ K.T / √d_k) @ V
# A matrix multiply, softmax, then another matrix multiply
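
To see the shapes and the mask in action, the sketch below runs the same four steps inline with a causal mask, where each position may attend only to itself and earlier positions. The tensor sizes, the seed, and the choice of a causal mask are illustrative assumptions, not part of the lesson's function.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)                          # arbitrary seed
batch, n_heads, seq_len, d_k = 2, 4, 5, 8     # illustrative sizes

Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_k)

# Causal mask: 1 where attention is allowed (key position <= query position).
# Shape (seq_len, seq_len) broadcasts over the batch and head dimensions.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
output = weights @ V

print(output.shape)                                   # torch.Size([2, 4, 5, 8])
# Each query's attention weights sum to 1
print(torch.allclose(weights.sum(dim=-1), torch.ones(batch, n_heads, seq_len)))  # True
# Weights on future positions (above the diagonal) are exactly zero
print(weights[0, 0].triu(diagonal=1).abs().max().item())   # 0.0
```

One caveat worth knowing for interviews: if a mask blocks every key for some query, that row of scores is all -inf and softmax returns NaN for it.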

Key Shape Rules

Matrix multiply A @ B requires: A.shape[-1] == B.shape[-2]
  (n, k) @ (k, m) → (n, m)
  
  The "inner" dimensions must match.
  The "outer" dimensions determine output shape.
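
The rule is easy to check interactively. This sketch (all shapes are arbitrary examples) shows a valid 2D product, a batched product where leading dimensions are carried along, and the error raised when the inner dimensions disagree:

```python
import numpy as np

A = np.ones((3, 4))
B = np.ones((4, 5))
print((A @ B).shape)                     # (3, 5)

# Leading (batch) dimensions are carried along; only the last two are multiplied
A_batched = np.ones((32, 10, 64))
B_batched = np.ones((32, 64, 20))
print((A_batched @ B_batched).shape)     # (32, 10, 20)

# Mismatched inner dimensions raise an error rather than silently broadcasting
try:
    np.ones((3, 4)) @ np.ones((5, 2))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)                   # True
```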

Broadcasting rules:
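  Shapes are aligned from the right (trailing dimension first)
  Missing leading dimensions are treated as size 1
  Two dimensions are compatible if they are equal or one of them is 1
  Size-1 dimensions are stretched to match the other operand
  
  Example: (8, 4) + (4,): the (4,) is treated as (1, 4), then broadcast to (8, 4)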
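
NumPy can report the broadcast result for a set of shapes without allocating any arrays, which is handy when debugging. The sketch below uses `np.broadcast_shapes` (available in NumPy 1.20 and later); the shapes are arbitrary examples.

```python
import numpy as np

print(np.broadcast_shapes((8, 4), (4,)))            # (8, 4)
print(np.broadcast_shapes((32, 1, 64), (10, 64)))   # (32, 10, 64)
print(np.broadcast_shapes((8, 1), (1, 4)))          # (8, 4)

# Aligned from the right, 4 vs 8 are neither equal nor 1, so this fails
try:
    np.broadcast_shapes((8, 4), (8,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)                                    # False
```

The last case is a common bug: a per-example vector of shape (8,) cannot be added to an (8, 4) array until it is reshaped to (8, 1).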

Common shapes in a Transformer:
  Input tokens: (batch, seq_len)                  integer token IDs
  Embeddings:   (batch, seq_len, d_model)         e.g., (32, 512, 768)
  Q, K, V:      (batch, n_heads, seq_len, d_k)    typically d_k = d_model / n_heads
  Output:       (batch, seq_len, d_model)
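
Moving between the (batch, seq_len, d_model) and (batch, n_heads, seq_len, d_k) layouts is a reshape plus a transpose. The sketch below assumes the common convention d_k = d_model / n_heads and uses small made-up sizes; in a real model the split is applied to the outputs of learned Q, K, V projections, which are omitted here.

```python
import numpy as np

batch, seq_len, d_model, n_heads = 2, 6, 16, 4    # small illustrative sizes
d_k = d_model // n_heads                           # 4 features per head (assumed convention)

x = np.random.default_rng(0).standard_normal((batch, seq_len, d_model))

# (batch, seq_len, d_model) -> (batch, seq_len, n_heads, d_k) -> (batch, n_heads, seq_len, d_k)
heads = x.reshape(batch, seq_len, n_heads, d_k).transpose(0, 2, 1, 3)
print(heads.shape)                  # (2, 4, 6, 4)

# Inverse: swap the axes back and merge the heads into d_model again
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(merged.shape)                 # (2, 6, 16)
print(np.array_equal(merged, x))    # True
```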

Interview Answer

"Neural network forward passes are sequences of matrix multiplications: Z = X @ W.T + b. For a batch of 256 examples with 512 inputs and 1024 output neurons, this is 134M multiply-add operations — parallelisable across GPU cores, which is why GPUs are essential. The key rule for matrix multiply: the inner dimensions must match — (n, k) @ (k, m) → (n, m). Broadcasting allows adding bias vectors to batch outputs automatically. In Transformers, attention is itself matrix multiplication: softmax(Q @ K.T / √d_k) @ V — everything reduces to highly optimised linear algebra routines. Understanding tensor shapes is fundamental to debugging neural network architectures."

