Matrix Operations in Deep Learning
The core matrix operations that power neural networks: matrix multiplication, broadcasting, batch operations, and how they map to PyTorch.
The Core Operation: Matrix Multiplication
A neural network forward pass is fundamentally a sequence of matrix multiplications:
Layer forward pass:
Z = X @ W.T + b
Where:
X: input batch, shape (batch_size, n_in)
W: weight matrix, shape (n_out, n_in)
b: bias vector, shape (n_out,)
Z: pre-activation, shape (batch_size, n_out)
Matrix multiply X @ W.T:
(batch_size, n_in) @ (n_in, n_out) → (batch_size, n_out)
Each row of Z is the dot product of one input with all neurons' weights.
All examples in the batch are processed in parallel.
Why Matrix Operations Matter
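To see what a single matrix multiply replaces, here is a minimal NumPy sketch that computes the layer Z = X @ W.T + b with explicit loops over examples and neurons, then checks that both routes give the same result. The sizes and the helper name `forward_loop` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def forward_loop(X, W, b):
    """Compute Z = X @ W.T + b one dot product at a time (illustrative helper)."""
    batch_size, n_in = X.shape
    n_out = W.shape[0]
    Z = np.zeros((batch_size, n_out))
    for i in range(batch_size):        # one input example per row of X
        for j in range(n_out):         # one neuron per row of W
            Z[i, j] = np.dot(X[i], W[j]) + b[j]
    return Z

rng = np.random.default_rng(0)         # fixed seed so the check is repeatable
X = rng.standard_normal((8, 3))        # 8 examples, 3 features (assumed sizes)
W = rng.standard_normal((4, 3))        # 4 neurons, 3 inputs each
b = rng.standard_normal(4)

Z_loop = forward_loop(X, W, b)
Z_matmul = X @ W.T + b                 # the vectorised layer formula

print(Z_matmul.shape)                  # (8, 4)
print(np.allclose(Z_loop, Z_matmul))   # True
```

The double loop performs batch_size × n_out dot products of length n_in, which is exactly the work the matrix multiply hands to an optimised routine in one call.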
For a batch of 256 examples with 512 inputs and 1024 neurons:
X: (256, 512)
W: (1024, 512)
Operations: 256 × 512 × 1024 ≈ 134M multiply-adds
Sequential (for loop): ~seconds on CPU
Matrix multiply on GPU: ~0.1ms
The GPU spreads the 134M multiply-adds across thousands of parallel cores rather than executing them one at a time.
This is why GPUs are essential and why batch processing is efficient.
NumPy Matrix Operations
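The operation count for the 256 × 512 × 1024 example can be reproduced directly, and the matrix multiply timed on whatever hardware is at hand. This is a rough sketch: the timing depends entirely on the machine and the BLAS build behind NumPy, so no expected figure is given, and the variable names are illustrative.

```python
import time
import numpy as np

batch_size, n_in, n_out = 256, 512, 1024

# One multiply-add per (example, input, neuron) triple
multiply_adds = batch_size * n_in * n_out
print(f"{multiply_adds:,} multiply-adds")   # 134,217,728

rng = np.random.default_rng(0)
X = rng.standard_normal((batch_size, n_in)).astype(np.float32)
W = rng.standard_normal((n_out, n_in)).astype(np.float32)

start = time.perf_counter()
Z = X @ W.T                                 # single optimised matmul call
elapsed = time.perf_counter() - start

print(Z.shape)                              # (256, 1024)
print(f"matmul took {elapsed * 1e3:.2f} ms on this machine")
```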
import numpy as np
# Matrix multiplication
A = np.array([[1, 2], [3, 4], [5, 6]]) # shape (3, 2)
B = np.array([[1, 0], [0, 1]]) # shape (2, 2)
C = A @ B # (3, 2), matrix multiply
# or: np.matmul(A, B)
print(f"Shape: {C.shape}") # (3, 2)
# Transpose
W = np.random.randn(4, 3) # weight matrix: 4 neurons, 3 inputs each
print(f"W shape: {W.shape}") # (4, 3)
print(f"W.T shape: {W.T.shape}") # (3, 4)
# Forward pass for a batch
batch_size = 8
X = np.random.randn(batch_size, 3) # 8 examples, 3 features
b = np.zeros(4)
Z = X @ W.T + b # (8, 3) @ (3, 4) + (4,) → (8, 4)
print(f"Pre-activation Z shape: {Z.shape}") # (8, 4)
# Broadcasting: b is (4,) but Z is (8, 4)
# NumPy broadcasts b across the batch dimension automatically
PyTorch Tensor Operations
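The layer formula Z = X @ W.T + b maps directly onto PyTorch's `nn.Linear`, which stores its weight with shape (out_features, in_features). The short sketch below, with sizes chosen only for illustration, checks that the module and the hand-written formula agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=3, out_features=4)   # weight: (4, 3), bias: (4,)
X = torch.randn(8, 3)                              # 8 examples, 3 features

with torch.no_grad():
    Z_layer = layer(X)                             # what nn.Linear computes
    Z_manual = X @ layer.weight.T + layer.bias     # bias broadcasts over the batch

print(layer.weight.shape)                          # torch.Size([4, 3])
print(Z_layer.shape)                               # torch.Size([8, 4])
print(torch.allclose(Z_layer, Z_manual, atol=1e-5))  # True
```

Storing the weight as (n_out, n_in) is why the transpose appears in the formula: each row of the weight matrix holds one neuron's weights.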
import torch
import torch.nn as nn
# Basic tensor operations
x = torch.randn(3, 4) # shape (3, 4)
y = torch.randn(4, 5) # shape (4, 5)
z = x @ y # matrix multiply: (3, 4) @ (4, 5) → (3, 5)
# or: torch.matmul(x, y)
# or: torch.mm(x, y), which is 2D only (no batch dimension)
# Batched matrix multiply
A = torch.randn(32, 10, 64) # batch=32, seq_len=10, d=64
B = torch.randn(32, 64, 20) # batch=32, d=64, out=20
C = torch.bmm(A, B) # (32, 10, 20), batch matmul
# or: A @ B, same result (broadcasting-aware)
# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b_vec = torch.tensor([4.0, 5.0, 6.0])
print(a * b_vec) # [4, 10, 18], element-wise product (Hadamard)
print(a + b_vec) # [5, 7, 9], element-wise addition
# Dot product (inner product)
dot = (a * b_vec).sum() # or torch.dot(a, b_vec) for 1D
print(f"Dot product: {dot}") # 32.0
# Outer product
outer = torch.outer(a, b_vec) # (3, 3) matrix
print(f"Outer product:\n{outer}")
Attention Is a Matrix Operation
The scaled dot-product attention that powers Transformers is pure matrix ops:
import torch
import torch.nn.functional as F
import math
def scaled_dot_product_attention(
    Q: torch.Tensor,  # queries: (batch, n_heads, seq_q, d_k)
    K: torch.Tensor,  # keys: (batch, n_heads, seq_k, d_k)
    V: torch.Tensor,  # values: (batch, n_heads, seq_k, d_v)
    mask: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    d_k = Q.size(-1)
    # Step 1: Q @ K.T → attention scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # shape: (batch, n_heads, seq_q, seq_k)
    # Step 2: Apply mask (optional); positions where mask == 0 are excluded
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Step 3: Softmax over seq_k dimension → attention weights
    weights = F.softmax(scores, dim=-1)
    # shape: (batch, n_heads, seq_q, seq_k)
    # Step 4: Weighted sum of values
    output = weights @ V
    # shape: (batch, n_heads, seq_q, d_v)
    return output, weights
# Attention is: output = softmax(Q @ K.T / √d_k) @ V
# A matrix multiply, softmax, then another matrix multiply
Key Shape Rules
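As a concrete shape trace for the attention formula softmax(Q @ K.T / √d_k) @ V, the sketch below runs it on random tensors with a causal (lower-triangular) mask. The tensor sizes and the choice of mask are assumptions made for illustration, and the steps are written inline so the block runs on its own.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_heads, seq_len, d_k, d_v = 2, 4, 5, 16, 16   # assumed sizes

Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_v)

# Causal mask: position i may attend only to positions <= i.
# Shape (seq_len, seq_len) broadcasts over the batch and head dimensions.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (2, 4, 5, 5)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)                    # (2, 4, 5, 5)
output = weights @ V                                   # (2, 4, 5, 16)

print(scores.shape, output.shape)

# Each row of attention weights sums to 1
rows_sum_to_one = torch.allclose(
    weights.sum(dim=-1), torch.ones(batch, n_heads, seq_len)
)
print(rows_sum_to_one)                                 # True

# Masked (future) positions receive exactly zero weight
future_is_zero = bool((weights.triu(diagonal=1) == 0).all())
print(future_is_zero)                                  # True
```

The mask never touches the diagonal, so every row keeps at least one finite score and the softmax stays well defined.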
Matrix multiply A @ B requires: A.shape[-1] == B.shape[-2]
(n, k) @ (k, m) → (n, m)
The "inner" dimensions must match.
The "outer" dimensions determine output shape.
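The inner-dimension rule is simple enough to write down as a shape calculator. The helper below, `matmul_shape`, is a hypothetical illustration for the 2D case only (it ignores batch dimensions) and is not part of NumPy or PyTorch.

```python
def matmul_shape(a_shape, b_shape):
    """Return the result shape of a 2D matrix multiply, or raise if invalid."""
    n, k_a = a_shape
    k_b, m = b_shape
    if k_a != k_b:
        raise ValueError(
            f"inner dimensions differ: {a_shape} @ {b_shape} ({k_a} != {k_b})"
        )
    return (n, m)

print(matmul_shape((256, 512), (512, 1024)))   # (256, 1024)

try:
    # Forgetting the transpose: W is (n_out, n_in), so X @ W fails
    matmul_shape((256, 512), (1024, 512))
except ValueError as err:
    print(err)
```

The failing case mirrors a common bug: multiplying by a weight matrix stored as (n_out, n_in) without transposing it first.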
Broadcasting rules:
Dimensions are aligned from the right
Missing dimensions are treated as size 1
Operations broadcast over size-1 dimensions
Example: (8, 4) + (4,) → (4,) becomes (1, 4) → broadcast to (8, 4)
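The three broadcasting rules can be applied mechanically to a pair of shapes. The function `broadcast_shape` below is a hypothetical pure-Python sketch of that procedure for two operands, written for illustration rather than taken from any library.

```python
def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rules to two shapes and return the result shape."""
    # Pad the shorter shape with leading 1s so both align from the right
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)       # sizes agree, or b stretches to match a
        elif dim_a == 1:
            result.append(dim_b)       # a stretches to match b
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((8, 4), (4,)))          # (8, 4)
print(broadcast_shape((32, 1, 10), (5, 1)))   # (32, 5, 10)
```

A pair such as (8, 4) and (3,) raises an error, because the trailing dimensions 4 and 3 differ and neither is 1.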
Common shapes in a Transformer:
Input tokens: (batch, seq_len)
Embeddings: (batch, seq_len, d_model) e.g., (32, 512, 768)
Q, K, V: (batch, n_heads, seq_len, d_k)
Output: (batch, seq_len, d_model)
Interview Answer
"Neural network forward passes are sequences of matrix multiplications: Z = X @ W.T + b. For a batch of 256 examples with 512 inputs and 1024 output neurons, this is 134M multiply-add operations, parallelisable across GPU cores, which is why GPUs are essential. The key rule for matrix multiply: the inner dimensions must match, so (n, k) @ (k, m) → (n, m). Broadcasting allows adding bias vectors to batch outputs automatically. In Transformers, attention is itself matrix multiplication: softmax(Q @ K.T / √d_k) @ V. Everything reduces to highly optimised linear algebra routines. Understanding tensor shapes is fundamental to debugging neural network architectures."