
Matrix Operations in Deep Learning

The core matrix operations that power neural networks — matrix multiplication, broadcasting, batch operations, and how they map to PyTorch.

Asma Hafeez Khan, May 21, 2026, 5 min read
Tags: Deep Learning, Linear Algebra, Matrix Multiplication, PyTorch, Interview

The Core Operation: Matrix Multiplication

A neural network forward pass is, at its core, a sequence of matrix multiplications, each followed by a bias add and an element-wise activation:

Layer forward pass:
  Z = X @ W.T + b

Where:
  X: input batch,    shape (batch_size, n_in)
  W: weight matrix,  shape (n_out, n_in)
  b: bias vector,    shape (n_out,)
  Z: pre-activation, shape (batch_size, n_out)

Matrix multiply X @ W.T:
  (batch_size, n_in) @ (n_in, n_out) → (batch_size, n_out)
  
  Row i of Z holds the dot products of input i with each neuron's weight vector (one value per neuron).
  All examples in the batch are processed in parallel.
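
The formula above can be checked against an explicit loop. This is a minimal sketch in NumPy with small illustrative sizes; the function names `forward_loop` and `forward_matmul` are hypothetical and not part of any library.

```python
import numpy as np

def forward_loop(X, W, b):
    """Compute Z one dot product at a time (slow reference version)."""
    batch_size, n_out = X.shape[0], W.shape[0]
    Z = np.zeros((batch_size, n_out))
    for i in range(batch_size):        # one input example per row
        for j in range(n_out):         # one neuron per column
            Z[i, j] = np.dot(X[i], W[j]) + b[j]
    return Z

def forward_matmul(X, W, b):
    """Same computation as a single matrix multiply plus a broadcast bias."""
    return X @ W.T + b

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))   # 8 examples, 3 features
W = rng.standard_normal((4, 3))   # 4 neurons, 3 weights each
b = rng.standard_normal(4)

Z_loop = forward_loop(X, W, b)
Z_fast = forward_matmul(X, W, b)
print(Z_fast.shape)                  # (8, 4)
print(np.allclose(Z_loop, Z_fast))   # True
```

Both versions produce the same numbers; the matrix form simply hands the two loops to an optimised routine.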

Why Matrix Operations Matter

For a batch of 256 examples with 512 inputs and 1024 neurons:
  X: (256, 512)
  W: (1024, 512)
  
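  Operations: 256 × 512 × 1024 ≈ 134M multiply-adds
  
  Nested Python for loops on CPU: on the order of seconds or more
  Single matrix multiply on GPU: typically a fraction of a millisecond
  (exact timings depend on hardware, dtype, and library)
  
  The GPU spreads these 134M independent multiply-adds across thousands of cores.
  This is why GPUs are so widely used for deep learning and why batch processing is efficient.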
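
The operation count and the loop-versus-matmul gap can be reproduced on a CPU. The sketch below is illustrative only: `multiply_adds` is a hypothetical helper, the timing uses deliberately small sizes so the pure-Python loop finishes quickly, and the measured times will vary by machine.

```python
import time
import numpy as np

def multiply_adds(batch_size, n_in, n_out):
    """Multiply-add count for X (batch_size, n_in) times W.T (n_in, n_out)."""
    return batch_size * n_in * n_out

print(multiply_adds(256, 512, 1024))   # 134217728

# Small sizes (assumed for illustration) so the Python loop stays fast.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 64))
W = rng.standard_normal((128, 64))

start = time.perf_counter()
Z_loop = np.array(
    [[sum(x_k * w_k for x_k, w_k in zip(x, w)) for w in W] for x in X]
)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
Z_fast = X @ W.T
matmul_seconds = time.perf_counter() - start

print(f"loop:   {loop_seconds:.4f} s")
print(f"matmul: {matmul_seconds:.6f} s")
```

Even on a CPU, the vectorised call is usually far faster than the interpreted loop, because the work runs inside an optimised linear algebra library.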

NumPy Matrix Operations

Python
import numpy as np

# Matrix multiplication
A = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
B = np.array([[1, 0], [0, 1]])            # shape (2, 2)

C = A @ B                                  # (3, 2) — matrix multiply
# or: np.matmul(A, B)
print(f"Shape: {C.shape}")   # (3, 2)

# Transpose
W = np.random.randn(4, 3)   # weight matrix: 4 neurons, 3 inputs each
print(f"W shape: {W.shape}")       # (4, 3)
print(f"W.T shape: {W.T.shape}")   # (3, 4)

# Forward pass for a batch
batch_size = 8
X = np.random.randn(batch_size, 3)   # 8 examples, 3 features
b = np.zeros(4)

Z = X @ W.T + b   # (8, 3) @ (3, 4) + (4,) → (8, 4)
print(f"Pre-activation Z shape: {Z.shape}")   # (8, 4)

# Broadcasting: b is (4,) but Z is (8, 4)
# NumPy broadcasts b across the batch dimension automatically

PyTorch Tensor Operations

Python
import torch
import torch.nn as nn

# Basic tensor operations
x = torch.randn(3, 4)    # shape (3, 4)
y = torch.randn(4, 5)    # shape (4, 5)

z = x @ y                # matrix multiply: (3, 4) @ (4, 5) → (3, 5)
# or: torch.matmul(x, y)
# or: torch.mm(x, y)    — 2D only, no batch dimension

# Batched matrix multiply
A = torch.randn(32, 10, 64)   # batch=32, seq_len=10, d=64
B = torch.randn(32, 64, 20)   # batch=32, d=64, out=20

C = torch.bmm(A, B)           # (32, 10, 20): batch matmul, 3-D inputs only
# or: A @ B, which gives the same result here and also broadcasts batch dimensions

# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b_vec = torch.tensor([4.0, 5.0, 6.0])
print(a * b_vec)   # [4, 10, 18] — element-wise product (Hadamard)
print(a + b_vec)   # [5, 7, 9]   — element-wise addition

# Dot product (inner product)
dot = (a * b_vec).sum()   # or torch.dot(a, b_vec) for 1D
print(f"Dot product: {dot}")  # 32.0

# Outer product
outer = torch.outer(a, b_vec)   # (3, 3) matrix
print(f"Outer product:\n{outer}")

Attention Is a Matrix Operation

The scaled dot-product attention that powers Transformers is pure matrix ops:

Python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    Q: torch.Tensor,    # queries:  (batch, n_heads, seq_q, d_k)
    K: torch.Tensor,    # keys:     (batch, n_heads, seq_k, d_k)
    V: torch.Tensor,    # values:   (batch, n_heads, seq_k, d_v)
    mask: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    d_k = Q.size(-1)
    
    # Step 1: Q @ K.T → attention scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # shape: (batch, n_heads, seq_q, seq_k)
    
    # Step 2: Apply mask (optional)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    
    # Step 3: Softmax over seq_k dimension → attention weights
    weights = F.softmax(scores, dim=-1)
    # shape: (batch, n_heads, seq_q, seq_k)
    
    # Step 4: Weighted sum of values
    output = weights @ V
    # shape: (batch, n_heads, seq_q, d_v)
    
    return output, weights

# Attention is: output = softmax(Q @ K.T / √d_k) @ V
# A matrix multiply, softmax, then another matrix multiply
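
To see how a mask interacts with the shapes, here is a NumPy-only sketch of the same computation with a causal mask. The sizes and the `softmax` helper are assumptions chosen for illustration; it mirrors the steps above rather than replacing the PyTorch function.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

batch, n_heads, seq_len, d_k = 2, 4, 5, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, n_heads, seq_len, d_k))
K = rng.standard_normal((batch, n_heads, seq_len, d_k))
V = rng.standard_normal((batch, n_heads, seq_len, d_k))

# Causal mask: 1 where position i may attend to position j (j <= i), else 0.
mask = np.tril(np.ones((seq_len, seq_len)))

scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)   # (2, 4, 5, 5)
scores = np.where(mask == 0, -np.inf, scores)        # mask broadcasts over batch and heads
weights = softmax(scores, axis=-1)                   # each row sums to 1
output = weights @ V                                 # (2, 4, 5, 8)

print(scores.shape, output.shape)
```

The `(seq_len, seq_len)` mask needs no batch or head axis, since broadcasting aligns it with the last two dimensions of `scores`. Masked positions get a weight of exactly zero after the softmax.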

Key Shape Rules

Matrix multiply A @ B requires: A.shape[-1] == B.shape[-2]
  (n, k) @ (k, m) → (n, m)
  
  The "inner" dimensions must match.
  The "outer" dimensions determine output shape.
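
A quick way to internalise the rule is to try a matching and a mismatching pair. This is a small NumPy sketch with arbitrary example shapes; `mismatch_raised` is just a local flag for the demonstration.

```python
import numpy as np

A = np.ones((3, 2))
B = np.ones((2, 5))
print((A @ B).shape)          # (3, 5): inner dims 2 == 2, outer dims 3 and 5

# Mismatched inner dimensions raise an error instead of silently broadcasting.
try:
    A @ np.ones((3, 5))       # inner dims 2 != 3
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)        # True

# Leading (batch) dimensions are carried through unchanged.
batched = np.ones((32, 10, 64)) @ np.ones((32, 64, 20))
print(batched.shape)          # (32, 10, 20)
```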

Broadcasting rules:
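  Shapes are aligned from the right (trailing dimensions first)
  Missing leading dimensions are treated as size 1
  Two dimensions are compatible if they are equal or one of them is 1
  Size-1 dimensions are stretched to match the other operand
  
  Example: (8, 4) + (4,): the (4,) is treated as (1, 4), then broadcast to (8, 4)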
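
The broadcasting rules can be checked directly in NumPy. This sketch uses `np.broadcast_shapes` (available in NumPy 1.20 and later) plus one failing case; the array values are arbitrary.

```python
import numpy as np

print(np.broadcast_shapes((8, 4), (4,)))      # (8, 4)
print(np.broadcast_shapes((8, 1), (1, 4)))    # (8, 4)

Z = np.zeros((8, 4))
b = np.array([1.0, 2.0, 3.0, 4.0])
print((Z + b)[0])                             # [1. 2. 3. 4.]

# (8, 4) + (8,) fails: aligned from the right, 4 and 8 differ and neither is 1.
try:
    Z + np.ones(8)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)                       # True

# An explicit trailing axis makes a per-example term valid: (8, 1) broadcasts to (8, 4).
per_example = np.ones(8)[:, None]
print((Z + per_example).shape)                # (8, 4)
```

The failing case is a common source of shape bugs: a per-example vector needs an explicit size-1 axis before it can be combined with a `(batch, features)` array.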

Common shapes in a Transformer:
  Input tokens: (batch, seq_len)
  Embeddings: (batch, seq_len, d_model)    e.g., (32, 512, 768)
  Q, K, V:    (batch, n_heads, seq_len, d_k)
  Output:     (batch, seq_len, d_model)
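
Moving between the `(batch, seq_len, d_model)` and `(batch, n_heads, seq_len, d_k)` layouts is a reshape plus a transpose. The sketch below assumes `d_model = n_heads * d_k` and uses small made-up sizes; `x` stands in for an already-projected Q, K, or V tensor.

```python
import numpy as np

batch, seq_len, d_model, n_heads = 2, 6, 16, 4
d_k = d_model // n_heads             # 4 features per head

x = np.arange(batch * seq_len * d_model, dtype=np.float64).reshape(
    batch, seq_len, d_model
)

# Split the last axis into (n_heads, d_k), then move heads ahead of seq_len.
heads = x.reshape(batch, seq_len, n_heads, d_k).transpose(0, 2, 1, 3)
print(heads.shape)                   # (2, 4, 6, 4)

# Reverse the steps to merge heads back into d_model.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(merged.shape)                  # (2, 6, 16)
print(np.array_equal(merged, x))     # True
```

The round trip recovers the original array, which is a useful sanity check when debugging head-splitting code.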

Interview Answer

"Neural network forward passes are sequences of matrix multiplications: Z = X @ W.T + b. For a batch of 256 examples with 512 inputs and 1024 output neurons, this is 134M multiply-add operations — parallelisable across GPU cores, which is why GPUs are essential. The key rule for matrix multiply: the inner dimensions must match — (n, k) @ (k, m) → (n, m). Broadcasting allows adding bias vectors to batch outputs automatically. In Transformers, attention is itself matrix multiplication: softmax(Q @ K.T / √d_k) @ V — everything reduces to highly optimised linear algebra routines. Understanding tensor shapes is fundamental to debugging neural network architectures."
