Deep Learning for AI Interviews · Lesson 16 of 56

Tensors: The Data Structure of Deep Learning

What a Tensor Is

A tensor generalises scalars, vectors, and matrices to an arbitrary number of dimensions. The number of dimensions is the tensor's rank:

Rank 0 (scalar):   a single number         shape: ()
  loss = 0.42

Rank 1 (vector):   a 1D array              shape: (n,)
  embedding = [0.12, -0.34, 0.89, ...]     shape: (768,)

Rank 2 (matrix):   a 2D array              shape: (m, n)
  weight matrix W                           shape: (512, 768)

Rank 3 (tensor):   a 3D array              shape: (d1, d2, d3)
  batch of embeddings                       shape: (32, 512, 768)
  (batch_size, seq_len, d_model)

Rank 4 (tensor):   a 4D array              shape: (d1, d2, d3, d4)
  batch of images                           shape: (32, 3, 224, 224)
  (batch_size, channels, height, width)
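
The ranks listed above can be checked directly in PyTorch. This is a minimal sketch: the variable names are illustrative only, and the shapes are the ones from the list.

```python
import torch

# One tensor per rank, using the example shapes from the list above
loss = torch.tensor(0.42)               # rank 0: shape ()
embedding = torch.zeros(768)            # rank 1: shape (768,)
W = torch.zeros(512, 768)               # rank 2: shape (512, 768)
tokens = torch.zeros(32, 512, 768)      # rank 3: (batch_size, seq_len, d_model)
images = torch.zeros(32, 3, 224, 224)   # rank 4: (batch_size, channels, height, width)

examples = (loss, embedding, W, tokens, images)
ranks = [t.ndim for t in examples]           # number of dimensions of each tensor
shapes = [tuple(t.shape) for t in examples]  # shapes as plain tuples

for r, s in zip(ranks, shapes):
    print(f"rank {r}: shape {s}")
```

Note that `ndim` is the length of `shape`, so a scalar tensor has an empty shape and rank 0.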

PyTorch Tensor Creation

Python

import torch
import numpy as np

# Creation
a = torch.tensor([1.0, 2.0, 3.0])          # from list, infers dtype
b = torch.zeros(3, 4)                        # (3, 4) of zeros
c = torch.ones(2, 5, dtype=torch.float16)   # float16
d = torch.randn(8, 512)                      # N(0,1) random
e = torch.arange(0, 10, step=2)             # [0, 2, 4, 6, 8]
f = torch.linspace(0, 1, steps=5)           # [0.0, 0.25, 0.5, 0.75, 1.0]

# From NumPy (shares memory — no copy)
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(arr)

# Move to GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    d_gpu = d.to(device)
    d_gpu = d.cuda()   # equivalent

# Dtype control
x_float32 = torch.randn(3, dtype=torch.float32)
x_float16  = x_float32.half()    # float16
x_bfloat16 = x_float32.bfloat16()  # bfloat16: same exponent range as float32, so less prone to overflow in training
x_int32    = x_float32.int()
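
The "shares memory" comment on `torch.from_numpy` is worth seeing in action, because it is a common source of surprising bugs. The sketch below uses illustrative variable names and contrasts it with `torch.tensor`, which copies its input.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])   # NumPy defaults to float64
shared = torch.from_numpy(arr)             # shares the underlying buffer with arr
copied = torch.tensor(arr)                 # makes an independent copy

arr[0, 0] = 99.0                           # mutate the NumPy array in place

shared_sees_change = shared[0, 0].item() == 99.0   # True: same memory
copied_sees_change = copied[0, 0].item() == 99.0   # False: separate memory

# from_numpy keeps the NumPy dtype, so convert explicitly if float32 is needed
as_float32 = shared.float()                # new float32 tensor (this one is a copy)
print(shared.dtype, as_float32.dtype)
```

Because NumPy defaults to float64 and most PyTorch layers default to float32, an explicit `.float()` after `from_numpy` is often needed before feeding data to a model.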

Shape Operations

Python

x = torch.randn(32, 512, 768)   # (batch, seq_len, d_model)

# Shape inspection
print(x.shape)     # torch.Size([32, 512, 768])
print(x.ndim)      # 3
print(x.dtype)     # torch.float32
print(x.device)    # device(type='cpu') or device(type='cuda', index=0)
print(x.numel())   # 32 * 512 * 768 = 12,582,912

# Reshape
y = x.view(32, -1)           # (32, 512*768) = (32, 393216); never copies, needs a compatible (e.g. contiguous) layout
y = x.reshape(32, -1)        # (32, 393216); returns a view when possible, otherwise copies

# Transpose / permute
x_T = x.transpose(1, 2)      # swap dims 1 and 2 → (32, 768, 512)
x_P = x.permute(0, 2, 1)     # same as above

# Squeeze and unsqueeze
a = torch.randn(32, 1, 768)
b = a.squeeze(1)              # remove dim 1 → (32, 768)
c = b.unsqueeze(0)            # add dim 0 → (1, 32, 768)

# Stack and concatenate
a = torch.randn(32, 256)
b = torch.randn(32, 256)

cat_col = torch.cat([a, b], dim=1)    # (32, 512)  — concat along features
cat_row = torch.cat([a, b], dim=0)    # (64, 256)  — concat along batch
stacked = torch.stack([a, b], dim=0)  # (2, 32, 256)  — new dimension
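
A typical place where `view`, `permute`, and `reshape` meet is splitting a model dimension into attention heads. The following is a minimal sketch with small, made-up sizes (`n_heads` and `head_dim` are illustrative and do not appear in the lesson). It shows why `view` fails after `permute` while `reshape` still works.

```python
import torch

batch, seq_len, d_model, n_heads = 2, 5, 16, 4   # small illustrative sizes
head_dim = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)

# Split the last dimension into heads; view works because x is contiguous
heads = x.view(batch, seq_len, n_heads, head_dim)
heads = heads.permute(0, 2, 1, 3)          # (batch, n_heads, seq_len, head_dim)

is_contig = heads.is_contiguous()          # permute only changes strides, not memory

# view cannot merge dimensions whose strides are no longer compatible
try:
    heads.view(batch, n_heads, -1)
    view_failed = False
except RuntimeError:
    view_failed = True

flat = heads.reshape(batch, n_heads, -1)   # reshape copies when it has to

# Undo the split: permute back, then merge the head dimensions again
merged = heads.permute(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(is_contig, view_failed, flat.shape, torch.equal(merged, x))
```

An alternative to `reshape` is calling `.contiguous()` first and then `view`, which makes the copy explicit.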

Broadcasting

Python

# Broadcasting: operations between tensors with compatible shapes
# Shapes are aligned from the right, size-1 dims expand automatically

a = torch.randn(32, 512)   # (32, 512)
b = torch.randn(512)       # (512,) — broadcast to (1, 512) then (32, 512)

c = a + b   # works! → (32, 512)

# Common in neural networks: add bias to batched output
batch_output = torch.randn(8, 256)   # (batch, d)
bias = torch.zeros(256)              # (d,)
biased = batch_output + bias          # (8, 256) — bias added to each row

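# Attention mask broadcasting (additive mask: 0 keeps a position, -inf blocks it)
scores = torch.randn(8, 12, 100, 100)   # (batch, heads, seq, seq)
causal = torch.tril(torch.ones(100, 100, dtype=torch.bool))   # lower-triangular: no looking ahead
mask = torch.zeros(1, 1, 100, 100).masked_fill(~causal, float("-inf"))   # (1, 1, seq, seq)
masked = scores + mask   # broadcasts across batch and heads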
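
The right-alignment rule can be verified without allocating any data by using `torch.broadcast_shapes`. The sketch below also shows a shape pair that does not broadcast and the usual fix; the variable names are illustrative.

```python
import torch

# Shapes are compared from the right; a size-1 (or missing) dim expands to match
s1 = torch.broadcast_shapes((32, 512), (512,))
s2 = torch.broadcast_shapes((8, 12, 100, 100), (1, 1, 100, 100))
s3 = torch.broadcast_shapes((32, 1), (1, 512))

# (32, 512) with (32,) fails: the trailing dims 512 and 32 differ and neither is 1
a = torch.randn(32, 512)
per_row = torch.randn(32)
try:
    a * per_row
    mismatch = False
except RuntimeError:
    mismatch = True

# Fix: add a trailing size-1 dim so the shapes align as (32, 512) and (32, 1)
scaled = a * per_row.unsqueeze(1)
print(s1, s2, s3, mismatch, scaled.shape)
```

The failing case is a frequent bug when scaling each row of a batch by a per-example value: the vector must be shaped `(32, 1)`, not `(32,)`.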

Common Tensor Operations in DL

Python

# Reduction operations
x = torch.randn(32, 512)
x.mean()              # scalar: mean of all elements
x.mean(dim=0)         # (512,): mean across batch dimension
x.mean(dim=1)         # (32,): mean across feature dimension
x.mean(dim=1, keepdim=True)   # (32, 1): keeps dimension

x.sum(dim=-1)         # sum along last dimension
x.max(dim=1).values   # max along dim 1
x.argmax(dim=1)       # index of max along dim 1

# Softmax
logits = torch.randn(8, 10)    # (batch, n_classes)
probs = torch.softmax(logits, dim=-1)   # (batch, n_classes), sums to 1 per row

# Matrix multiply
A = torch.randn(32, 128)
B = torch.randn(128, 64)
C = A @ B              # (32, 64)
C = torch.matmul(A, B) # same

# Element-wise
y = torch.randn(32, 512)   # same shape as x
x * y    # element-wise product (Hadamard)
x + y    # element-wise add
x.pow(2) # element-wise square
x.abs().sqrt() # element-wise sqrt (abs first: sqrt of a negative number gives NaN)

# Norm
l2_norm = x.norm(p=2, dim=-1)        # L2 norm along last dim
x_normalised = x / (l2_norm.unsqueeze(-1) + 1e-8)
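
As a sanity check on the reductions above, the sketch below verifies that L2-normalised rows have unit length and that softmax rows sum to 1. It uses `keepdim=True` instead of `unsqueeze(-1)`, which is an equivalent way to keep the result broadcastable, and compares against `torch.nn.functional.normalize`. Variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 512)

# keepdim=True leaves a size-1 dim, so the division broadcasts row by row
norms = x.norm(p=2, dim=-1, keepdim=True)    # (32, 1)
x_unit = x / (norms + 1e-8)                  # epsilon guards against division by zero

unit_norms = x_unit.norm(p=2, dim=-1)        # each entry should be close to 1
x_unit_builtin = F.normalize(x, p=2, dim=-1) # built-in equivalent

logits = torch.randn(8, 10)
probs = torch.softmax(logits, dim=-1)
row_sums = probs.sum(dim=-1)                 # each entry should be close to 1

print(unit_norms[:3], row_sums[:3])
```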

GPU Tensor Operations

Python

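device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and data so the snippet runs on its own
model = torch.nn.Linear(768, 10)
X = torch.randn(32, 768)
y = torch.randint(0, 10, (32,))

# Everything must be on the same device
model = model.to(device)
X = X.to(device)
y = y.to(device)

# Check device
print(X.device)  # cuda:0 on a GPU machine, otherwise cpu

# Memory management (CUDA only)
if device.type == "cuda":
    torch.cuda.empty_cache()   # release unused cached blocks held by PyTorch's allocator
    print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Disable gradient tracking for inference, then convert to NumPy
with torch.no_grad():
    pred = model(X)             # no gradient tracking
arr = pred.detach().cpu().numpy()  # .cpu() is required before .numpy(); detach() is redundant under no_grad but harmless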
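
The reason for `detach()` in the conversion above becomes clear when gradients are being tracked. This is a minimal CPU-only sketch with illustrative names: calling `.numpy()` on a tensor that requires grad raises an error, while detaching it first or computing under `no_grad` works.

```python
import torch

w = torch.randn(3, requires_grad=True)
out = w * 2                       # out is part of the autograd graph

# NumPy cannot represent a tensor that still tracks gradients
try:
    out.numpy()
    raised = False
except RuntimeError:
    raised = True

arr = out.detach().cpu().numpy()  # detach from the graph, move to CPU, convert

# Under no_grad, results are never attached to the graph in the first place
with torch.no_grad():
    out_ng = w * 2
tracks_grad = out_ng.requires_grad

print(raised, arr.shape, tracks_grad)
```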

Interview Answer

"A tensor is a multi-dimensional array generalising scalars (rank 0), vectors (rank 1), matrices (rank 2) to arbitrary rank. In deep learning, tensors represent batches of data: a batch of images is rank-4 (batch, channels, height, width); a batch of token embeddings is rank-3 (batch, seq_len, d_model). The critical operations are: reshape/view (change dimensions without data copy), permute (reorder dimensions, essential for attention), broadcasting (implicit dimension expansion for element-wise ops), and reductions (mean, sum, max across dimensions). In PyTorch, all gradients flow through tensor operations — the computation graph is built dynamically during the forward pass, enabling backpropagation via autograd."

PyTorch vs TensorFlow: Interview Perspective
