Tensors Explained
What tensors are, how they generalise scalars, vectors, and matrices, tensor shapes in deep learning, and common PyTorch tensor operations.
What a Tensor Is
A tensor generalises scalars, vectors, and matrices to an arbitrary number of dimensions. That number of dimensions is the tensor's rank (reported as ndim in PyTorch):
Rank 0 (scalar): a single number, shape ()
    loss = 0.42
Rank 1 (vector): a 1D array, shape (n,)
    embedding = [0.12, -0.34, 0.89, ...], shape (768,)
Rank 2 (matrix): a 2D array, shape (m, n)
    weight matrix W, shape (512, 768)
Rank 3 (tensor): a 3D array, shape (d1, d2, d3)
    batch of embeddings, shape (32, 512, 768)
    (batch_size, seq_len, d_model)
Rank 4 (tensor): a 4D array, shape (d1, d2, d3, d4)
    batch of images, shape (32, 3, 224, 224)
    (batch_size, channels, height, width)
PyTorch Tensor Creation
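As a quick check before the creation routines, the ranks listed above map directly onto ndim and shape in PyTorch. This is a minimal sketch: the dictionary keys and the small sizes are illustrative choices, not part of the original examples.

```python
import torch

# One example tensor per rank; names and sizes are illustrative only.
examples = {
    "scalar": torch.tensor(0.42),            # rank 0
    "vector": torch.zeros(768),              # rank 1
    "matrix": torch.zeros(512, 768),         # rank 2
    "embeddings": torch.zeros(4, 16, 768),   # rank 3: (batch, seq_len, d_model)
    "images": torch.zeros(4, 3, 32, 32),     # rank 4: (batch, channels, height, width)
}

# ndim is the rank; shape has one entry per dimension.
ranks = {name: t.ndim for name, t in examples.items()}
for name, t in examples.items():
    print(f"{name:<10} ndim={t.ndim} shape={tuple(t.shape)}")
```

Note that a rank-0 tensor has an empty shape, which is why a scalar loss prints as `torch.Size([])`.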
import torch
import numpy as np
# Creation
a = torch.tensor([1.0, 2.0, 3.0]) # from list, infers dtype
b = torch.zeros(3, 4) # (3, 4) of zeros
c = torch.ones(2, 5, dtype=torch.float16) # float16
d = torch.randn(8, 512) # N(0,1) random
e = torch.arange(0, 10, step=2) # [0, 2, 4, 6, 8]
f = torch.linspace(0, 1, steps=5) # [0.0, 0.25, 0.5, 0.75, 1.0]
# From NumPy (shares memory — no copy)
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(arr)
# Move to GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    d_gpu = d.to(device)
    d_gpu = d.cuda() # equivalent
# Dtype control
x_float32 = torch.randn(3, dtype=torch.float32)
x_float16 = x_float32.half() # float16
x_bfloat16 = x_float32.bfloat16() # bfloat16: float32's exponent range with fewer mantissa bits, often more stable than float16 for training
x_int32 = x_float32.int()
Shape Operations
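One point in this section worth seeing in isolation is how view and reshape differ on a non-contiguous tensor. The sketch below uses a small transposed tensor to show it. The names xt, flat_reshape, and flat_view are illustrative.

```python
import torch

x = torch.arange(6).reshape(2, 3)   # contiguous: rows stored one after another
xt = x.t()                          # transposed view of the same storage, not contiguous

print(x.is_contiguous(), xt.is_contiguous())

# On a contiguous tensor, view shares storage with the original.
shares_storage = x.view(6).data_ptr() == x.data_ptr()

# On the transposed tensor, view cannot express the flattened layout.
try:
    xt.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

flat_reshape = xt.reshape(6)          # falls back to a copy in this case
flat_view = xt.contiguous().view(6)   # explicit copy, then a view of that copy

print(view_failed, flat_reshape.tolist())
```

In practice this comes up most often right after transpose or permute, which is why reshape (or contiguous().view) is the usual choice there.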
x = torch.randn(32, 512, 768) # (batch, seq_len, d_model)
# Shape inspection
print(x.shape) # torch.Size([32, 512, 768])
print(x.ndim) # 3
print(x.dtype) # torch.float32
print(x.device) # device(type='cpu') or device(type='cuda', index=0)
print(x.numel()) # 32 * 512 * 768 = 12,582,912
# Reshape
y = x.view(32, -1) # (32, 512*768) = (32, 393216); never copies, so it needs a compatible (e.g. contiguous) memory layout
y = x.reshape(32, -1) # (32, 393216); returns a view when possible, otherwise copies
# Transpose / permute
x_T = x.transpose(1, 2) # swap dims 1 and 2 → (32, 768, 512)
x_P = x.permute(0, 2, 1) # same as above
# Squeeze and unsqueeze
a = torch.randn(32, 1, 768)
b = a.squeeze(1) # remove dim 1 → (32, 768)
c = b.unsqueeze(0) # add dim 0 → (1, 32, 768)
# Stack and concatenate
a = torch.randn(32, 256)
b = torch.randn(32, 256)
cat_col = torch.cat([a, b], dim=1) # (32, 512) — concat along features
cat_row = torch.cat([a, b], dim=0) # (64, 256) — concat along batch
stacked = torch.stack([a, b], dim=0) # (2, 32, 256): new dimension
Broadcasting
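The right-to-left alignment rule used in this section can be checked without allocating any data, using torch.broadcast_shapes. The sketch below also shows a mismatch and one common fix. The variable names are illustrative.

```python
import torch

# Shapes are compared right to left; two sizes are compatible
# if they are equal or one of them is 1. Missing leading dims count as 1.
bias_case = torch.broadcast_shapes((32, 512), (512,))
mask_case = torch.broadcast_shapes((8, 12, 100, 100), (1, 1, 100, 100))
outer_case = torch.broadcast_shapes((32, 1), (1, 512))
print(bias_case, mask_case, outer_case)

# Incompatible: the trailing sizes 512 and 32 differ and neither is 1.
try:
    torch.randn(32, 512) + torch.randn(32)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# A common fix: add a trailing size-1 dim so (32,) becomes (32, 1).
per_row = torch.randn(32)
fixed = torch.randn(32, 512) + per_row.unsqueeze(-1)
print(mismatch_raised, fixed.shape)
```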
# Broadcasting: operations between tensors with compatible shapes
# Shapes are aligned from the right, size-1 dims expand automatically
a = torch.randn(32, 512) # (32, 512)
b = torch.randn(512) # (512,) — broadcast to (1, 512) then (32, 512)
c = a + b # works! → (32, 512)
# Common in neural networks: add bias to batched output
batch_output = torch.randn(8, 256) # (batch, d)
bias = torch.zeros(256) # (d,)
biased = batch_output + bias # (8, 256) — bias added to each row
# Attention mask broadcasting
scores = torch.randn(8, 12, 100, 100) # (batch, heads, seq, seq)
mask = torch.zeros(1, 1, 100, 100) # (1, 1, seq, seq) additive mask: 0 keeps a position, -inf blocks it
masked = scores + mask # broadcasts across batch and heads
Common Tensor Operations in DL
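The reductions and element-wise operations in this section combine into familiar building blocks. As an illustration, the sketch below writes softmax by hand with keepdim=True and compares it against torch.softmax. It is not a description of how PyTorch implements softmax internally.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, 10)   # (batch, n_classes)

# Subtract the per-row max for numerical stability.
# keepdim=True keeps shape (8, 1), so it broadcasts against (8, 10).
shifted = logits - logits.max(dim=-1, keepdim=True).values
exp = shifted.exp()
manual = exp / exp.sum(dim=-1, keepdim=True)   # (8, 10) / (8, 1)

reference = torch.softmax(logits, dim=-1)
print(torch.allclose(manual, reference, atol=1e-6))
print(manual.sum(dim=-1))   # each row sums to (approximately) 1
```

Without keepdim=True the reductions would return shape (8,), which does not broadcast against (8, 10), so the subtraction and division would raise an error.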
# Reduction operations
x = torch.randn(32, 512)
x.mean() # scalar: mean of all elements
x.mean(dim=0) # (512,): mean across batch dimension
x.mean(dim=1) # (32,): mean across feature dimension
x.mean(dim=1, keepdim=True) # (32, 1): keeps dimension
x.sum(dim=-1) # sum along last dimension
x.max(dim=1).values # max along dim 1
x.argmax(dim=1) # index of max along dim 1
# Softmax
logits = torch.randn(8, 10) # (batch, n_classes)
probs = torch.softmax(logits, dim=-1) # (batch, n_classes), sums to 1 per row
# Matrix multiply
A = torch.randn(32, 128)
B = torch.randn(128, 64)
C = A @ B # (32, 64)
C = torch.matmul(A, B) # same
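# Element-wise (operands need equal or broadcastable shapes)
y = torch.randn(32, 512) # same shape as x
x * y # element-wise product (Hadamard)
x + y # element-wise add
x.pow(2) # element-wise square
x.sqrt() # element-wise sqrt (NaN for negative entries)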
# Norm
l2_norm = x.norm(p=2, dim=-1) # L2 norm along last dim
x_normalised = x / (l2_norm.unsqueeze(-1) + 1e-8)
GPU Tensor Operations
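The snippet in this section assumes that model, X, and y already exist. For a version that runs on its own, here is a sketch of the same device-handling pattern. The nn.Linear model and the random input batch are hypothetical stand-ins, not part of the original example.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for a real model and input batch.
model = nn.Linear(512, 10).to(device)
X = torch.randn(32, 512).to(device)

model.eval()
with torch.no_grad():          # no computation graph is recorded
    pred = model(X)            # (32, 10), lives on the same device as the model

arr = pred.cpu().numpy()       # NumPy conversion requires a CPU tensor
print(pred.device, pred.requires_grad, arr.shape)
```

The same code runs unchanged on a CPU-only machine, which is the main reason to route every `.to(...)` call through a single device variable.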
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Everything must be on the same device (model, X, and y are assumed to be defined already)
model = model.to(device)
X = X.to(device)
y = y.to(device)
# Check device
print(X.device) # cuda:0 on a GPU machine, otherwise cpu
# Memory management
torch.cuda.empty_cache() # release unused cached blocks back to the driver; live tensors are unaffected
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
# Detach from computation graph (for inference or numpy conversion)
with torch.no_grad():
    pred = model(X) # no gradient tracking
arr = pred.detach().cpu().numpy() # to numpy
Interview Answer
"A tensor is a multi-dimensional array generalising scalars (rank 0), vectors (rank 1), matrices (rank 2) to arbitrary rank. In deep learning, tensors represent batches of data: a batch of images is rank-4 (batch, channels, height, width); a batch of token embeddings is rank-3 (batch, seq_len, d_model). The critical operations are: reshape/view (change dimensions without data copy), permute (reorder dimensions, essential for attention), broadcasting (implicit dimension expansion for element-wise ops), and reductions (mean, sum, max across dimensions). In PyTorch, all gradients flow through tensor operations — the computation graph is built dynamically during the forward pass, enabling backpropagation via autograd."