Deep Learning for AI Interviews · Lesson 4 of 56
Compute, Data, and Scale Requirements for DL
Why GPUs?
Neural network training = massive matrix multiplications
CPU: general purpose, few powerful cores (8–64)
Good at: sequential code, branching, complex logic
Bad at: millions of small parallel operations
GPU: thousands of weaker cores (NVIDIA H100: 16,896 CUDA cores)
Good at: massively parallel matrix operations
Bad at: complex sequential logic
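To make the contrast concrete, here is a rough sketch that counts the work in a single dense layer. The layer sizes (batch of 64, 4096 inputs, 4096 outputs) are illustrative assumptions, not figures from this lesson. The point is that every output element is an independent dot product, which is exactly the kind of work thousands of GPU cores can share.

```python
def dense_layer_work(batch: int, d_in: int, d_out: int) -> dict:
    """Count independent outputs and FLOPs for Y = X @ W.

    X has shape (batch, d_in) and W has shape (d_in, d_out).
    """
    # Each output element is its own dot product and needs no other output.
    independent_outputs = batch * d_out
    # Each dot product has d_in terms, one multiply and one add per term.
    flops = 2 * batch * d_in * d_out
    return {"independent_outputs": independent_outputs, "flops": flops}


# Hypothetical layer sizes, chosen only for illustration.
work = dense_layer_work(batch=64, d_in=4096, d_out=4096)
print(work)
```

A CPU with a few dozen cores has to walk through those independent outputs in large sequential chunks, while a GPU can schedule many of them at once.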
Training a ResNet-50 on ImageNet:
CPU only: weeks
V100 GPU: hours
H100 GPU: ~40 minutes
The matrix multiply at the core of neural network forward/backward passes
is embarrassingly parallelisable → GPU is essential.
GPU Memory
Model parameters consume GPU memory:
Parameter count × bytes per parameter = memory
GPT-2 (117M params) at float32: 117M × 4 bytes = 468MB
LLaMA 7B at float16: 7B × 2 bytes = 14GB
LLaMA 70B at float16: 70B × 2 bytes = 140GB → needs multiple GPUs
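The arithmetic above can be sketched as a small helper. The function name is hypothetical, and it uses decimal units (1 GB = 10^9 bytes) to match the figures in the list.

```python
def param_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in decimal GB."""
    return num_params * bytes_per_param / 1e9


# (name, parameter count, bytes per parameter) taken from the list above.
models = [
    ("GPT-2 float32", 117e6, 4),
    ("LLaMA 7B float16", 7e9, 2),
    ("LLaMA 70B float16", 70e9, 2),
]

for name, n_params, n_bytes in models:
    print(f"{name}: {param_memory_gb(n_params, n_bytes):.3f} GB")
```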
During training, also need:
Activations for backward pass: often 2–4× model size
Gradients: same size as model
Optimiser states (Adam): 2× model size (m and v)
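Rule of thumb: training needs at least 4–6× model parameter memory (weights, gradients, and Adam states alone are 4×; activations come on top and grow with batch size and sequence length)
LLaMA 7B training at float16: ~56–84GB GPU memory (a single A100 80GB only at the low end, otherwise multi-GPU)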
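A minimal sketch of how those components add up, using the multipliers listed above. The function name and the `activation_multiple` parameter are hypothetical. Activation memory in practice depends heavily on batch size and sequence length, so treat it as a knob rather than a constant.

```python
def training_memory_gb(
    num_params: float,
    bytes_per_param: int = 2,
    activation_multiple: float = 0.0,  # activations as a multiple of weight memory
) -> dict:
    """Break training memory into the components listed above (decimal GB)."""
    weights = num_params * bytes_per_param / 1e9
    parts = {
        "weights": weights,
        "gradients": weights,          # same size as the model
        "adam_states": 2 * weights,    # m and v
        "activations": activation_multiple * weights,
    }
    parts["total"] = sum(parts.values())
    return parts


no_activations = training_memory_gb(7e9)                          # 4x weights
with_activations = training_memory_gb(7e9, activation_multiple=2.0)  # 6x weights
print(no_activations["total"], with_activations["total"])
```

With a 7B model at float16 this gives 56GB before activations and 84GB with activations at 2× the weights; larger activation multiples push the total higher still.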
During inference (serving), just need:
Model parameters
KV cache (for LLMs)
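The KV cache grows with context length, so it is worth estimating separately from the weights. The sketch below assumes a LLaMA-7B-like shape (32 layers, 32 key/value heads, head dimension 128) and a 4096-token context. These shapes and the function name are assumptions for illustration and are not taken from this lesson.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # float16
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed LLaMA-7B-like shape, single sequence of 4096 tokens.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{cache / 1e9:.2f} GB")
```

Under these assumptions one 4096-token sequence adds roughly 2GB on top of the weights, and the figure scales linearly with batch size and context length.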
LLaMA 7B inference at float16: ~14GB (fits in A100 40GB)
Common GPU Classes
Consumer / entry-level:
NVIDIA RTX 4090: 24GB VRAM, ~83 TFLOPS FP16 (non-tensor-core rate; tensor-core throughput is higher)
Cost: ~$1500–2000
Good for: fine-tuning small models, local inference
Research / mid-tier:
NVIDIA A100 40GB: 40GB VRAM, 312 TFLOPS FP16
Cost: ~$10,000 (purchase), $2–4/hour (cloud)
Good for: pre-training medium models, research
High-end / production:
NVIDIA H100 80GB (SXM): 80GB VRAM, ~989 TFLOPS FP16 dense (1979 TFLOPS only with structured sparsity)
Cost: ~$30,000–40,000, $6–8/hour (cloud)
Good for: LLM training, production inference at scale
Multi-GPU configurations:
8× H100 node: 640GB total VRAM, NVLink high-bandwidth interconnect
Used for: training LLMs (LLaMA 70B at float16 needs 140GB for weights alone, so at least 2 H100s even for inference, and more once a large KV cache is included)
Training Time Estimates
def estimate_training_time(
    model_params: int,            # number of parameters
    dataset_size: int,            # number of training examples
    batch_size: int,
    epochs: int,
    gpu_tflops: float,            # GPU peak TFLOPS for your precision
    tokens_per_example: int = 1,  # sequence length for LLMs; 1 for single-item examples
    efficiency: float = 0.4,      # GPU utilisation (40% is typical)
) -> dict:
    """Rough training time estimate."""
    # FLOPs per forward+backward pass ≈ 6 × model_params per token
    flops_per_step = 6 * model_params * batch_size * tokens_per_example
    steps_per_epoch = dataset_size // batch_size
    total_steps = steps_per_epoch * epochs
    total_flops = flops_per_step * total_steps
    # Convert peak TFLOPS to achieved FLOP/s
    effective_flops_per_second = gpu_tflops * efficiency * 1e12
    training_seconds = total_flops / effective_flops_per_second
    training_hours = training_seconds / 3600
    return {
        "total_steps": total_steps,
        "total_flops": f"{total_flops:.2e}",
        "estimated_hours": round(training_hours, 1),
    }

# Example: fine-tuning a 7B parameter model
result = estimate_training_time(
    model_params=7_000_000_000,
    dataset_size=100_000,
    batch_size=8,
    epochs=3,
    gpu_tflops=312,           # A100 FP16
    tokens_per_example=1024,  # assumed average sequence length
)
print(result)  # roughly 20–40 hours on a single A100 (about 29 with these inputs)

Cloud vs On-Premise
Cloud (AWS, GCP, Azure):
+ No upfront cost
+ Access to latest GPUs (H100) on demand
+ Pay per use
- Expensive for continuous workloads ($4–8/hour × 1000 hours = $4000–8000)
- Data sovereignty concerns for healthcare (PHI)
On-premise GPU cluster:
+ Cost-effective for continuous training
+ Data stays on-site (HIPAA compliance easier)
- High upfront cost ($30K–1M+)
- Maintenance overhead
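A back-of-the-envelope way to compare the two options is a break-even calculation. The sketch below uses the H100 purchase price (~$30,000) and cloud rates ($6–8/hour) quoted earlier in this lesson. The function name is hypothetical, and the calculation deliberately ignores power, cooling, staffing, and depreciation, all of which push the real break-even point later.

```python
def break_even_hours(upfront_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """GPU-hours after which buying is cheaper than renting (hardware cost only)."""
    return upfront_cost_usd / cloud_rate_usd_per_hour


# H100 figures quoted earlier: ~$30,000 purchase, $6-8 per hour in the cloud.
slow = break_even_hours(30_000, 6.0)
fast = break_even_hours(30_000, 8.0)
print(f"Break-even after {fast:.0f}-{slow:.0f} GPU-hours")
```

At 24/7 utilisation that is roughly five to seven months per GPU, which is why continuous training workloads tend to favour on-premise hardware and occasional ones favour cloud.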
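Clinical AI guideline: PHI processing should remain on-premise or in
a HIPAA-eligible cloud environment covered by a Business Associate Agreement (BAA).
Government regions such as AWS GovCloud or Azure Government are one option, not a requirement.
Reducing Compute Requirements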
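# 1. Mixed precision (float16 compute, float32 master weights)
# Assumes model, criterion, optimizer, input and target already exist on the GPU.
from torch.cuda.amp import autocast, GradScaler  # newer PyTorch also exposes these under torch.amp
scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow in float16
scaler.step(optimizer)         # unscales gradients and skips the step if they contain inf/NaN
scaler.update()
# up to ~2× speedup, lower activation memory, minimal accuracy loss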
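Why the loss is scaled at all: float16 cannot represent very small values, so tiny gradients silently become zero. The sketch below shows this with the standard library only, rounding values to half precision via `struct`. The gradient value and the scale factor of 65536 are chosen for illustration, and the helper name is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8   # illustrative gradient magnitude
scale = 65536.0    # 2**16, illustrative loss scale

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaled before storage, it survives; dividing afterwards restores its magnitude.
recovered = to_fp16(tiny_grad * scale) / scale

print(unscaled, recovered)
```

This is the same idea the scaler applies: multiply the loss before the backward pass, then divide the gradients back before the optimiser step.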
# 2. Gradient accumulation (simulate large batch on limited GPU memory)
accumulation_steps = 4  # effective batch = batch_size × 4
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps  # average over micro-batches
    loss.backward()  # gradients add up in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per accumulation_steps micro-batches
        optimizer.zero_grad()
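The division by accumulation_steps is what makes the accumulated gradient match a single large batch. The sketch below checks this in plain Python for a one-parameter linear model with mean squared error. The data, the weight value, and the helper name are made up for illustration.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data: 8 examples, hypothetical values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient from one batch of all 8 examples.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient from 4 micro-batches of 2, each scaled by 1/accumulation_steps.
accumulation_steps = 4
micro_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro_size, (i + 1) * micro_size)
    accumulated_grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two values agree up to floating-point rounding, provided the micro-batches are the same size.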
# 3. LoRA (train only 0.1–1% of parameters)
from peft import LoraConfig, get_peft_model
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
# 7B model with LoRA: trains only ~4M params instead of 7B
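The ~4M figure can be reproduced with simple arithmetic. The sketch below assumes a LLaMA-7B-like shape (32 layers, 4096×4096 attention projections) and adapters on two projections per layer (query and value). Those shapes, the choice of target modules, and the function name are assumptions for illustration.

```python
def lora_trainable_params(
    n_layers: int,
    d_in: int,
    d_out: int,
    r: int,
    modules_per_layer: int,
) -> int:
    """Count LoRA parameters: each adapted weight gets A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return n_layers * modules_per_layer * per_module


# Assumed LLaMA-7B-like shape with r=8 on two attention projections per layer.
trainable = lora_trainable_params(n_layers=32, d_in=4096, d_out=4096, r=8, modules_per_layer=2)
fraction = trainable / 7e9
print(f"{trainable:,} trainable params ({fraction:.4%} of 7B)")
```

Adapting more modules or raising r increases the count proportionally, which is how the trainable share moves up toward the 0.1–1% range.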
# 4. Gradient checkpointing (recompute activations instead of storing them)
model.gradient_checkpointing_enable()  # 30–50% memory saving, ~30% slower
Interview Answer
"Deep learning requires GPUs because neural network training is dominated by matrix multiplications — massively parallel operations that GPUs handle with thousands of cores, orders of magnitude faster than CPUs. Memory is the key constraint: a 7B parameter model at float16 occupies 14GB of VRAM, and training needs 4–6× more for activations, gradients, and optimiser states. Common optimisations: mixed precision (float16 saves 2× memory and 2× compute), gradient accumulation (simulate large batch on small GPU), LoRA (fine-tune only 0.1% of parameters), and gradient checkpointing (trade compute for memory). For clinical AI with PHI, cloud GPU instances must be HIPAA-compliant (AWS GovCloud, Azure Government) or training must happen on-premise."