Learnixo

Deep Learning for AI Interviews · Lesson 4 of 56

Compute, Data, and Scale Requirements for DL

Why GPUs?

Neural network training = massive matrix multiplications

CPU: general purpose, few powerful cores (8–64)
  Good at: sequential code, branching, complex logic
  Bad at: millions of small parallel operations

GPU: thousands of weaker cores (NVIDIA H100: 16,896 CUDA cores)
  Good at: massively parallel matrix operations
  Bad at: complex sequential logic

Training a ResNet-50 on ImageNet (rough orders of magnitude; actual times
depend on GPU count, precision, and input pipeline):
  CPU only:  weeks
  V100 GPU:  hours
  H100 GPU:  ~40 minutes

The matrix multiply at the core of neural network forward/backward passes
is embarrassingly parallelisable → GPU is essential.
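To make the scale concrete, here is a minimal sketch that counts the floating-point operations in one dense layer's matrix multiply and converts them to time at a given throughput. The function names are illustrative, the 312 TFLOPS figure is the A100 FP16 peak quoted later in this lesson, and the 1 TFLOPS CPU figure is an assumed round number, not a measurement.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def seconds_at(flops: int, tflops: float) -> float:
    """Time to execute `flops` operations at a sustained rate of `tflops` TFLOPS."""
    return flops / (tflops * 1e12)


# One dense layer: 4096 input vectors of width 4096 through a 4096 x 4096 weight matrix.
layer_flops = matmul_flops(4096, 4096, 4096)
independent_outputs = 4096 * 4096  # every output element can be computed in parallel

cpu_seconds = seconds_at(layer_flops, tflops=1.0)    # assumed CPU throughput
gpu_seconds = seconds_at(layer_flops, tflops=312.0)  # A100 FP16 peak (rarely reached in practice)

print(f"FLOPs per layer: {layer_flops:.2e}")
print(f"Independent output elements: {independent_outputs:,}")
print(f"CPU: {cpu_seconds * 1e3:.1f} ms, GPU: {gpu_seconds * 1e3:.2f} ms")
```

The point of the sketch is the count of independent outputs: millions of results that do not depend on each other are exactly the workload thousands of simple cores are built for.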

GPU Memory

Model parameters consume GPU memory:
  Parameter count × bytes per parameter = memory

  GPT-2 (117M params) at float32: 117M × 4 bytes = 468MB
  LLaMA 7B at float16: 7B × 2 bytes = 14GB
  LLaMA 70B at float16: 70B × 2 bytes = 140GB → needs multiple GPUs
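The arithmetic above can be sketched as a small helper. The function name and dtype table are illustrative, and sizes use decimal gigabytes (1 GB = 10^9 bytes) to match the figures in this lesson.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def param_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Memory for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


models = [
    ("GPT-2", 117_000_000, "float32"),
    ("LLaMA 7B", 7_000_000_000, "float16"),
    ("LLaMA 70B", 70_000_000_000, "float16"),
]
for name, num_params, dtype in models:
    print(f"{name} at {dtype}: {param_memory_gb(num_params, dtype):.3f} GB")
```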

During training, also need:
  Activations for backward pass: often 2–4× model size
  Gradients: same size as model
  Optimiser states (Adam): 2× model size (m and v)
  
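  Rule of thumb: training needs at least 4–6× model parameter memory
  (weights + gradients + Adam states are already 4×; activations scale
  with batch size and sequence length)
  LLaMA 7B training: ~56–84GB GPU memory or more (needs A100 80GB or multi-GPU)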
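A rough tally of the components listed above, as a sketch. It follows this lesson's simplification that gradients and both Adam states are stored at the same precision as the weights (real mixed-precision setups often keep float32 optimiser states, which costs more). The function name and the `activation_multiple` parameter are illustrative. Activations depend on batch size and sequence length, so the multiple is an input rather than a constant.

```python
def training_memory_gb(model_gb: float, activation_multiple: float = 2.0) -> dict:
    """Break down training memory as multiples of the weight memory."""
    breakdown = {
        "weights": model_gb,
        "gradients": model_gb,           # same size as the model
        "adam_states": 2 * model_gb,     # first moment (m) and second moment (v)
        "activations": activation_multiple * model_gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# LLaMA 7B at float16 (14 GB of weights)
without_activations = training_memory_gb(14.0, activation_multiple=0.0)
with_activations = training_memory_gb(14.0, activation_multiple=2.0)

print(f"Before activations: {without_activations['total']:.0f} GB")
print(f"With 2x activations: {with_activations['total']:.0f} GB")
```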

During inference (serving), just need:
  Model parameters
  KV cache (for LLMs)
  LLaMA 7B inference at float16: ~14GB (fits in A100 40GB)
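The KV cache grows with context length and batch size, so it is worth estimating separately from the weights. A minimal sketch, assuming the commonly published LLaMA 7B shape (32 layers, 32 attention heads, head dimension 128), which is not stated in this lesson; the function name is illustrative.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # float16
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(f"Per token: {per_token / 2**20:.2f} MiB")
print(f"4096-token context, batch 1: {full_context / 1e9:.2f} GB")
```

Under these assumptions a single 4096-token sequence adds a couple of gigabytes on top of the 14GB of weights, and the cost multiplies with batch size.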

Common GPU Classes

Consumer / entry-level:
  NVIDIA RTX 4090: 24GB VRAM, ~83 TFLOPS FP16
  Cost: ~$1500–2000
  Good for: fine-tuning small models, local inference

Research / mid-tier:
  NVIDIA A100 40GB: 40GB VRAM, 312 TFLOPS FP16
  Cost: ~$10,000 (purchase), $2–4/hour (cloud)
  Good for: pre-training medium models, research

High-end / production:
  NVIDIA H100 80GB: 80GB VRAM, ~989 TFLOPS FP16 dense (1,979 with structured sparsity)
  Cost: ~$30,000–40,000, $6–8/hour (cloud)
  Good for: LLM training, production inference at scale

Multi-GPU configurations:
  8× H100 node: 640GB total VRAM, NVLink high-bandwidth interconnect
  Used for: training LLMs (LLaMA 70B needs at least 2× H100 80GB just for float16 inference; ~4–8 with long contexts or large batches)
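A quick way to sanity-check GPU counts is to divide weight memory by usable VRAM. The sketch below gives a lower bound from weights alone; KV cache, batch size, and parallelism overheads push real deployments higher. The function name and the 90% usable-VRAM fraction are assumptions for illustration.

```python
import math


def min_gpus_for_weights(
    num_params: int,
    bytes_per_param: int,
    vram_gb: float,
    usable_fraction: float = 0.9,  # assumed headroom for runtime overheads
) -> int:
    """Smallest number of GPUs whose combined usable VRAM holds the weights."""
    weights_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / (vram_gb * usable_fraction))


print("LLaMA 7B fp16 on RTX 4090 24GB:", min_gpus_for_weights(7_000_000_000, 2, 24))
print("LLaMA 70B fp16 on A100 40GB:", min_gpus_for_weights(70_000_000_000, 2, 40))
print("LLaMA 70B fp16 on H100 80GB:", min_gpus_for_weights(70_000_000_000, 2, 80))
```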

Training Time Estimates

Python
def estimate_training_time(
    model_params: int,        # number of parameters
    dataset_size: int,        # number of training examples
    batch_size: int,          # examples per optimiser step
    epochs: int,
    gpu_tflops: float,        # GPU peak TFLOPS for your precision
    efficiency: float = 0.4,  # fraction of peak actually achieved (40% is typical)
    tokens_per_example: int = 1024,  # assumed average sequence length
) -> dict:
    """Rough training time estimate using the ~6 FLOPs per parameter per token rule."""

    # Forward + backward pass ≈ 6 × model_params FLOPs per token,
    # so each step costs 6 × params × (examples per batch) × (tokens per example).
    flops_per_step = 6 * model_params * batch_size * tokens_per_example

    steps_per_epoch = dataset_size // batch_size
    total_steps = steps_per_epoch * epochs
    total_flops = flops_per_step * total_steps

    effective_flops_per_sec = gpu_tflops * efficiency * 1e12
    training_seconds = total_flops / effective_flops_per_sec
    training_hours   = training_seconds / 3600

    return {
        "total_steps": total_steps,
        "total_flops": f"{total_flops:.2e}",
        "estimated_hours": round(training_hours, 1),
    }

# Example: fine-tuning a 7B parameter model on ~1K-token examples
result = estimate_training_time(
    model_params=7_000_000_000,
    dataset_size=100_000,
    batch_size=8,
    epochs=3,
    gpu_tflops=312,  # A100 FP16
    tokens_per_example=1024,
)
print(result)  # roughly 20–40 hours of compute on a single A100 (≈29 hours with these inputs)

Cloud vs On-Premise

Cloud (AWS, GCP, Azure):
  + No upfront cost
  + Access to latest GPUs (H100) on demand
  + Pay per use
  - Expensive for continuous workloads ($4–8/hour × 1,000 hours = $4,000–8,000)
  - Data sovereignty concerns for healthcare (PHI)

On-premise GPU cluster:
  + Cost-effective for continuous training
  + Data stays on-site (HIPAA compliance easier)
  - High upfront cost ($30K–1M+)
  - Maintenance overhead
  
Clinical AI guideline: PHI processing should remain on-premise or in
a HIPAA-eligible cloud environment covered by a Business Associate
Agreement (BAA), for example AWS GovCloud or Azure Government.
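The cost trade-off above comes down to a break-even point: how many GPU-hours of cloud rental equal the purchase price. A minimal sketch using the price ranges quoted in this lesson; the function name and the `onprem_usd_per_hour` running-cost parameter (power, cooling, staff) are hypothetical and default to zero.

```python
def break_even_hours(
    purchase_usd: float,
    cloud_usd_per_hour: float,
    onprem_usd_per_hour: float = 0.0,  # hypothetical running cost of owned hardware
) -> float:
    """GPU-hours after which buying is cheaper than renting."""
    saving_per_hour = cloud_usd_per_hour - onprem_usd_per_hour
    if saving_per_hour <= 0:
        raise ValueError("Cloud is never more expensive per hour; no break-even point.")
    return purchase_usd / saving_per_hour


h100_hours = break_even_hours(purchase_usd=30_000, cloud_usd_per_hour=6.0)
a100_hours = break_even_hours(purchase_usd=10_000, cloud_usd_per_hour=4.0)

print(f"H100: {h100_hours:,.0f} hours (~{h100_hours / 24:.0f} days of continuous use)")
print(f"A100: {a100_hours:,.0f} hours (~{a100_hours / 24:.0f} days of continuous use)")
```

Occasional fine-tuning jobs rarely reach the break-even point, while a cluster that trains around the clock passes it within months, before counting maintenance.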

Reducing Compute Requirements

Python
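# 1. Mixed precision (float16 compute where safe, float32 master weights)
# Assumes model, criterion, optimizer, inputs, and targets already exist on the GPU.
import torch

scaler = torch.amp.GradScaler("cuda")  # PyTorch >= 2.3; older versions: torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()  # scale the loss so small float16 gradients do not underflow
scaler.step(optimizer)         # unscales gradients; skips the step if any are inf/NaN
scaler.update()
# Up to ~2× speedup and ~2× lower activation memory, minimal accuracy loss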

# 2. Gradient accumulation (simulate large batch on limited GPU memory)
accumulation_steps = 4   # effective batch = batch_size × 4
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# 3. LoRA (train well under 1% of parameters)
from peft import LoraConfig, get_peft_model
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
# 7B model with LoRA at r=8 on the attention query/value projections:
# ~4M trainable params (~0.06%) instead of 7B

# 4. Gradient checkpointing (recompute activations instead of storing them)
model.gradient_checkpointing_enable()  # Hugging Face Transformers models; 30–50% memory saving, ~30% slower
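The ~4M figure for LoRA can be checked by hand. Each adapted weight matrix of shape d_out × d_in gains two low-rank factors, A (r × d_in) and B (d_out × r). The sketch below assumes a LLaMA 7B shape of 32 layers with hidden size 4096, and that only the query and value projections in each layer are adapted. Those are assumptions for illustration, as is the function name.

```python
def lora_trainable_params(
    num_layers: int,
    d_in: int,
    d_out: int,
    r: int,
    modules_per_layer: int,
) -> int:
    """Trainable parameters added by LoRA: r * (d_in + d_out) per adapted matrix."""
    return num_layers * modules_per_layer * r * (d_in + d_out)


trainable = lora_trainable_params(
    num_layers=32, d_in=4096, d_out=4096, r=8, modules_per_layer=2
)
fraction = trainable / 7_000_000_000

print(f"Trainable LoRA params: {trainable:,}")
print(f"Share of a 7B model: {fraction:.4%}")
```

Because gradients and Adam states are only kept for the trainable parameters, shrinking that set is what makes fine-tuning fit on a single GPU; the frozen base weights still need to be held in memory.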

Interview Answer

"Deep learning requires GPUs because neural network training is dominated by matrix multiplications: massively parallel operations that GPUs handle with thousands of cores, orders of magnitude faster than CPUs. Memory is the key constraint: a 7B parameter model at float16 occupies 14GB of VRAM for the weights alone, and training needs roughly 4–6× that once gradients, optimiser states, and activations are included. Common optimisations: mixed precision (float16 gives up to ~2× savings in activation memory and compute), gradient accumulation (simulate a large batch on a small GPU), LoRA (fine-tune well under 1% of the parameters), and gradient checkpointing (trade compute for memory). For clinical AI with PHI, cloud GPU instances must run in a HIPAA-eligible environment under a Business Associate Agreement (for example AWS GovCloud or Azure Government), or training must happen on-premise."