What Is Fine-Tuning?

Fine-tuning is the process of continuing to train a pre-trained language model on a smaller, domain-specific dataset so that the model specializes in a particular task, style, or knowledge area. The base model's weights, learned from trillions of tokens of internet text, are updated through additional gradient descent steps on your curated data.

The key insight: you are not training from scratch. You are redirecting a model that already understands language, reasoning, and world knowledge toward your specific use case.

The Three Levels of LLM Adaptation

When you need a language model to do something specific, you have three main tools. Think of them as an escalating ladder of investment.

Level 1 — Prompting

You write a better prompt. No training, no data pipeline, no infrastructure. Just words.

Python

# Pure prompting — no fine-tuning needed
from openai import OpenAI

client = OpenAI()

system_prompt = """You are a pharmaceutical drug information specialist.
Answer questions about drug interactions, dosing, and contraindications.
Always cite the drug class and mechanism of action.
Format your answer as:
  Drug: <name>
  Class: <class>
  Answer: <answer>
  Warning: <any relevant caution>"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is metformin used for?"}
    ]
)
print(response.choices[0].message.content)

When prompting is enough:

- General tasks the model already handles well
- Prototyping or exploration phase
- Latency and cost are not critical constraints
- Output format flexibility is acceptable

When prompting breaks down:

- The model drifts from your format instructions on long or unusual inputs
- You need very domain-specific vocabulary or reasoning
- You're paying for 2,000-token system prompts on every call
- Consistency across thousands of calls is mandatory
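
One way to find out whether format drift is actually happening is to measure it. The sketch below checks responses against the four-field layout requested in the system prompt above and reports the fraction that comply. The function names and the sample responses are hypothetical and written by hand for illustration; they are not real model output.

```python
import re

# Field labels taken from the system prompt in the prompting example.
REQUIRED_FIELDS = ("Drug", "Class", "Answer", "Warning")


def follows_format(response_text: str) -> bool:
    """True if every required field appears in order, each at the start of a line."""
    position = 0
    for field in REQUIRED_FIELDS:
        pattern = re.compile(rf"^[ \t]*{field}:[ \t]*\S", re.MULTILINE)
        match = pattern.search(response_text, position)
        if match is None:
            return False
        position = match.end()
    return True


def format_compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that follow the required format."""
    if not responses:
        return 0.0
    return sum(follows_format(r) for r in responses) / len(responses)


# Hypothetical responses, hand-written for illustration.
sample_responses = [
    "Drug: Metformin\nClass: Biguanide\n"
    "Answer: First-line therapy for type 2 diabetes.\n"
    "Warning: Avoid in severe renal impairment.",
    "Metformin is a biguanide used for type 2 diabetes.",
]
print(format_compliance_rate(sample_responses))  # 0.5
```

If the compliance rate stays well below 1.0 after several rounds of prompt iteration, that is a concrete signal that prompting alone may not be enough.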

Level 2 — Retrieval-Augmented Generation (RAG)

You keep the model frozen but inject relevant documents at inference time. The model reads the passages retrieved from your private knowledge base as part of the prompt.

Python

# RAG pipeline — no fine-tuning, but requires retrieval infrastructure
from openai import OpenAI

client = OpenAI()

# Simplified retrieval step
def retrieve_drug_info(query: str, drug_database: list[dict]) -> str:
    """Return the first document whose keywords appear in the query."""
    # In production: embed the query and search a vector database like Pinecone or Weaviate
    # Here: keyword match for illustration
    query_lower = query.lower()
    for doc in drug_database:
        if any(term in query_lower for term in doc["keywords"]):
            return doc["content"]
    return "No specific information found."

drug_database = [
    {
        "keywords": ["metformin", "diabetes", "glucose"],
        "content": (
            "Metformin (Glucophage) is a biguanide antidiabetic agent. "
            "First-line therapy for type 2 diabetes. Reduces hepatic glucose production. "
            "Dose: 500-2000 mg/day. Contraindicated in severe renal impairment (eGFR under 30)."
        )
    }
]

def answer_with_rag(question: str) -> str:
    context = retrieve_drug_info(question, drug_database)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

print(answer_with_rag("What is metformin and when should I avoid it?"))
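
The keyword match above stands in for similarity search. As a rough sketch of what ranking by similarity looks like, the example below scores documents against the query with cosine similarity over bag-of-words counts. The bag-of-words "embedding" is an assumption made to keep the example dependency-free; a real pipeline would call an embedding model and a vector index instead. The helper names and the two-document knowledge base are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical mini knowledge base for illustration.
documents = [
    "Metformin is a biguanide antidiabetic agent. First-line therapy for type 2 diabetes.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug used for pain and fever.",
]
top = retrieve_top_k("Which drug is first-line therapy for type 2 diabetes?", documents)
print(top[0])  # the metformin document ranks first
```

Unlike the first-match keyword loop, this ranks every document, so the best match wins even when several documents share terms with the query.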

RAG is best when:

- Your knowledge base updates frequently
- You need citations and source attribution
- The knowledge does not fit in a model's weights (e.g., 500,000 drug records)
- You cannot retrain a model on a schedule that matches data updates

Level 3 — Fine-Tuning

You update the model's weights. The domain knowledge, style, and reasoning patterns are baked into the model itself.

Python

# Fine-tuning via Hugging Face Trainer (simplified)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
import torch

model_name = "meta-llama/Llama-3.2-3B-Instruct"
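tokenizer = AutoTokenizer.from_pretrained(model_name)
# Caution: the data collator used below masks pad-token positions in the labels,
# so reusing EOS as the pad token also hides EOS tokens from the loss.
tokenizer.pad_token = tokenizer.eos_token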

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Your domain-specific training data
training_examples = [
    {
        "text": (
            # Llama 3 chat format: a blank line follows each header, and
            # <|eot_id|> closes each turn directly after the content.
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
            "What is metformin?"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            "Metformin is a biguanide antidiabetic drug used as first-line therapy "
            "for type 2 diabetes mellitus. It reduces hepatic glucose production via "
            "AMPK activation and does not cause hypoglycemia as monotherapy."
            "<|eot_id|>"
        )
    },
    # ... hundreds more examples
]

dataset = Dataset.from_list(training_examples)

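def tokenize(example):
    # The text already starts with <|begin_of_text|>, so skip the tokenizer's
    # automatic special tokens to avoid a duplicated BOS token.
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        add_special_tokens=False,
    )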

tokenized_dataset = dataset.map(tokenize)

training_args = TrainingArguments(
    output_dir="./drug-llm-checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    save_steps=100,
    fp16=False,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

What Fine-Tuning Actually Changes

This is critical to understand: fine-tuning changes weights, not architecture.

The transformer architecture — number of layers, attention heads, hidden dimensions, positional encoding — stays exactly the same. What changes are the numerical values stored in the weight matrices.

Pre-trained weights (starting point):
  W_q = [[0.234, -0.891, ...], ...]   ← learned from 2 trillion tokens

After fine-tuning:
  W_q = [[0.231, -0.887, ...], ...]   ← nudged by your 1,000 domain examples

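How large the nudge is depends on the method. With PEFT methods like LoRA, the pre-trained weights stay frozen and the change lives in a small set of added low-rank adapter weights. With full fine-tuning, every parameter can shift. Either way, the model that comes out is still a transformer; it just "thinks differently" because its weights encode your domain.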
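
To make the size of a LoRA update concrete, the sketch below counts trainable values for a single weight matrix. It assumes the standard LoRA formulation, where the update to a `d_out x d_in` matrix `W` is the product `B @ A` with `B` of shape `d_out x rank` and `A` of shape `rank x d_in`. The 4096 x 4096 matrix size is an assumption chosen for illustration, not a figure from this article.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for a LoRA update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed example: one 4096 x 4096 projection matrix, LoRA rank 16.
d = 4096
full = full_update_params(d, d)
lora = lora_update_params(d, d, rank=16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

Under these assumptions the adapter trains under 1% as many values as a full update of the same matrix, which is where most of LoRA's memory savings come from.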

The Gradient Flow During Fine-Tuning

Python

# Conceptual view of what happens during fine-tuning
import torch
import torch.nn as nn

# Simplified linear layer (represents attention projection)
layer = nn.Linear(768, 768)

# Before fine-tuning
weight_before = layer.weight.data.clone()

# Forward pass on your domain data
optimizer = torch.optim.AdamW(layer.parameters(), lr=2e-5)
domain_input = torch.randn(4, 768)   # batch of 4 tokens
target = torch.randn(4, 768)         # what we want the layer to output

output = layer(domain_input)
loss = nn.MSELoss()(output, target)

# Backward pass — this is what fine-tuning is
loss.backward()
optimizer.step()

weight_after = layer.weight.data

# The difference is tiny but meaningful
delta = (weight_after - weight_before).abs().mean().item()
print(f"Average weight change: {delta:.6f}")
# Typical output: Average weight change: 0.000020
# (AdamW's first step moves each weight by roughly the learning rate, 2e-5)
# Multiplied across 7 billion parameters, that is a meaningful shift
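
The same update can be worked by hand on a one-weight model, which shows exactly what the backward pass and optimizer step compute. This sketch swaps in plain SGD rather than the AdamW optimizer used above, because its update rule fits on one line. The function name and the input values are hypothetical.

```python
def sgd_step(weight: float, x: float, target: float, lr: float) -> float:
    """One plain SGD step for the model y = weight * x with squared-error loss."""
    prediction = weight * x
    # d/dw (prediction - target)^2 = 2 * (prediction - target) * x
    gradient = 2.0 * (prediction - target) * x
    # Move against the gradient, scaled by the learning rate.
    return weight - lr * gradient


w_before = 0.5
w_after = sgd_step(w_before, x=2.0, target=3.0, lr=0.01)
print(w_before, w_after)
```

Here the prediction (1.0) is below the target (3.0), so the gradient is negative and the weight increases slightly. Fine-tuning repeats this kind of step for every trainable weight on every batch.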

Pre-Training vs Fine-Tuning vs Prompting: The Real Differences

| Dimension | Pre-Training | Fine-Tuning | Prompting |
|---|---|---|---|
| Data required | Trillions of tokens | Hundreds to tens of thousands of examples | Zero |
| Compute cost | Millions of dollars | Hundreds to thousands of dollars | Per-call API cost |
| Changes weights? | Yes (from scratch) | Yes (from checkpoint) | No |
| Persists after session? | Yes | Yes | No |
| Latency overhead | None | None | System prompt tokens |
| Updates knowledge | Creates knowledge | Adapts knowledge | Retrieves at runtime |

The Real Cost of Fine-Tuning

Fine-tuning is not free. You pay in three currencies: compute, data, and time.

Compute Cost

Full fine-tuning of Llama 3 8B:
  - GPU memory needed: ~80 GB (model + gradients + optimizer states)
  - Hardware: 1x A100 80GB or 2x A100 40GB
  - Time for 1,000 examples, 3 epochs: ~2 hours
  - Cost at an assumed ~$6/hour for one A100 (AWS offers p4d only as the 8-GPU p4d.24xlarge): ~$12 per run

LoRA fine-tuning of Llama 3 8B (rank 16):
  - GPU memory needed: ~18 GB
  - Hardware: 1x RTX 4090 (24 GB) or 1x A10G
  - Time for 1,000 examples, 3 epochs: ~45 minutes
  - Cost on Lambda Labs A10G: ~$0.75/hour → ~$0.56 per run

QLoRA fine-tuning of Llama 3 70B (rank 16):
  - GPU memory needed: ~48 GB (4-bit quantized base + adapters)
  - Hardware: 1x A100 80GB
  - Time for 1,000 examples, 3 epochs: ~4 hours
  - Cost on AWS: ~$32 per run
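
The per-run figures above are wall-clock hours multiplied by an hourly GPU price. A minimal sketch of that arithmetic follows, using the hours and rates quoted in the estimates. The $8/hour QLoRA rate is not stated directly; it is implied by ~$32 for ~4 hours. The function name is hypothetical.

```python
def run_cost_usd(hours: float, hourly_rate_usd: float) -> float:
    """Cost of one training run: wall-clock hours times the hourly GPU price."""
    return round(hours * hourly_rate_usd, 2)


# Hours and rates are the rough estimates quoted above, not measured values.
print(run_cost_usd(2.0, 6.00))   # 12.0  (full fine-tuning, Llama 3 8B)
print(run_cost_usd(0.75, 0.75))  # 0.56  (LoRA, Llama 3 8B)
print(run_cost_usd(4.0, 8.00))   # 32.0  (QLoRA, Llama 3 70B; implied rate)
```

A single run is cheap in every case. The real compute bill comes from repeating runs across experiments, so multiply by the number of iterations you expect.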

Data Cost

Python

# Estimating data collection cost
def estimate_data_cost(
    num_examples: int,
    minutes_per_example: float,
    hourly_rate_usd: float,
    llm_generation_fraction: float = 0.0
) -> dict:
    """
    Estimate the cost of building a fine-tuning dataset.

    Args:
        num_examples: Target dataset size
        minutes_per_example: Human time per human-written example (writing plus review)
        hourly_rate_usd: Cost of human reviewer ($/hour)
        llm_generation_fraction: Fraction generated by LLM (rest is human-written)
    """
    # round() avoids float truncation (e.g. 1000 * (1 - 0.9) is slightly below 100)
    human_examples = round(num_examples * (1 - llm_generation_fraction))
    llm_examples = num_examples - human_examples

    human_hours = (human_examples * minutes_per_example) / 60
    human_cost = human_hours * hourly_rate_usd

    # Assumed LLM price: ~$5 per million output tokens, ~200 tokens per example
    # (API prices change; check your provider's current rates)
    llm_cost = llm_examples * 200 * (5 / 1_000_000)

    return {
        "human_examples": human_examples,
        "llm_examples": llm_examples,
        "human_cost_usd": round(human_cost, 2),
        "llm_cost_usd": round(llm_cost, 2),
        "total_cost_usd": round(human_cost + llm_cost, 2),
        "total_hours": round(human_hours, 1),
    }

# Example: 1,000 drug QA pairs, 50% human-written, 50% LLM-generated
cost = estimate_data_cost(
    num_examples=1000,
    minutes_per_example=5,
    hourly_rate_usd=30,
    llm_generation_fraction=0.5
)
print(cost)
# {
#   'human_examples': 500, 'llm_examples': 500,
#   'human_cost_usd': 1250.0, 'llm_cost_usd': 0.5,
#   'total_cost_usd': 1250.5, 'total_hours': 41.7
# }

Time Cost

Fine-tuning is not instantaneous. Factor in:

- Data collection and cleaning: often the longest phase (days to weeks)
- Training runs: hours per experiment
- Evaluation: human review of model outputs (hours)
- Iteration: you rarely get it right on the first run

A realistic timeline for a production fine-tuning project:

Week 1: Define task, collect initial 200 examples, establish eval criteria
Week 2: Run baseline experiments, identify data gaps
Week 3: Expand dataset to 1,000+ examples, run hyperparameter sweep
Week 4: Human evaluation, regression testing, deployment

Summary

Fine-tuning sits in the middle of the adaptation spectrum. It is more powerful than prompting — the domain knowledge is baked into the weights, not read from context — but cheaper than pre-training from scratch because you start from an already capable foundation.

Use fine-tuning when:

- Prompting cannot reliably produce the required output format or style
- You need consistent behavior across millions of calls without paying for a long system prompt on every call
- Your domain requires vocabulary or reasoning that the base model does not handle well
- You have at least several hundred high-quality examples
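
The token-cost criterion can be turned into a rough break-even estimate: how many calls it takes before dropping the long system prompt pays back the one-off fine-tuning cost. The sketch below reuses this article's example figures (about $1,250.50 for data plus $0.56 for one LoRA run, and a 2,000-token system prompt). The $2.50 per million input tokens is an assumed price, and the calculation assumes the fine-tuned model costs the same per token as the base model, which may not hold for hosted fine-tuned models.

```python
def break_even_calls(
    fine_tuning_cost_usd: float,
    system_prompt_tokens: int,
    input_price_per_million_usd: float,
) -> float:
    """Calls needed before the saved system-prompt tokens repay the fine-tuning cost."""
    saving_per_call = system_prompt_tokens * input_price_per_million_usd / 1_000_000
    if saving_per_call <= 0:
        return float("inf")
    return fine_tuning_cost_usd / saving_per_call


# Data and compute figures from this article's examples; the token price is assumed.
calls = break_even_calls(
    fine_tuning_cost_usd=1250.50 + 0.56,
    system_prompt_tokens=2000,
    input_price_per_million_usd=2.50,
)
print(f"{calls:,.0f} calls to break even")  # 250,212 calls to break even
```

Under these assumptions, token savings alone justify fine-tuning only at a volume of a few hundred thousand calls. At lower volumes the case has to rest on quality and consistency instead.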

In the next lessons, you will learn how to decide precisely when fine-tuning is justified, and then dive into the mechanics of how it works.
