What Is Fine-Tuning?
Understand fine-tuning at the conceptual level: what it changes, what it costs, and how it fits into the LLM adaptation toolkit alongside prompting and RAG.
Fine-tuning is the process of continuing to train a pre-trained language model on a smaller, domain-specific dataset so that the model specializes in a particular task, style, or knowledge area. The base model's weights, learned from trillions of tokens of mostly web text, are updated through additional gradient descent steps on your curated data.
The key insight: you are not training from scratch. You are redirecting a model that already understands language, reasoning, and world knowledge toward your specific use case.
The Three Levels of LLM Adaptation
When you need a language model to do something specific, you have three main tools. Think of them as an escalating ladder of investment.
Level 1: Prompting
You write a better prompt. No training, no data pipeline, no infrastructure. Just words.
# Pure prompting: no fine-tuning needed
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = """You are a pharmaceutical drug information specialist.
Answer questions about drug interactions, dosing, and contraindications.
Always cite the drug class and mechanism of action.
Format your answer as:
Drug: <name>
Class: <class>
Answer: <answer>
Warning: <any relevant caution>"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is metformin used for?"},
    ],
)
print(response.choices[0].message.content)

When prompting is enough:
- General tasks the model already handles well
- Prototyping or exploration phase
- Latency and cost are not critical constraints
- Output format flexibility is acceptable
When prompting breaks down:
- The model drifts from your format instructions on long or unusual inputs
- You need very domain-specific vocabulary or reasoning
- You're paying for 2,000-token system prompts on every call
- Consistency across thousands of calls is mandatory
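The cost of resending a long system prompt is easy to estimate. The sketch below is a rough calculation, not a pricing reference: the price per million input tokens, the call volume, and the one-off budget are hypothetical placeholders, and it assumes the replacement model is billed at the same per-token rate.

```python
import math


def system_prompt_cost_usd(prompt_tokens: int, calls: int, usd_per_million_tokens: float) -> float:
    """Total spend on resending the same system prompt with every call."""
    return prompt_tokens * calls * usd_per_million_tokens / 1_000_000


def break_even_calls(one_off_cost_usd: float, tokens_saved_per_call: int, usd_per_million_tokens: float) -> int:
    """Number of calls after which dropping the prompt repays a one-off cost."""
    return math.ceil(one_off_cost_usd * 1_000_000 / (tokens_saved_per_call * usd_per_million_tokens))


# Hypothetical inputs: 2,000-token prompt, 1M calls, $2.50 per million input tokens
print(system_prompt_cost_usd(2_000, 1_000_000, 2.50))  # 5000.0
# Hypothetical one-off budget of $1,250 for removing the prompt
print(break_even_calls(1_250.0, 2_000, 2.50))  # 250000
```

At low call volumes the prompt overhead is negligible; it only becomes a deciding factor once the volume approaches the break-even count.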
Level 2: Retrieval-Augmented Generation (RAG)
You keep the model frozen but inject relevant documents at inference time. The model reads the relevant slice of your private knowledge base from the prompt.
# RAG pipeline: no fine-tuning, but requires retrieval infrastructure
from openai import OpenAI

client = OpenAI()

# Simplified retrieval step
def retrieve_drug_info(query: str, drug_database: list[dict]) -> str:
    """Return the first document whose keywords appear in the query."""
    # In production: embed the query and search a vector database like Pinecone or Weaviate
    # Here: keyword match for illustration
    query_lower = query.lower()
    for doc in drug_database:
        if any(term in query_lower for term in doc["keywords"]):
            return doc["content"]
    return "No specific information found."

drug_database = [
    {
        "keywords": ["metformin", "diabetes", "glucose"],
        "content": (
            "Metformin (Glucophage) is a biguanide antidiabetic agent. "
            "First-line therapy for type 2 diabetes. Reduces hepatic glucose production. "
            "Dose: 500-2000 mg/day. Contraindicated in severe renal impairment (eGFR under 30)."
        ),
    }
]

def answer_with_rag(question: str) -> str:
    # Retrieve context first, then let the frozen model answer from it
    context = retrieve_drug_info(question, drug_database)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_with_rag("What is metformin and when should I avoid it?"))

RAG is best when:
- Your knowledge base updates frequently
- You need citations and source attribution
- The knowledge is too large or too detailed to memorize reliably in a model's weights (e.g., 500,000 drug records)
- You cannot retrain a model on a schedule that matches data updates
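The keyword match in the example above stands in for similarity search. As a minimal sketch of that idea, the code below ranks documents by cosine similarity. The "embedding" here is a toy bag-of-words count vector used only to keep the example dependency-free; a real pipeline would call an embedding model instead. The `embed`, `cosine`, and `retrieve` helpers and the ibuprofen document are illustrative additions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


documents = [
    "Metformin is a biguanide antidiabetic agent. Contraindicated in severe renal impairment.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug used for pain and fever.",
]
top = retrieve("Is metformin safe with renal impairment?", documents)
print(top[0])  # prints the metformin document
```

Swapping `embed` for a learned embedding model keeps the ranking logic unchanged, which is why the retrieval index can be refreshed without touching the language model.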
Level 3: Fine-Tuning
You update the model's weights. The domain knowledge, style, and reasoning patterns are baked into the model itself.
# Fine-tuning via Hugging Face Trainer (simplified)
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reusing EOS as the pad token is a common shortcut, but the collator below
# masks every label equal to the pad token, so EOS positions drop out of the
# loss. Use a dedicated pad token if the model must learn where to stop.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Your domain-specific training data, written in the Llama 3 chat format
# (each header is followed by a blank line, each turn ends with <|eot_id|>)
training_examples = [
    {
        "text": (
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
            "What is metformin?"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            "Metformin is a biguanide antidiabetic drug used as first-line therapy "
            "for type 2 diabetes mellitus. It reduces hepatic glucose production via "
            "AMPK activation and does not cause hypoglycemia as monotherapy."
            "<|eot_id|>"
        )
    },
    # ... hundreds more examples
]
dataset = Dataset.from_list(training_examples)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        add_special_tokens=False,  # the text already starts with <|begin_of_text|>
    )

tokenized_dataset = dataset.map(tokenize, remove_columns=["text"])
training_args = TrainingArguments(
    output_dir="./drug-llm-checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    save_steps=100,
    bf16=True,  # train in bfloat16 to match the loaded weights
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # mlm=False gives causal-LM labels (a copy of input_ids with padding masked)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

What Fine-Tuning Actually Changes
This is critical to understand: fine-tuning changes weights, not architecture.
The transformer architecture (number of layers, attention heads, hidden dimensions, positional encoding) stays exactly the same. What changes are the numerical values stored in the weight matrices.
Pre-trained weights (before fine-tuning):
W_q = [[0.234, -0.891, ...], ...] ← learned from 2 trillion tokens
After fine-tuning:
W_q = [[0.231, -0.887, ...], ...] ← nudged by your 1,000 domain examples
With full fine-tuning, every parameter can shift. PEFT methods like LoRA keep the pre-trained weights frozen and learn the nudge as a small set of extra low-rank matrices. Either way, the model that comes out is still a transformer; it just "thinks differently" because its weights encode your domain.
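The LoRA idea can be made concrete with a little arithmetic. In LoRA the frozen weight W is left untouched, two small matrices B and A are trained, and the effective weight is W + scale * (B @ A). The sketch below is illustrative only: it uses plain Python lists instead of tensors, and the 4096 width and rank 16 are example values, not the configuration of any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Parameters in a full weight matrix vs. in its LoRA factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def apply_lora(W, B, A, scale: float = 1.0):
    """Effective weight W + scale * (B @ A), computed on nested lists."""
    rank = len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(len(W[0]))]
        for i in range(len(W))
    ]


full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%

# Tiny rank-1 example: the frozen W is unchanged, the update lives in B and A
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0]]
print(apply_lora(W, B, A))  # [[1.5, 0.0], [1.0, 1.0]]
```

Under these example values the adapter holds under 1% as many parameters as the matrix it modifies, which is where the memory and cost savings of PEFT come from.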
The Gradient Flow During Fine-Tuning
# Conceptual view of what happens during fine-tuning
import torch
import torch.nn as nn
# Simplified linear layer (represents attention projection)
layer = nn.Linear(768, 768)
# Before fine-tuning
weight_before = layer.weight.data.clone()
# Forward pass on your domain data
optimizer = torch.optim.AdamW(layer.parameters(), lr=2e-5)
domain_input = torch.randn(4, 768) # batch of 4 tokens
target = torch.randn(4, 768) # what we want the layer to output
output = layer(domain_input)
loss = nn.MSELoss()(output, target)
# Backward pass and optimizer step: this is the core of fine-tuning
loss.backward()
optimizer.step()
weight_after = layer.weight.data
# The difference is tiny but meaningful
delta = (weight_after - weight_before).abs().mean().item()
print(f"Average weight change: {delta:.6f}")
# Typical output: Average weight change: 0.000020
# (the first AdamW step moves each weight by roughly the learning rate)
# Accumulated over many steps and billions of parameters, this becomes a meaningful shift

Pre-Training vs Fine-Tuning vs Prompting: The Real Differences
| Dimension | Pre-Training | Fine-Tuning | Prompting |
|---|---|---|---|
| Data required | Trillions of tokens | Hundreds to tens of thousands of examples | Zero |
| Compute cost | Millions of dollars | Hundreds to thousands of dollars | Per-call API cost |
| Changes weights? | Yes (from scratch) | Yes (from checkpoint) | No |
| Persists after session? | Yes | Yes | No |
| Latency overhead | None | None | System prompt tokens |
| Updates knowledge | Creates knowledge | Adapts knowledge | Retrieves at runtime |
The Real Cost of Fine-Tuning
Fine-tuning is not free. You pay in three currencies: compute, data, and time.
Compute Cost
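The memory figures in this section follow mostly from how many bytes are held per parameter. The sketch below is a rough floor under stated assumptions, not a measurement: it counts weights, gradients, and optimizer state only, assumes bf16 weights and gradients with two fp32 AdamW moment buffers for trainable parameters, and uses a hypothetical adapter size of 40 million parameters.

```python
def training_memory_gb(
    frozen_params: float,
    frozen_bytes_per_param: float,
    trainable_params: float,
    trainable_bytes_per_param: float = 12.0,
) -> float:
    """Rough memory for weights, gradients, and optimizer state (activations excluded).

    The default of 12 bytes per trainable parameter assumes bf16 weights (2),
    bf16 gradients (2), and two fp32 AdamW moment buffers (4 + 4).
    """
    total_bytes = (
        frozen_params * frozen_bytes_per_param
        + trainable_params * trainable_bytes_per_param
    )
    return total_bytes / 1e9


ADAPTER_PARAMS = 40e6  # hypothetical LoRA adapter size, for illustration only

full_8b = training_memory_gb(0, 0.0, 8e9)  # every weight is trainable
lora_8b = training_memory_gb(8e9, 2.0, ADAPTER_PARAMS)  # frozen bf16 base + adapters
qlora_70b = training_memory_gb(70e9, 0.5, ADAPTER_PARAMS)  # frozen 4-bit base + adapters
print(round(full_8b, 1), round(lora_8b, 1), round(qlora_70b, 1))  # 96.0 16.5 35.5
```

Activations, framework overhead, and quantization metadata come on top of these floors, so practical requirements run higher. The 96 GB result for full fine-tuning also suggests why fitting an 8B model on a single 80 GB card usually depends on memory savers such as an 8-bit optimizer.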
Full fine-tuning of Llama 3 8B:
- GPU memory needed: ~80 GB at minimum (model + gradients + optimizer states); standard AdamW needs more, so fitting on one card usually relies on an 8-bit optimizer or sharding across GPUs
- Hardware: 1x A100 80GB or 2x A100 40GB
- Time for 1,000 examples, 3 epochs: ~2 hours
- Cost at ~$6/hour for a rented A100 80GB → ~$12 per run
LoRA fine-tuning of Llama 3 8B (rank 16):
- GPU memory needed: ~18 GB
- Hardware: 1x RTX 4090 (24 GB) or 1x A10G
- Time for 1,000 examples, 3 epochs: ~45 minutes
- Cost on Lambda Labs A10G: ~$0.75/hour → ~$0.56 per run
QLoRA fine-tuning of Llama 3 70B (rank 16):
- GPU memory needed: ~48 GB (4-bit quantized base + adapters)
- Hardware: 1x A100 80GB
- Time for 1,000 examples, 3 epochs: ~4 hours
- Cost on AWS: ~$32 per run
Data Cost
# Estimating data collection cost
def estimate_data_cost(
    num_examples: int,
    minutes_per_example: float,
    hourly_rate_usd: float,
    llm_generation_fraction: float = 0.0,
    llm_usd_per_million_tokens: float = 5.0,
    tokens_per_example: int = 200,
) -> dict:
    """
    Estimate the cost of building a fine-tuning dataset.

    Args:
        num_examples: Target dataset size
        minutes_per_example: Human time spent per human-written example
        hourly_rate_usd: Cost of the human writer ($/hour)
        llm_generation_fraction: Fraction generated by an LLM (rest is human-written)
        llm_usd_per_million_tokens: Assumed output-token price; check current pricing
        tokens_per_example: Assumed output tokens per generated example
    """
    human_examples = int(num_examples * (1 - llm_generation_fraction))
    llm_examples = num_examples - human_examples
    human_hours = (human_examples * minutes_per_example) / 60
    human_cost = human_hours * hourly_rate_usd
    # LLM generation cost only; human review of generated examples is not included
    llm_cost = llm_examples * tokens_per_example * (llm_usd_per_million_tokens / 1_000_000)
    return {
        "human_examples": human_examples,
        "llm_examples": llm_examples,
        "human_cost_usd": round(human_cost, 2),
        "llm_cost_usd": round(llm_cost, 2),
        "total_cost_usd": round(human_cost + llm_cost, 2),
        "total_hours": round(human_hours, 1),
    }

# Example: 1,000 drug QA pairs, 50% human-written, 50% LLM-generated
cost = estimate_data_cost(
    num_examples=1000,
    minutes_per_example=5,
    hourly_rate_usd=30,
    llm_generation_fraction=0.5,
)
print(cost)
# {
# 'human_examples': 500, 'llm_examples': 500,
# 'human_cost_usd': 1250.0, 'llm_cost_usd': 0.5,
# 'total_cost_usd': 1250.5, 'total_hours': 41.7
# }

Time Cost
Fine-tuning is not instantaneous. Factor in:
- Data collection and cleaning: often the longest phase (days to weeks)
- Training runs: hours per experiment
- Evaluation: human review of model outputs (hours)
- Iteration: you rarely get it right on the first run
A realistic timeline for a production fine-tuning project:
Week 1: Define task, collect initial 200 examples, establish eval criteria
Week 2: Run baseline experiments, identify data gaps
Week 3: Expand dataset to 1,000+ examples, run hyperparameter sweep
Week 4: Human evaluation, regression testing, deployment
Summary
Fine-tuning sits in the middle of the adaptation spectrum. It is more powerful than prompting, because the domain knowledge is baked into the weights rather than read from context, but far cheaper than pre-training from scratch because you start from an already capable foundation.
Use fine-tuning when:
- Prompting cannot reliably produce the required output format or style
- You need consistent behavior across millions of calls without paying for a long system prompt on every call
- Your domain requires vocabulary or reasoning that the base model does not handle well
- You have at least several hundred high-quality examples
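One way to test the first criterion is to measure how often prompted outputs actually follow the required format. The sketch below checks the four-field Drug / Class / Answer / Warning layout from the prompting example. The helper names and the sample outputs are illustrative assumptions, and a real evaluation would run over a much larger batch of model responses.

```python
import re

# One line per field, in the order requested by the system prompt
FORMAT_PATTERN = re.compile(r"Drug: .+\nClass: .+\nAnswer: .+\nWarning: .+")


def follows_format(output: str) -> bool:
    """True if the output is exactly the four expected single-line fields."""
    return FORMAT_PATTERN.fullmatch(output.strip()) is not None


def format_compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that follow the format (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(follows_format(o) for o in outputs) / len(outputs)


# Hand-written sample outputs, for illustration only
samples = [
    "Drug: Metformin\nClass: Biguanide\nAnswer: First-line therapy for type 2 diabetes.\nWarning: Avoid in severe renal impairment.",
    "Metformin is a biguanide used for type 2 diabetes.",
]
print(format_compliance_rate(samples))  # 0.5
```

If the rate stays well below what the application needs even after prompt iteration, that is evidence for moving up to fine-tuning.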
In the next lessons, you will learn how to decide precisely when fine-tuning is justified, and then dive into the mechanics of how it works.