# Hugging Face Transformers — From Model Hub to Production
Complete Hugging Face guide — Model Hub, pipelines, tokenizers, fine-tuning with Trainer API, PEFT/LoRA for efficient fine-tuning, Inference API, and deploying models to production with Inference Endpoints.
Hugging Face is the GitHub of machine learning models — 500,000+ pre-trained models, 150,000+ datasets, and the transformers library that standardises how you load, run, and fine-tune them. Whether you need sentiment analysis, text generation, embeddings, or a custom fine-tuned LLM, Hugging Face is the starting point.
## The Ecosystem
```
┌────────────────────────────────────────────────────────────────┐
│ Hugging Face Ecosystem │
│ │
│ Model Hub Datasets Spaces │
│ ────────── ──────── ────── │
│ 500k+ models 150k+ datasets Hosted demos │
│ Any framework Streaming-ready Gradio / Streamlit │
│ │
│ Libraries │
│ ───────── │
│ transformers Core: models, tokenizers, pipelines │
│ datasets Load and process datasets efficiently │
│ peft Parameter-efficient fine-tuning (LoRA) │
│ trl RLHF / SFT training │
│ accelerate Multi-GPU / distributed training │
│ evaluate Standardised metrics (BLEU, ROUGE, F1...) │
│ tokenizers Fast Rust-based tokenization │
└────────────────────────────────────────────────────────────────┘
```
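These libraries compose; for example, datasets can stream a corpus without downloading it in full. A minimal sketch (imdb is just an illustrative dataset):

```python
from datasets import load_dataset

# streaming=True yields an iterable dataset: no full download, constant memory
stream = load_dataset("imdb", split="train", streaming=True)
print(next(iter(stream))["text"][:80])
```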
## Pipelines — Zero-Code Inference
pipeline() is the fastest way to use any model — it handles tokenization, the model forward pass, and post-processing:
```python
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This model is absolutely fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York and Paris.")
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...}, {'entity_group': 'LOC', ...}]
# Text generation
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
output = generator(
"Explain the difference between SQL and NoSQL in one paragraph:",
max_new_tokens=200,
temperature=0.7,
do_sample=True
)
# Question answering (extractive)
qa = pipeline("question-answering")
result = qa(
question="What is the capital of France?",
context="Paris is the capital and largest city of France."
)
# {'answer': 'Paris', 'score': 0.998, 'start': 0, 'end': 5}
# Summarisation
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summariser(long_text, max_length=130, min_length=30)
# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-no")
translated = translator("Hello, how are you?")
# Zero-shot classification (no training needed)
classifier = pipeline("zero-shot-classification")
result = classifier(
"I love playing chess and solving puzzles",
candidate_labels=["sports", "technology", "gaming", "food"]
)
# {'labels': ['gaming', 'sports', ...], 'scores': [0.78, 0.12, ...]}
```
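Pipelines also accept a device argument and batch lists of inputs. A minimal sketch — the batch size here is illustrative, not a recommendation:

```python
from transformers import pipeline

# device=0 targets the first GPU; omit it to stay on CPU
classifier = pipeline("sentiment-analysis", device=0)

# Lists are batched internally; batch_size trades throughput for memory
reviews = ["Great product", "Terrible support", "Works as expected"]
for review, result in zip(reviews, classifier(reviews, batch_size=8)):
    print(f"{result['label']:>8} {result['score']:.3f}  {review}")
```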
## Tokenizers — Understanding What Models Actually See
Before a model sees text, it is tokenized — split into subword tokens and converted to integer IDs. Understanding this is essential for correct input formatting.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Basic tokenization
tokens = tokenizer("Hello world, this is a test!")
print(tokens)
# {'input_ids': [101, 7592, 2088, 1010, 2023, 2003, 1037, 3231, 999, 102],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Decode back to text
tokenizer.decode(tokens["input_ids"])
# '[CLS] hello world, this is a test! [SEP]'
# Batch encoding with padding and truncation
batch = tokenizer(
["Short text.", "This is a much longer text that needs padding to match the batch."],
padding=True, # pad shorter sequences to batch max length
truncation=True, # truncate to model max length
max_length=512,
return_tensors="pt" # "pt" = PyTorch tensors, "tf" = TensorFlow, "np" = NumPy
)
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 16]) — padding=True pads to the longest sequence in the batch, not to max_length
# Chat template for instruction-tuned models
messages = [
{"role": "system", "content": "You are a helpful Python tutor."},
{"role": "user", "content": "What is a list comprehension?"},
]
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
formatted = chat_tokenizer.apply_chat_template(messages, tokenize=False)
```
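Two small diagnostics are worth knowing: how a word splits into subwords, and how many tokens a string costs. A sketch reusing the bert-base-uncased tokenizer from above (the exact split depends on the vocabulary):

```python
# WordPiece marks word-internal subwords with a ## prefix
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']

# Token count without [CLS]/[SEP] — useful when budgeting against max_length
ids = tokenizer("Attention is all you need.", add_special_tokens=False)["input_ids"]
print(len(ids))
```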
## Loading Models: AutoClasses
```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Auto-selects the right class based on model config
model = AutoModel.from_pretrained("bert-base-uncased")
# Task-specific model heads
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")
# Load with quantization (reduced memory — run large models on consumer GPUs)
from transformers import BitsAndBytesConfig
import torch
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-Instruct-v0.2",
quantization_config=quant_config, # 7B model in ~5GB VRAM instead of ~14GB
device_map="auto" # auto-distribute across available GPUs
)
```
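From there, generation is a tokenize → generate → decode round trip. A minimal sketch using the quantised model loaded above (prompt and sampling settings are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tokenizer("Explain B-tree indexes in one sentence:", return_tensors="pt").to(model.device)

# Sample up to 60 new tokens and decode back to text
output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```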
## Embeddings — Sentence Transformers
Sentence Transformers produce fixed-size embeddings suitable for semantic search, clustering, and RAG:
```python
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") # fast, good quality
# or: "BAAI/bge-large-en-v1.5" (state-of-the-art quality)
# or: "nomic-ai/nomic-embed-text-v1.5" (long context, open weights)
sentences = [
"How do I fix a slow PostgreSQL query?",
"PostgreSQL query optimisation techniques",
"How to bake a chocolate cake",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape) # (3, 384)
# Cosine similarity (1.0 = identical, 0.0 = unrelated)
similarities = np.dot(embeddings, embeddings.T)
print(similarities)
# [[1.00 0.78 0.12]
# [0.78 1.00 0.09]
# [0.12 0.09 1.00]]
# → Sentences 0 and 1 are highly similar (0.78); both are unrelated to sentence 2
```
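The same arithmetic gives a top-1 semantic search. A sketch reusing model, sentences, and embeddings from above:

```python
# Embed the query, rank the corpus by cosine similarity, take the best hit
query_emb = model.encode(["Why is my Postgres SELECT slow?"], normalize_embeddings=True)
scores = np.dot(query_emb, embeddings.T)[0]   # cosine similarity (vectors are normalised)
best = int(np.argmax(scores))
print(f"{scores[best]:.2f}  {sentences[best]}")   # expect one of the PostgreSQL sentences
```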
## Fine-Tuning with the Trainer API
Fine-tuning adapts a pre-trained model to your specific task and data. The Trainer API handles the training loop, evaluation, checkpointing, and logging.
```python
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import numpy as np
import evaluate
# Load dataset
dataset = load_dataset("imdb") # or your custom dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess(examples):
return tokenizer(examples["text"], truncation=True, max_length=512)
tokenised = dataset.map(preprocess, batched=True, remove_columns=["text"])
tokenised = tokenised.rename_column("label", "labels")
# Model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2
)
# Training configuration
training_args = TrainingArguments(
output_dir="./results/sentiment-model",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
warmup_steps=500,
weight_decay=0.01,
logging_dir="./logs",
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
report_to="mlflow", # integrates with MLflow tracking
fp16=True, # mixed precision (faster on GPU)
)
accuracy_metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return accuracy_metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenised["train"],
eval_dataset=tokenised["test"],
tokenizer=tokenizer,
data_collator=DataCollatorWithPadding(tokenizer),
compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./my-sentiment-model")
```
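The saved directory loads straight back into a pipeline for inference — a quick sanity-check sketch:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="./my-sentiment-model")
print(clf("The plot was predictable but the acting saved it."))
# e.g. [{'label': 'LABEL_1', 'score': 0.91}] — map labels via model.config.id2label
```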
## PEFT / LoRA — Fine-Tune Large Models on Consumer Hardware
Full fine-tuning of a 7B+ parameter model requires 80GB+ VRAM. PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation) fine-tunes only a tiny fraction of parameters — achieving near-identical results with a fraction of the resources.
- Full fine-tuning: update all 7 billion parameters.
- LoRA: add small trainable matrices of rank r alongside the frozen weights — only ~0.1–1% of total parameters are trained, so a 7B model is fine-tunable on a single 24GB GPU (a quick parameter count follows below).
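To make that concrete, a back-of-the-envelope count for a single 4096×4096 projection at rank r=16 — illustrative numbers, not a measurement:

```python
d, k, r = 4096, 4096, 16
full = d * k            # 16,777,216 weights updated by full fine-tuning
lora = d * r + r * k    # 131,072 trainable weights in the factors B (d×r) and A (r×k)
print(f"LoRA trains {lora / full:.2%} of this layer")   # 0.78%
```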
```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
# Load base model (frozen)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
load_in_4bit=True, # QLoRA: quantised + LoRA
device_map="auto"
)
# LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank of the update matrices (higher = more capacity, more params)
lora_alpha=32, # scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # which layers to adapt
lora_dropout=0.05,
bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 9,437,184 || all params: 3,761,856,512 || trainable%: 0.25%
# Train with SFTTrainer (instruction fine-tuning)
trainer = SFTTrainer(
model=model,
    train_dataset=dataset,     # your instruction dataset, loaded with datasets.load_dataset
dataset_text_field="text", # column containing formatted prompt+response
max_seq_length=2048,
args=TrainingArguments(
output_dir="./mistral-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # effective batch size = 16
learning_rate=2e-4,
fp16=True,
save_steps=100,
)
)
trainer.train()
# Save only the LoRA adapter (tiny — tens of MB at this rank vs ~14GB for the full model)
model.save_pretrained("./mistral-lora-weights")
```
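To serve the adapter later, load it onto the base model with peft; merge_and_unload() folds the LoRA deltas into the base weights so inference needs no peft at all (a sketch):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
merged = PeftModel.from_pretrained(base, "./mistral-lora-weights").merge_and_unload()
merged.save_pretrained("./mistral-merged")   # standalone checkpoint with the adapter baked in
```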
## Inference API and Inference Endpoints
### Serverless Inference API (No GPU needed for testing)
```python
import requests
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}  # your token from hf.co/settings/tokens
def query(payload):
response = requests.post(API_URL, headers=headers, json=payload)
return response.json()
result = query({"inputs": "This movie was absolutely brilliant!"})
# [{'label': 'POSITIVE', 'score': 0.9998}]
```
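Cold models on the serverless API respond with HTTP 503 while they load into memory. A simple retry wrapper around the query above handles this (retry counts are arbitrary):

```python
import time

def query_with_retry(payload, retries=5, wait=10):
    for _ in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code != 503:   # 503 = model still loading
            return response.json()
        time.sleep(wait)
    response.raise_for_status()
```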
### Dedicated Inference Endpoints (Production)
For production traffic, deploy a dedicated endpoint — your model, your hardware, private:
```python
from huggingface_hub import InferenceClient
# Connect to your dedicated endpoint
client = InferenceClient(
model="https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud",
token=HF_API_TOKEN
)
# Text generation with streaming
for token in client.text_generation(
"Explain transformer attention in simple terms:",
max_new_tokens=300,
stream=True,
temperature=0.7
):
print(token, end="", flush=True)
# Embeddings
embeddings = client.feature_extraction("PostgreSQL indexing strategies")
```
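Recent huggingface_hub releases also expose an OpenAI-style chat method on the same client — a sketch, assuming the endpoint serves a chat model:

```python
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise LoRA in two sentences."}],
    max_tokens=120,
)
print(response.choices[0].message.content)
```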
## Pushing Models to the Hub
````python
from huggingface_hub import HfApi
# Login
from huggingface_hub import login
login(token=HF_TOKEN)
# Push model and tokenizer
model.push_to_hub("my-org/sentiment-classifier-v2")
tokenizer.push_to_hub("my-org/sentiment-classifier-v2")
# Create a model card (README)
from huggingface_hub import ModelCard
card = ModelCard("""
---
language: en
tags:
- text-classification
- sentiment-analysis
metrics:
- accuracy
---
# Sentiment Classifier v2
Fine-tuned distilbert-base-uncased on IMDb dataset.
## Performance
- Accuracy: 93.2%
- F1: 0.932
## Usage
```python
from transformers import pipeline
clf = pipeline("sentiment-analysis", model="my-org/sentiment-classifier-v2")
clf("I loved this product!")""") card.push_to_hub("my-org/sentiment-classifier-v2")
---
## Choosing the Right Model
| Task | Recommended models | Notes |
|------|-------------------|-------|
| Text classification | distilbert, roberta | Fast, accurate for most tasks |
| Text generation | Mistral-7B, Llama-3-8B | Best open-weight options 2024 |
| Embeddings | nomic-embed-text-v1.5, BGE-large | High quality, long context |
| Summarisation | bart-large-cnn, pegasus | Abstractive summarisation |
| Translation | opus-mt series, NLLB-200 | 200 language pairs |
| Code generation | deepseek-coder, starcoder2 | Strong code completion |
| Vision-Language | LLaVA-1.6, PaliGemma | Image + text multimodal |
| Fine-tuning (efficient) | Any 7B model + LoRA | Use QLoRA for single GPU |
---
**Related:** [Azure OpenAI Guide](/articles/azure-openai-guide) — GPT-4o and embeddings on Azure
**Related:** [MLflow Experiment Tracking](/articles/mlflow-experiment-tracking) — track fine-tuning runs
**Related:** [Building a Production RAG Pipeline](/articles/building-production-rag-pipeline)