AI Systems · Intermediate

Hugging Face Transformers — From Model Hub to Production

Complete Hugging Face guide — Model Hub, pipelines, tokenizers, fine-tuning with Trainer API, PEFT/LoRA for efficient fine-tuning, Inference API, and deploying models to production with Inference Endpoints.

SystemForge · April 18, 2026 · 8 min read
Hugging Face · Transformers · LLM · Fine-tuning · PEFT · LoRA · NLP · Python · Machine Learning

Hugging Face is the GitHub of machine learning models — 500,000+ pre-trained models, 150,000+ datasets, and the transformers library that standardises how you load, run, and fine-tune them. Whether you need sentiment analysis, text generation, embeddings, or a custom fine-tuned LLM, Hugging Face is the starting point.


The Ecosystem

┌────────────────────────────────────────────────────────────────┐
│                    Hugging Face Ecosystem                       │
│                                                                 │
│  Model Hub          Datasets          Spaces                   │
│  ──────────         ────────          ──────                   │
│  500k+ models       150k+ datasets    Hosted demos             │
│  Any framework      Streaming-ready   Gradio / Streamlit       │
│                                                                 │
│  Libraries                                                      │
│  ─────────                                                      │
│  transformers    Core: models, tokenizers, pipelines           │
│  datasets        Load and process datasets efficiently         │
│  peft            Parameter-efficient fine-tuning (LoRA)        │
│  trl             RLHF / SFT training                           │
│  accelerate      Multi-GPU / distributed training              │
│  evaluate        Standardised metrics (BLEU, ROUGE, F1...)     │
│  tokenizers      Fast Rust-based tokenization                  │
└────────────────────────────────────────────────────────────────┘
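
The datasets library is worth a closer look: it can stream large corpora instead of downloading them in full. A minimal sketch (the dataset name here is just an illustrative choice):

Python
from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are fetched lazily
stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for example in stream.take(3):   # first three examples, nothing downloaded up front
    print(example["text"][:100])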

Pipelines — Zero-Code Inference

pipeline() is the fastest way to use any model — it handles tokenization, model forward pass, and post-processing:

Python
from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This model is absolutely fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York and Paris.")
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...}, {'entity_group': 'LOC', ...}]

# Text generation
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
output = generator(
    "Explain the difference between SQL and NoSQL in one paragraph:",
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True
)

# Question answering (extractive)
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
# {'answer': 'Paris', 'score': 0.998, 'start': 0, 'end': 5}

# Summarisation
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summariser(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-no")
translated = translator("Hello, how are you?")

# Zero-shot classification (no training needed)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "I love playing chess and solving puzzles",
    candidate_labels=["sports", "technology", "gaming", "food"]
)
# {'labels': ['gaming', 'sports', ...], 'scores': [0.78, 0.12, ...]}
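
By default a pipeline runs on CPU. For larger workloads, pass device (0 = first GPU, -1 = CPU) and batch_size; pipelines accept lists of inputs and batch them internally. A quick sketch:

Python
from transformers import pipeline

# Run on the first GPU and batch inputs for throughput
classifier = pipeline("sentiment-analysis", device=0, batch_size=32)
results = classifier([
    "Great product, would buy again.",
    "Terrible support experience.",
])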

Tokenizers — Understanding What Models Actually See

Before a model sees text, it is tokenized — split into subword tokens and converted to integer IDs. Understanding this is essential for correct input formatting.

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer("Hello world, this is a test!")
print(tokens)
# {'input_ids': [101, 7592, 2088, 1010, 2023, 2003, 1037, 3231, 999, 102],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# Decode back to text
tokenizer.decode(tokens["input_ids"])
# '[CLS] hello world, this is a test! [SEP]'

# Batch encoding with padding and truncation
batch = tokenizer(
    ["Short text.", "This is a much longer text that needs padding to match the batch."],
    padding=True,          # pad shorter sequences to batch max length
    truncation=True,       # truncate to model max length
    max_length=512,
    return_tensors="pt"    # "pt" = PyTorch tensors, "tf" = TensorFlow, "np" = NumPy
)
print(batch["input_ids"].shape)  # torch.Size([2, N]): padding=True pads to the longest sequence in the batch, not to max_length (use padding="max_length" for that)

# Chat template for instruction-tuned models
messages = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
formatted = chat_tokenizer.apply_chat_template(messages, tokenize=False)
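
When the goal is generation rather than just inspecting the formatting, pass add_generation_prompt=True so the prompt ends with the assistant header the model expects (continuing from the snippet above):

Python
input_ids = chat_tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # end the prompt where the assistant's reply should start
    return_tensors="pt"
)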

Loading Models: AutoClasses

Python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Auto-selects the right class based on model config
model = AutoModel.from_pretrained("bert-base-uncased")

# Task-specific model heads
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
gen_model  = AutoModelForCausalLM.from_pretrained("gpt2")

# Load with quantization: reduced memory lets large models run on consumer GPUs
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quant_config,   # 7B model in ~5GB VRAM instead of ~14GB
    device_map="auto"                   # auto-distribute across available GPUs
)
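
Once loaded, generation works the same regardless of quantization: tokenize, generate, decode. A minimal sketch (prompt and sampling settings are illustrative):

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tokenizer("Explain database indexing in two sentences:", return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))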

Embeddings — Sentence Transformers

Sentence Transformers produce fixed-size embeddings suitable for semantic search, clustering, and RAG:

Python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # fast, good quality
# or: "BAAI/bge-large-en-v1.5" (state-of-the-art quality)
# or: "nomic-ai/nomic-embed-text-v1.5" (long context, open weights)

sentences = [
    "How do I fix a slow PostgreSQL query?",
    "PostgreSQL query optimisation techniques",
    "How to bake a chocolate cake",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)

# Cosine similarity (1.0 = identical, 0.0 = unrelated)
similarities = np.dot(embeddings, embeddings.T)
print(similarities)
# [[1.00  0.78  0.12]
#  [0.78  1.00  0.09]
#  [0.12  0.09  1.00]]
#  Query 0 and 1 are very similar (0.78), both very different from query 2
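
The same pieces give you semantic search: embed a query, score it against a pre-embedded corpus, and take the best match. A sketch reusing model, embeddings, and sentences from above (the query is illustrative):

Python
query = model.encode(["my database queries are slow"], normalize_embeddings=True)

# With normalised vectors, dot product equals cosine similarity
scores = np.dot(query, embeddings.T)[0]
best = int(np.argmax(scores))
print(sentences[best], round(float(scores[best]), 2))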

Fine-Tuning with the Trainer API

Fine-tuning adapts a pre-trained model to your specific task and data. The Trainer API handles the training loop, evaluation, checkpointing, and logging.

Python
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import numpy as np
import evaluate

# Load dataset
dataset = load_dataset("imdb")   # or your custom dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenised = dataset.map(preprocess, batched=True, remove_columns=["text"])
tokenised = tokenised.rename_column("label", "labels")

# Model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results/sentiment-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="mlflow",             # integrates with MLflow tracking
    fp16=True,                      # mixed precision (faster on GPU)
)

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("./my-sentiment-model")
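
The saved directory then loads like any Hub model, for example through a pipeline (the output shown is illustrative; label names come from your model config):

Python
from transformers import pipeline

clf = pipeline("sentiment-analysis", model="./my-sentiment-model")
print(clf("The plot was thin but the acting carried it."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}]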

PEFT / LoRA — Fine-Tune Large Models on Consumer Hardware

Full fine-tuning of a 7B+ parameter model requires 80GB+ VRAM. PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation) fine-tunes only a tiny fraction of parameters — achieving near-identical results with a fraction of the resources.

Full fine-tuning: update all 7 billion parameters
LoRA:             add small trainable matrices (rank r) alongside frozen weights
                  only ~0.1-1% of total parameters are trained
                  7B model fine-tunable on a single 24GB GPU
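
Concretely: a single 4096×4096 attention projection holds ~16.8M weights. LoRA with r=16 freezes that matrix and trains two small ones of shape 4096×16 and 16×4096, i.e. 131,072 parameters, roughly 0.8% of the original. Repeated across the targeted layers, the trainable total stays tiny:
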
Python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load base model (frozen)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_4bit=True,        # QLoRA: quantised + LoRA
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # rank of the update matrices (higher = more capacity, more params)
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 9,437,184 || all params: 3,761,856,512 || trainable%: 0.25%

# Train with SFTTrainer (instruction fine-tuning)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",          # column containing formatted prompt+response
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./mistral-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 16
        learning_rate=2e-4,
        fp16=True,
        save_steps=100,
    )
)

trainer.train()

# Save only the LoRA adapter weights (typically tens to a few hundred MB vs ~14GB for the full model)
model.save_pretrained("./mistral-lora-weights")
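
At inference time, load the base model and attach the adapter; merge_and_unload() folds the LoRA weights into the base so there is no adapter overhead. A sketch (paths match the training snippet above):

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
model = PeftModel.from_pretrained(base, "./mistral-lora-weights")
model = model.merge_and_unload()   # optional: bake the adapter into the base weights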

Inference API and Inference Endpoints

Serverless Inference API (No GPU needed for testing)

Python
import os
import requests

HF_API_TOKEN = os.environ["HF_API_TOKEN"]   # your Hugging Face access token

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

result = query({"inputs": "This movie was absolutely brilliant!"})
# [{'label': 'POSITIVE', 'score': 0.9998}]
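
The serverless API loads models on demand, so the first request after a cold start can return HTTP 503 while the model spins up. A simple retry wrapper helps (a sketch reusing API_URL and headers from above; retry counts are arbitrary):

Python
import time

def query_with_retry(payload, retries=5, wait=10.0):
    for _ in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 503:   # model still loading on the backend
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()
    raise TimeoutError("model did not become ready in time")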

Dedicated Inference Endpoints (Production)

For production traffic, deploy a dedicated endpoint — your model, your hardware, private:

Python
from huggingface_hub import InferenceClient

# Connect to your dedicated endpoint
client = InferenceClient(
    model="https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud",
    token=HF_API_TOKEN
)

# Text generation with streaming
for token in client.text_generation(
    "Explain transformer attention in simple terms:",
    max_new_tokens=300,
    stream=True,
    temperature=0.7
):
    print(token, end="", flush=True)

# Embeddings
embeddings = client.feature_extraction("PostgreSQL indexing strategies")
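
Recent versions of huggingface_hub also expose an OpenAI-style chat API on the same client, which is convenient for instruct models:

Python
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
    max_tokens=120,
)
print(response.choices[0].message.content)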

Pushing Models to the Hub

Python
from huggingface_hub import login

# Log in with a write-scoped access token
login(token=HF_TOKEN)

# Push model and tokenizer
model.push_to_hub("my-org/sentiment-classifier-v2")
tokenizer.push_to_hub("my-org/sentiment-classifier-v2")

# Create a model card (README)
from huggingface_hub import ModelCard

card = ModelCard("""
---
language: en
tags:
- text-classification
- sentiment-analysis
metrics:
- accuracy
---

# Sentiment Classifier v2

Fine-tuned distilbert-base-uncased on IMDb dataset.

## Performance
- Accuracy: 93.2%
- F1: 0.932

## Usage
```python
from transformers import pipeline
clf = pipeline("sentiment-analysis", model="my-org/sentiment-classifier-v2")
clf("I loved this product!")

""") card.push_to_hub("my-org/sentiment-classifier-v2")


---

Choosing the Right Model

| Task | Recommended models | Notes |
|------|-------------------|-------|
| Text classification | distilbert, roberta | Fast, accurate for most tasks |
| Text generation | Mistral-7B, Llama-3-8B | Best open-weight options 2024 |
| Embeddings | nomic-embed-text-v1.5, BGE-large | High quality, long context |
| Summarisation | bart-large-cnn, pegasus | Abstractive summarisation |
| Translation | opus-mt series, NLLB-200 | 200 language pairs |
| Code generation | deepseek-coder, starcoder2 | Strong code completion |
| Vision-Language | LLaVA-1.6, PaliGemma | Image + text multimodal |
| Fine-tuning (efficient) | Any 7B model + LoRA | Use QLoRA for single GPU |

---

Related: [Azure OpenAI Guide](/articles/azure-openai-guide) — GPT-4o and embeddings on Azure
Related: [MLflow Experiment Tracking](/articles/mlflow-experiment-tracking) — track fine-tuning runs
Related: [Building a Production RAG Pipeline](/articles/building-production-rag-pipeline)
