AI Systems · Intermediate

Hugging Face Transformers — From Model Hub to Production

Complete Hugging Face guide — Model Hub, pipelines, tokenizers, fine-tuning with Trainer API, PEFT/LoRA for efficient fine-tuning, Inference API, and deploying models to production with Inference Endpoints.

SystemForge · April 18, 2026 · 8 min read
Hugging Face · Transformers · LLM · Fine-tuning · PEFT · LoRA · NLP · Python · Machine Learning

Hugging Face is the GitHub of machine learning models — 500,000+ pre-trained models, 150,000+ datasets, and the transformers library that standardises how you load, run, and fine-tune them. Whether you need sentiment analysis, text generation, embeddings, or a custom fine-tuned LLM, Hugging Face is the starting point.


The Ecosystem

┌────────────────────────────────────────────────────────────────┐
│                    Hugging Face Ecosystem                       │
│                                                                 │
│  Model Hub          Datasets          Spaces                   │
│  ──────────         ────────          ──────                   │
│  500k+ models       150k+ datasets    Hosted demos             │
│  Any framework      Streaming-ready   Gradio / Streamlit       │
│                                                                 │
│  Libraries                                                      │
│  ─────────                                                      │
│  transformers    Core: models, tokenizers, pipelines           │
│  datasets        Load and process datasets efficiently         │
│  peft            Parameter-efficient fine-tuning (LoRA)        │
│  trl             RLHF / SFT training                           │
│  accelerate      Multi-GPU / distributed training              │
│  evaluate        Standardised metrics (BLEU, ROUGE, F1...)     │
│  tokenizers      Fast Rust-based tokenization                  │
└────────────────────────────────────────────────────────────────┘
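
The datasets library is worth a closer look: it can stream large corpora instead of downloading them in full. A minimal sketch (the dataset name here is just an illustrative choice):

Python
from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are fetched lazily
stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for example in stream.take(3):   # first three examples, nothing downloaded up front
    print(example["text"][:100])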

Pipelines — Zero-Code Inference

pipeline() is the fastest way to use any model — it handles tokenization, model forward pass, and post-processing:

Python
from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This model is absolutely fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York and Paris.")
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...}, {'entity_group': 'LOC', ...}]

# Text generation
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
output = generator(
    "Explain the difference between SQL and NoSQL in one paragraph:",
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True
)

# Question answering (extractive)
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
# {'answer': 'Paris', 'score': 0.998, 'start': 0, 'end': 5}

# Summarisation
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summariser(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-no")
translated = translator("Hello, how are you?")

# Zero-shot classification (no training needed)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "I love playing chess and solving puzzles",
    candidate_labels=["sports", "technology", "gaming", "food"]
)
# {'labels': ['gaming', 'sports', ...], 'scores': [0.78, 0.12, ...]}
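
By default a pipeline runs on CPU. For larger workloads, pass device (0 = first GPU, -1 = CPU) and batch_size; pipelines accept lists of inputs and batch them internally. A quick sketch:

Python
from transformers import pipeline

# Run on the first GPU and batch inputs for throughput
classifier = pipeline("sentiment-analysis", device=0, batch_size=32)
results = classifier([
    "Great product, would buy again.",
    "Terrible support experience.",
])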

Tokenizers — Understanding What Models Actually See

Before a model sees text, it is tokenized — split into subword tokens and converted to integer IDs. Understanding this is essential for correct input formatting.

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer("Hello world, this is a test!")
print(tokens)
# {'input_ids': [101, 7592, 2088, 1010, 2023, 2003, 1037, 3231, 999, 102],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# Decode back to text
tokenizer.decode(tokens["input_ids"])
# '[CLS] hello world, this is a test! [SEP]'

# Batch encoding with padding and truncation
batch = tokenizer(
    ["Short text.", "This is a much longer text that needs padding to match the batch."],
    padding=True,          # pad shorter sequences to batch max length
    truncation=True,       # truncate to model max length
    max_length=512,
    return_tensors="pt"    # "pt" = PyTorch tensors, "tf" = TensorFlow, "np" = NumPy
)
print(batch["input_ids"].shape)  # torch.Size([2, N]): padding=True pads to the longest sequence in the batch, not to max_length (use padding="max_length" for that)

# Chat template for instruction-tuned models
messages = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
formatted = chat_tokenizer.apply_chat_template(messages, tokenize=False)
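
When the goal is generation rather than just inspecting the formatting, pass add_generation_prompt=True so the prompt ends with the assistant header the model expects (continuing from the snippet above):

Python
input_ids = chat_tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # end the prompt where the assistant's reply should start
    return_tensors="pt"
)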

Loading Models: AutoClasses

Python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Auto-selects the right class based on model config
model = AutoModel.from_pretrained("bert-base-uncased")

# Task-specific model heads
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
gen_model  = AutoModelForCausalLM.from_pretrained("gpt2")

# Load with quantization: reduced memory lets large models run on consumer GPUs
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quant_config,   # 7B model in ~5GB VRAM instead of ~14GB
    device_map="auto"                   # auto-distribute across available GPUs
)
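
Once loaded, generation works the same regardless of quantization: tokenize, generate, decode. A minimal sketch (prompt and sampling settings are illustrative):

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tokenizer("Explain database indexing in two sentences:", return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))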

Embeddings — Sentence Transformers

Sentence Transformers produce fixed-size embeddings suitable for semantic search, clustering, and RAG:

Python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # fast, good quality
# or: "BAAI/bge-large-en-v1.5" (state-of-the-art quality)
# or: "nomic-ai/nomic-embed-text-v1.5" (long context, open weights)

sentences = [
    "How do I fix a slow PostgreSQL query?",
    "PostgreSQL query optimisation techniques",
    "How to bake a chocolate cake",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)

# Cosine similarity (1.0 = identical, 0.0 = unrelated)
similarities = np.dot(embeddings, embeddings.T)
print(similarities)
# [[1.00  0.78  0.12]
#  [0.78  1.00  0.09]
#  [0.12  0.09  1.00]]
#  Query 0 and 1 are very similar (0.78), both very different from query 2
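
The same pieces give you semantic search: embed a query, score it against a pre-embedded corpus, and take the best match. A sketch reusing model, embeddings, and sentences from above (the query is illustrative):

Python
query = model.encode(["my database queries are slow"], normalize_embeddings=True)

# With normalised vectors, dot product equals cosine similarity
scores = np.dot(query, embeddings.T)[0]
best = int(np.argmax(scores))
print(sentences[best], round(float(scores[best]), 2))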

Fine-Tuning with the Trainer API

Fine-tuning adapts a pre-trained model to your specific task and data. The Trainer API handles the training loop, evaluation, checkpointing, and logging.

Python
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import numpy as np
import evaluate

# Load dataset
dataset = load_dataset("imdb")   # or your custom dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenised = dataset.map(preprocess, batched=True, remove_columns=["text"])
tokenised = tokenised.rename_column("label", "labels")

# Model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results/sentiment-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="mlflow",             # integrates with MLflow tracking
    fp16=True,                      # mixed precision (faster on GPU)
)

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("./my-sentiment-model")
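
The saved directory then loads like any Hub model, for example through a pipeline (the output shown is illustrative; label names come from your model config):

Python
from transformers import pipeline

clf = pipeline("sentiment-analysis", model="./my-sentiment-model")
print(clf("The plot was thin but the acting carried it."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}]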

PEFT / LoRA — Fine-Tune Large Models on Consumer Hardware

Full fine-tuning of a 7B+ parameter model requires 80GB+ VRAM. PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation) fine-tunes only a tiny fraction of parameters — achieving near-identical results with a fraction of the resources.

Full fine-tuning: update all 7 billion parameters
LoRA:             add small trainable matrices (rank r) alongside frozen weights
                  only ~0.1-1% of total parameters are trained
                  7B model fine-tunable on a single 24GB GPU
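
Concretely: a single 4096×4096 attention projection holds ~16.8M weights. LoRA with r=16 freezes that matrix and trains two small ones of shape 4096×16 and 16×4096, i.e. 131,072 parameters, roughly 0.8% of the original. Repeated across the targeted layers, the trainable total stays tiny:
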
Python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load base model (frozen)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_4bit=True,        # QLoRA: quantised + LoRA
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # rank of the update matrices (higher = more capacity, more params)
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 9,437,184 || all params: 3,761,856,512 || trainable%: 0.25%

# Train with SFTTrainer (instruction fine-tuning)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",          # column containing formatted prompt+response
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./mistral-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 16
        learning_rate=2e-4,
        fp16=True,
        save_steps=100,
    )
)

trainer.train()

# Save only the LoRA adapter weights (typically tens to a few hundred MB vs ~14GB for the full model)
model.save_pretrained("./mistral-lora-weights")
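
At inference time, load the base model and attach the adapter; merge_and_unload() folds the LoRA weights into the base so there is no adapter overhead. A sketch (paths match the training snippet above):

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
model = PeftModel.from_pretrained(base, "./mistral-lora-weights")
model = model.merge_and_unload()   # optional: bake the adapter into the base weights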

Inference API and Inference Endpoints

Serverless Inference API (No GPU needed for testing)

Python
import os
import requests

HF_API_TOKEN = os.environ["HF_API_TOKEN"]   # your Hugging Face access token

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

result = query({"inputs": "This movie was absolutely brilliant!"})
# [{'label': 'POSITIVE', 'score': 0.9998}]
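
The serverless API loads models on demand, so the first request after a cold start can return HTTP 503 while the model spins up. A simple retry wrapper helps (a sketch reusing API_URL and headers from above; retry counts are arbitrary):

Python
import time

def query_with_retry(payload, retries=5, wait=10.0):
    for _ in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 503:   # model still loading on the backend
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()
    raise TimeoutError("model did not become ready in time")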

Dedicated Inference Endpoints (Production)

For production traffic, deploy a dedicated endpoint — your model, your hardware, private:

Python
from huggingface_hub import InferenceClient

# Connect to your dedicated endpoint
client = InferenceClient(
    model="https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud",
    token=HF_API_TOKEN
)

# Text generation with streaming
for token in client.text_generation(
    "Explain transformer attention in simple terms:",
    max_new_tokens=300,
    stream=True,
    temperature=0.7
):
    print(token, end="", flush=True)

# Embeddings
embeddings = client.feature_extraction("PostgreSQL indexing strategies")
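
Recent versions of huggingface_hub also expose an OpenAI-style chat API on the same client, which is convenient for instruct models:

Python
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
    max_tokens=120,
)
print(response.choices[0].message.content)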

Pushing Models to the Hub

Python
from huggingface_hub import login

# Log in with a write-scoped access token
login(token=HF_TOKEN)

# Push model and tokenizer
model.push_to_hub("my-org/sentiment-classifier-v2")
tokenizer.push_to_hub("my-org/sentiment-classifier-v2")

# Create a model card (README)
from huggingface_hub import ModelCard

card = ModelCard("""
---
language: en
tags:
- text-classification
- sentiment-analysis
metrics:
- accuracy
---

# Sentiment Classifier v2

Fine-tuned distilbert-base-uncased on IMDb dataset.

## Performance
- Accuracy: 93.2%
- F1: 0.932

## Usage
```python
from transformers import pipeline
clf = pipeline("sentiment-analysis", model="my-org/sentiment-classifier-v2")
clf("I loved this product!")

""") card.push_to_hub("my-org/sentiment-classifier-v2")


---

Choosing the Right Model

| Task | Recommended models | Notes |
|------|-------------------|-------|
| Text classification | distilbert, roberta | Fast, accurate for most tasks |
| Text generation | Mistral-7B, Llama-3-8B | Best open-weight options 2024 |
| Embeddings | nomic-embed-text-v1.5, BGE-large | High quality, long context |
| Summarisation | bart-large-cnn, pegasus | Abstractive summarisation |
| Translation | opus-mt series, NLLB-200 | 200 language pairs |
| Code generation | deepseek-coder, starcoder2 | Strong code completion |
| Vision-Language | LLaVA-1.6, PaliGemma | Image + text multimodal |
| Fine-tuning (efficient) | Any 7B model + LoRA | Use QLoRA for single GPU |

---

Related: [Azure OpenAI Guide](/articles/azure-openai-guide) — GPT-4o and embeddings on Azure
Related: [MLflow Experiment Tracking](/articles/mlflow-experiment-tracking) — track fine-tuning runs
Related: [Building a Production RAG Pipeline](/articles/building-production-rag-pipeline)
