MLflow — Experiment Tracking, Model Registry & Deployment
Complete MLflow guide for ML engineers — tracking experiments, comparing runs, registering models, managing lifecycle stages, serving models as REST APIs, and integrating with Azure ML and Databricks.
MLflow is the open-source platform that brings software engineering discipline to machine learning: reproducible experiments, versioned models, and consistent deployment. Without it, ML projects become a folder of model_v2_final_FINAL.pkl files with no record of what parameters produced them or whether they actually improved on the baseline.
The Four Components
┌──────────────────────────────────────────────────────────────┐
│                            MLflow                            │
│                                                              │
│  Tracking        Model Registry     Projects       Models    │
│  ──────────      ──────────────     ────────       ──────    │
│  Log params,     Version & stage    Reproducible   Serve as  │
│  metrics,        models:            environments   REST API, │
│  artifacts       Staging → Prod     (conda, pip)   batch, or │
│  per run         per model                         serverless│
└──────────────────────────────────────────────────────────────┘
Tracking: Logging Experiments
MLflow organises work into experiments, each containing one or more runs. A run logs parameters, metrics, and artifacts (model files, plots, datasets).
Basic Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import pandas as pd
# Set tracking URI (local, remote server, or Azure ML / Databricks)
mlflow.set_tracking_uri("http://localhost:5000") # local MLflow server
# mlflow.set_tracking_uri("azureml://...") # Azure ML
# mlflow.set_tracking_uri("databricks") # Databricks
mlflow.set_experiment("customer-churn-prediction")
# X (feature DataFrame) and y (labels) are assumed to be prepared upstream
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
with mlflow.start_run(run_name="rf-baseline"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 5, "min_samples_split": 10}
    mlflow.log_params(params)

    # Train
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics
    preds = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    mlflow.log_metric("f1_score", f1_score(y_test, preds, average="weighted"))

    # Log the model (with input schema for validation)
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        input_example=X_train.iloc[:3]
    )

    # Log artifacts (plots, feature importance, confusion matrix)
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    pd.Series(model.feature_importances_, index=X.columns).sort_values().plot.barh(ax=ax)
    plt.tight_layout()
    mlflow.log_figure(fig, "feature_importance.png")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
Autologging — Zero-Code Instrumentation
MLflow autolog captures parameters, metrics, and models automatically for supported frameworks:
mlflow.autolog()  # enables all supported frameworks
# Works automatically for: sklearn, XGBoost, LightGBM, PyTorch, TensorFlow/Keras, Spark MLlib

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    # MLflow has already logged: all params, training metrics, and the fitted model
Logging Training Curves (Deep Learning)
import mlflow
import torch
import torch.nn as nn

# model, optimizer, train_loader, val_loader and the train_one_epoch /
# evaluate helpers are assumed to be defined elsewhere
with mlflow.start_run():
    mlflow.log_params({
        "learning_rate": 1e-3,
        "batch_size": 32,
        "epochs": 50,
        "architecture": "ResNet18"
    })

    for epoch in range(50):
        train_loss = train_one_epoch(model, train_loader, optimizer)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log step-level metrics (creates time series in UI)
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log PyTorch model
    mlflow.pytorch.log_model(model, "model")
Comparing Runs — Finding the Best Model
import mlflow

client = mlflow.MlflowClient()
experiment = client.get_experiment_by_name("customer-churn-prediction")

# Query runs: filter, sort, get top performers
# (params are stored as strings, so they only support = / != / LIKE filters)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.f1_score > 0.85 AND params.n_estimators = '100'",
    order_by=["metrics.f1_score DESC"],
    max_results=10
)

for run in runs:
    print(f"Run: {run.info.run_id[:8]} "
          f"F1: {run.data.metrics['f1_score']:.4f} "
          f"Params: {run.data.params}")

# Get the best run
best_run = runs[0]
print(f"Best run: {best_run.info.run_id}")
print(f"Best F1: {best_run.data.metrics['f1_score']:.4f}")
Model Registry: From Experiment to Production
The Model Registry is the versioning system for production-ready models. Each registered model has versions and lifecycle stages.
Stages:
None → newly registered, not promoted
Staging → validated, testing in pre-prod
Production → live, serving real traffic
Archived → replaced, kept for reference
Register and Promote a Model
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a completed run
model_uri = f"runs:/{best_run.info.run_id}/model"
registered = mlflow.register_model(model_uri, "ChurnPredictor")
print(f"Model version: {registered.version}")

# Add description and tags
client.update_model_version(
    name="ChurnPredictor",
    version=registered.version,
    description="RandomForest trained on Q1 2026 data. F1=0.891"
)
client.set_model_version_tag("ChurnPredictor", registered.version,
                             "validated_by", "data-science-team")

# Promote to staging after validation
client.transition_model_version_stage(
    name="ChurnPredictor",
    version=registered.version,
    stage="Staging",
    archive_existing_versions=False
)

# Promote to production (archives current production version)
client.transition_model_version_stage(
    name="ChurnPredictor",
    version=registered.version,
    stage="Production",
    archive_existing_versions=True  # old production → Archived
)
Loading Models by Stage
# Always loads the current Production version — no code change needed when you promote
model = mlflow.sklearn.load_model("models:/ChurnPredictor/Production")
predictions = model.predict(new_data)
# Load specific version (for A/B testing or rollback)
model_v3 = mlflow.sklearn.load_model("models:/ChurnPredictor/3")
# Load from Staging for validation
staging_model = mlflow.sklearn.load_model("models:/ChurnPredictor/Staging")
Serving Models as REST APIs
Local Serving (Testing)
# Serve any registered model or run artifact
mlflow models serve \
  --model-uri "models:/ChurnPredictor/Production" \
  --port 5001 \
  --env-manager conda

# Call the endpoint
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_records": [{"age": 35, "tenure": 24, "monthly_charge": 85.5}]}'
Containerised Deployment
# Build a Docker image containing the model + server
mlflow models build-docker \
  --model-uri "models:/ChurnPredictor/Production" \
  --name "churn-predictor:v3"

# Run locally
docker run -p 5001:8080 churn-predictor:v3

# Deploy to AKS, Azure Container Apps, or any container platform
Custom Python Model (Preprocessing + Model Pipeline)
class ChurnPipelineModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import pickle
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)
        self.model = mlflow.sklearn.load_model(context.artifacts["model"])

    def predict(self, context, model_input):
        processed = self.preprocessor.transform(model_input)
        return self.model.predict_proba(processed)[:, 1]

# base_run_id is the run that logged the underlying sklearn model
with mlflow.start_run():
    # Save preprocessor as artifact
    mlflow.log_artifact("preprocessor.pkl")
    mlflow.pyfunc.log_model(
        artifact_path="pipeline",
        python_model=ChurnPipelineModel(),
        artifacts={
            "preprocessor": "preprocessor.pkl",
            "model": f"runs:/{base_run_id}/model"
        },
        conda_env={
            "channels": ["defaults"],
            "dependencies": ["python=3.11", "scikit-learn=1.4.0", "pandas=2.0.0"]
        }
    )
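Once logged, the wrapped pipeline loads like any other MLflow model. A brief usage sketch; pipeline_run_id and new_customers_df are placeholders:

# Sketch: load the custom pyfunc model and score a DataFrame
pipeline = mlflow.pyfunc.load_model(f"runs:/{pipeline_run_id}/pipeline")
churn_probabilities = pipeline.predict(new_customers_df)  # one P(churn) per row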
MLflow on Azure ML
Azure Machine Learning integrates MLflow natively — you use the same MLflow API while Azure handles the storage and compute.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to Azure ML workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="rg-ml",
    workspace_name="mlws-prod"
)

# Get the MLflow tracking URI from the workspace
tracking_uri = ml_client.workspaces.get("mlws-prod").mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)

# Now all mlflow.log_* calls go to Azure ML
# Models registered via mlflow.register_model appear in the Azure ML Model Registry
# Runs appear in the Azure ML Experiments UI
MLflow on Databricks
Databricks includes a managed MLflow server — no setup required.
# In a Databricks notebook — MLflow is already configured
import mlflow

# Databricks uses the workspace tracking server automatically
# No mlflow.set_tracking_uri needed
mlflow.set_experiment("/Users/me@company.com/churn-experiment")
with mlflow.start_run():
    mlflow.log_param("model_type", "xgboost")
    # ... train and log as normal

# Register to Databricks Unity Catalog Model Registry
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    f"runs:/{run_id}/model",
    "main.ml_models.churn_predictor"  # catalog.schema.model_name
)
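Loading a Unity Catalog model back uses the same models:/ URI scheme with the three-level name. A short sketch; the version number and scoring_df are illustrative:

# Sketch: load a specific version from the Unity Catalog registry and score
mlflow.set_registry_uri("databricks-uc")
champion = mlflow.pyfunc.load_model("models:/main.ml_models.churn_predictor/1")
predictions = champion.predict(scoring_df)  # scoring_df: a features DataFrame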
Production MLOps Workflow
Data Scientists:
  experiment → track with MLflow → compare runs → register best model
MLOps / CI pipeline:
  new model registered → automated validation tests
  → if pass: promote to Staging
  → integration tests: load from Staging URI, run inference on test set
  → if pass: promote to Production
Production:
  service loads "models:/ModelName/Production" at startup
  → new model version in Production = automatic rollout
  → rollback: demote version, promote previous
A minimal sketch of the automated validation-and-promotion step is shown below.
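This is a hedged sketch of what that CI step might look like, reusing the MlflowClient calls from the registry section; the F1 threshold, the held-out test data, and the overall flow are illustrative rather than prescribed:

# Sketch: validate the latest Staging version, promote to Production if it passes
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.metrics import f1_score

client = MlflowClient()
staging_versions = client.get_latest_versions("ChurnPredictor", stages=["Staging"])

for version in staging_versions:
    candidate = mlflow.sklearn.load_model(f"models:/ChurnPredictor/{version.version}")
    preds = candidate.predict(X_test)  # X_test / y_test: held-out set available to the CI job
    f1 = f1_score(y_test, preds, average="weighted")

    if f1 > 0.85:  # illustrative threshold
        client.transition_model_version_stage(
            name="ChurnPredictor",
            version=version.version,
            stage="Production",
            archive_existing_versions=True
        )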
Related: Databricks Guide — Delta Lake, Spark, MLflow on Databricks
Related: Hugging Face Transformers — fine-tuning and deploying LLMs
Related: Building a Production RAG Pipeline