AI Systems Engineering · Lesson 2 of 2

Azure Data Factory ETL

Why Azure Data Factory?

When your data lives in 5 different places and needs to end up in one, you need an orchestration tool. Azure Data Factory (ADF) is Microsoft's cloud ETL/ELT service — and it's surprisingly powerful once you understand its patterns.

The Architecture

Source Systems          Azure Data Factory          Destinations
┌───────────┐           ┌──────────────┐           ┌──────────────┐
│  SQL DB   │──────────▶│   Pipeline   │──────────▶│  Data Lake   │
│  REST API │──────────▶│   + Dataflow │──────────▶│  SQL DW      │
│  Files    │──────────▶│              │──────────▶│  Cosmos DB   │
└───────────┘           └──────────────┘           └──────────────┘

Core Concepts

1. Linked Services

Connections to your data sources and destinations:

JSON
{
  "name": "SqlServerSource",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "@{linkedService().connectionString}"
    }
  }
}

Best practice: Use Azure Key Vault for connection strings. Never hardcode credentials.
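With Key Vault, the linked service references a secret instead of embedding the connection string. A sketch of that shape (the vault name and secret name here are placeholders):

JSON
{
  "name": "SqlServerSource",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "MyKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-connection-string"
      }
    }
  }
}

ADF fetches the secret at runtime, so rotating the credential in Key Vault requires no pipeline change.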

2. Datasets

Schema definitions for your data:

JSON
{
  "name": "CustomerTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "SqlServerSource",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "tableName": "dbo.Customers"
    }
  }
}

3. Pipelines

The orchestration logic — where things happen in order:

  • Copy Activity: Move data from A to B
  • Data Flow: Transform data with a visual editor
  • Stored Procedure: Run SQL transformations
  • Web Activity: Call REST APIs
  • If/ForEach: Control flow
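These activities are wired together inside a pipeline definition. A minimal shell, trimmed to show just the orchestration (activity names are illustrative and typeProperties are omitted):

JSON
{
  "name": "CustomerSyncPipeline",
  "properties": {
    "parameters": {
      "lastRun": { "type": "string" }
    },
    "activities": [
      { "name": "ExtractCustomers", "type": "Copy" },
      {
        "name": "TransformCustomers",
        "type": "ExecuteDataFlow",
        "dependsOn": [
          { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
        ]
      }
    ]
  }
}

dependsOn is how ADF expresses ordering: an activity runs only when its listed dependencies finish with the given conditions.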

A Real Pipeline: Customer Data Sync

Let's build a pipeline that:

  1. Extracts customer data from a SQL database
  2. Transforms it (clean, deduplicate, enrich)
  3. Loads it into a Data Lake as Parquet files

Step 1: Extract

Use a Copy Activity to pull data:

JSON
{
  "name": "ExtractCustomers",
  "type": "Copy",
  "inputs": [{ "referenceName": "SqlCustomers", "type": "DatasetReference" }],
  "outputs": [{ "referenceName": "RawCustomerParquet", "type": "DatasetReference" }],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM Customers WHERE ModifiedDate > '@{pipeline().parameters.lastRun}'"
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}

Key pattern: Incremental loads. Don't extract everything every time — use a watermark column like ModifiedDate.
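A common way to implement the watermark (control table and dataset names are assumed here) is a Lookup activity that reads the last successful value before the copy runs:

JSON
{
  "name": "LookupOldWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Customers'"
    },
    "dataset": { "referenceName": "WatermarkDataset", "type": "DatasetReference" }
  }
}

Downstream activities reference the result as @activity('LookupOldWatermark').output.firstRow.WatermarkValue, and a final Stored Procedure activity writes the new high-water mark only after a successful load.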

Step 2: Transform

Use a Data Flow for transformations:

  1. Source: Read from raw Parquet files
  2. Filter: Remove invalid records (null email, test accounts)
  3. Derived Column: Normalize phone numbers, standardize names
  4. Aggregate: Deduplicate by email
  5. Sink: Write to curated Data Lake zone

Step 3: Load

Write the clean data to your analytics layer:

Raw Zone (Bronze)  →  Curated Zone (Silver)  →  Analytics Zone (Gold)
   ↑                       ↑                        ↑
 As-is copy           Cleaned data           Business-ready

This is the Medallion Architecture — a proven pattern for data lakes.

Production Patterns

Error Handling

Every production pipeline needs error handling:

Pipeline Start
    │
    ├── Extract ──── On Failure ──▶ Send Alert
    │      │
    │   On Success
    │      │
    ├── Transform ── On Failure ──▶ Log Error + Retry
    │      │
    │   On Success
    │      │
    └── Load ─────── On Failure ──▶ Quarantine Bad Data
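In pipeline JSON, the failure paths above are expressed with dependencyConditions plus per-activity retry policies. A sketch of both patterns (activity names and the webhook URL are placeholders):

JSON
[
  {
    "name": "Transform",
    "type": "ExecuteDataFlow",
    "policy": { "retry": 2, "retryIntervalInSeconds": 120 },
    "dependsOn": [
      { "activity": "Extract", "dependencyConditions": ["Succeeded"] }
    ]
  },
  {
    "name": "SendAlert",
    "type": "WebActivity",
    "typeProperties": {
      "url": "https://example.com/alert-webhook",
      "method": "POST",
      "body": { "message": "@{pipeline().Pipeline} failed" }
    },
    "dependsOn": [
      { "activity": "Extract", "dependencyConditions": ["Failed"] }
    ]
  }
]

The retry policy handles transient faults automatically; the Failed dependency only fires when retries are exhausted.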

Parameterization

Make pipelines reusable:

JSON
{
  "parameters": {
    "sourceTable": { "type": "string" },
    "targetPath": { "type": "string" },
    "lastRun": { "type": "string" }
  }
}

Now one pipeline handles multiple tables.
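Combined with a ForEach activity, those parameters let one pipeline fan the same copy logic across many tables. In this sketch, the tableList array parameter and the item().name property are assumed:

JSON
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM @{item().name}"
          },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}

Each iteration exposes the current array element as item(), so the inner activities never hardcode a table name.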

Monitoring

ADF has built-in run monitoring in the portal, but you should also:

  • Log pipeline runs to a SQL table
  • Track row counts (source vs. destination)
  • Alert on anomalies (50% fewer rows than usual = something broke)
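Row counts come free with the Copy Activity: its output exposes rowsRead and rowsCopied. You can persist them with a Stored Procedure activity after each run (the logging procedure name here is assumed):

JSON
{
  "name": "LogPipelineRun",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "SqlServerSource", "type": "LinkedServiceReference" },
  "dependsOn": [
    { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
  ],
  "typeProperties": {
    "storedProcedureName": "etl.usp_LogRun",
    "storedProcedureParameters": {
      "PipelineName": { "value": "@pipeline().Pipeline", "type": "String" },
      "RowsRead": { "value": "@activity('ExtractCustomers').output.rowsRead", "type": "Int64" },
      "RowsCopied": { "value": "@activity('ExtractCustomers').output.rowsCopied", "type": "Int64" }
    }
  }
}

Comparing RowsRead against RowsCopied per run gives you the source-vs-destination check from the list above.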

Common Mistakes

  1. Full loads every time — Use incremental extraction with watermarks
  2. No error handling — Silent failures are worse than loud ones
  3. Ignoring costs — Data flows (Spark clusters) are expensive. Use copy activities when you can
  4. Hardcoded values — Parameterize everything
  5. No testing — Test with sample data before running on production

Cost Optimization

  • Use Self-hosted Integration Runtime for on-premises sources
  • Schedule pipelines during off-peak hours
  • Use Copy Activity instead of Data Flows for simple transformations
  • Set TTL on Data Flow clusters to avoid cold start costs
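The TTL lives on the Azure Integration Runtime that executes your data flows: within the TTL window, new data flow runs reuse the warm Spark cluster instead of cold-starting one. A sketch of that setting (runtime name and sizing are illustrative; timeToLive is in minutes):

JSON
{
  "name": "DataFlowRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}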

Next Steps

In the next article, we'll cover:

  • Delta Lake integration with ADF
  • CI/CD for ADF pipelines with Azure DevOps
  • Monitoring and alerting patterns
  • Data quality validation