AI Systems Engineering · Lesson 2 of 2

Azure Data Factory ETL

Why Azure Data Factory?

When your data lives in 5 different places and needs to end up in one, you need an orchestration tool. Azure Data Factory (ADF) is Microsoft's cloud ETL/ELT service — and it's surprisingly powerful once you understand its patterns.

The Architecture

Source Systems          Azure Data Factory          Destinations
┌───────────┐           ┌──────────────┐           ┌──────────────┐
│  SQL DB   │──────────▶│   Pipeline   │──────────▶│  Data Lake   │
│  REST API │──────────▶│   + Dataflow │──────────▶│  SQL DW      │
│  Files    │──────────▶│              │──────────▶│  Cosmos DB   │
└───────────┘           └──────────────┘           └──────────────┘

Core Concepts

1. Linked Services

Connections to your data sources and destinations:

JSON
{
  "name": "SqlServerSource",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "@{linkedService().connectionString}"
    }
  }
}

Best practice: Use Azure Key Vault for connection strings. Never hardcode credentials.
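With Key Vault, the linked service references a secret instead of embedding the connection string. A sketch of that shape (the vault name and secret name here are placeholders):

JSON
{
  "name": "SqlServerSource",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "MyKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-connection-string"
      }
    }
  }
}

ADF fetches the secret at runtime, so rotating the credential in Key Vault requires no pipeline change.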

2. Datasets

Schema definitions for your data:

JSON
{
  "name": "CustomerTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "SqlServerSource",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "tableName": "dbo.Customers"
    }
  }
}

3. Pipelines

The orchestration logic — where things happen in order:

  • Copy Activity: Move data from A to B
  • Data Flow: Transform data with a visual editor
  • Stored Procedure: Run SQL transformations
  • Web Activity: Call REST APIs
  • If/ForEach: Control flow
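These activities are wired together inside a pipeline definition. A minimal shell, trimmed to show just the orchestration (activity names are illustrative and typeProperties are omitted):

JSON
{
  "name": "CustomerSyncPipeline",
  "properties": {
    "parameters": {
      "lastRun": { "type": "string" }
    },
    "activities": [
      { "name": "ExtractCustomers", "type": "Copy" },
      {
        "name": "TransformCustomers",
        "type": "ExecuteDataFlow",
        "dependsOn": [
          { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
        ]
      }
    ]
  }
}

dependsOn is how ADF expresses ordering: an activity runs only when its listed dependencies finish with the given conditions.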

A Real Pipeline: Customer Data Sync

Let's build a pipeline that:

  1. Extracts customer data from a SQL database
  2. Transforms it (clean, deduplicate, enrich)
  3. Loads it into a Data Lake as Parquet files

Step 1: Extract

Use a Copy Activity to pull data:

JSON
{
  "name": "ExtractCustomers",
  "type": "Copy",
  "inputs": [{ "referenceName": "SqlCustomers", "type": "DatasetReference" }],
  "outputs": [{ "referenceName": "RawCustomerParquet", "type": "DatasetReference" }],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM Customers WHERE ModifiedDate > '@{pipeline().parameters.lastRun}'"
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}

Key pattern: Incremental loads. Don't extract everything every time — use a watermark column like ModifiedDate.
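A common way to implement the watermark (control table and dataset names are assumed here) is a Lookup activity that reads the last successful value before the copy runs:

JSON
{
  "name": "LookupOldWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Customers'"
    },
    "dataset": { "referenceName": "WatermarkDataset", "type": "DatasetReference" }
  }
}

Downstream activities reference the result as @activity('LookupOldWatermark').output.firstRow.WatermarkValue, and a final Stored Procedure activity writes the new high-water mark only after a successful load.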

Step 2: Transform

Use a Data Flow for transformations:

  1. Source: Read from raw Parquet files
  2. Filter: Remove invalid records (null email, test accounts)
  3. Derived Column: Normalize phone numbers, standardize names
  4. Aggregate: Deduplicate by email
  5. Sink: Write to curated Data Lake zone

Step 3: Load

Write the clean data to your analytics layer:

Raw Zone (Bronze)  →  Curated Zone (Silver)  →  Analytics Zone (Gold)
   ↑                       ↑                        ↑
 As-is copy           Cleaned data           Business-ready

This is the Medallion Architecture — a proven pattern for data lakes.

Production Patterns

Error Handling

Every production pipeline needs error handling:

Pipeline Start
    │
    ├── Extract ──── On Failure ──▶ Send Alert
    │      │
    │   On Success
    │      │
    ├── Transform ── On Failure ──▶ Log Error + Retry
    │      │
    │   On Success
    │      │
    └── Load ─────── On Failure ──▶ Quarantine Bad Data
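In pipeline JSON, the failure paths above are expressed with dependencyConditions plus per-activity retry policies. A sketch of both patterns (activity names and the webhook URL are placeholders):

JSON
[
  {
    "name": "Transform",
    "type": "ExecuteDataFlow",
    "policy": { "retry": 2, "retryIntervalInSeconds": 120 },
    "dependsOn": [
      { "activity": "Extract", "dependencyConditions": ["Succeeded"] }
    ]
  },
  {
    "name": "SendAlert",
    "type": "WebActivity",
    "typeProperties": {
      "url": "https://example.com/alert-webhook",
      "method": "POST",
      "body": { "message": "@{pipeline().Pipeline} failed" }
    },
    "dependsOn": [
      { "activity": "Extract", "dependencyConditions": ["Failed"] }
    ]
  }
]

The retry policy handles transient faults automatically; the Failed dependency only fires when retries are exhausted.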

Parameterization

Make pipelines reusable:

JSON
{
  "parameters": {
    "sourceTable": { "type": "string" },
    "targetPath": { "type": "string" },
    "lastRun": { "type": "string" }
  }
}

Now one pipeline handles multiple tables.
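Combined with a ForEach activity, those parameters let one pipeline fan the same copy logic across many tables. In this sketch, the tableList array parameter and the item().name property are assumed:

JSON
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM @{item().name}"
          },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}

Each iteration exposes the current array element as item(), so the inner activities never hardcode a table name.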

Monitoring

ADF has built-in run monitoring in the portal, but you should also:

  • Log pipeline runs to a SQL table
  • Track row counts (source vs. destination)
  • Alert on anomalies (50% fewer rows than usual = something broke)
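Row counts come free with the Copy Activity: its output exposes rowsRead and rowsCopied. You can persist them with a Stored Procedure activity after each run (the logging procedure name here is assumed):

JSON
{
  "name": "LogPipelineRun",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "SqlServerSource", "type": "LinkedServiceReference" },
  "dependsOn": [
    { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
  ],
  "typeProperties": {
    "storedProcedureName": "etl.usp_LogRun",
    "storedProcedureParameters": {
      "PipelineName": { "value": "@pipeline().Pipeline", "type": "String" },
      "RowsRead": { "value": "@activity('ExtractCustomers').output.rowsRead", "type": "Int64" },
      "RowsCopied": { "value": "@activity('ExtractCustomers').output.rowsCopied", "type": "Int64" }
    }
  }
}

Comparing RowsRead against RowsCopied per run gives you the source-vs-destination check from the list above.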

Common Mistakes

  1. Full loads every time — Use incremental extraction with watermarks
  2. No error handling — Silent failures are worse than loud ones
  3. Ignoring costs — Data flows (Spark clusters) are expensive. Use copy activities when you can
  4. Hardcoded values — Parameterize everything
  5. No testing — Test with sample data before running on production

Cost Optimization

  • Use Self-hosted Integration Runtime for on-premises sources
  • Schedule pipelines during off-peak hours
  • Use Copy Activity instead of Data Flows for simple transformations
  • Set TTL on Data Flow clusters to avoid cold start costs
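The TTL lives on the Azure Integration Runtime that executes your data flows: within the TTL window, new data flow runs reuse the warm Spark cluster instead of cold-starting one. A sketch of that setting (runtime name and sizing are illustrative; timeToLive is in minutes):

JSON
{
  "name": "DataFlowRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}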

Next Steps

In the next article, we'll cover:

  • Delta Lake integration with ADF
  • CI/CD for ADF pipelines with Azure DevOps
  • Monitoring and alerting patterns
  • Data quality validation