Data Engineering · Intermediate

Building ETL Pipelines with Azure Data Factory

Design and build production data pipelines using Azure Data Factory, from source extraction to data lake storage, with real-world patterns.

Learnixo · March 25, 2026 · 4 min read
Azure · Data Factory · ETL · Data Lake · SQL · Pipeline

Why Azure Data Factory?

When your data lives in five different places and needs to end up in one, you need an orchestration tool. Azure Data Factory (ADF) is Microsoft's cloud ETL/ELT service, and it's surprisingly powerful once you understand its patterns.

The Architecture

Source Systems          Azure Data Factory          Destinations
┌──────────┐           ┌──────────────┐           ┌──────────────┐
│  SQL DB  │──────────▶│   Pipeline   │──────────▶│  Data Lake   │
│  REST API│──────────▶│   + Dataflow │──────────▶│  SQL DW      │
│  Files   │──────────▶│              │──────────▶│  Cosmos DB   │
└──────────┘           └──────────────┘           └──────────────┘

Core Concepts

1. Linked Services

Connections to your data sources and destinations:

JSON
{
  "name": "SqlServerSource",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "@{linkedService().connectionString}"
  }
}

Best practice: Use Azure Key Vault for connection strings. Never hardcode credentials.
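Following that practice, the linked service above can reference a Key Vault secret instead of an inline connection string. A sketch (the vault linked service and secret name here are placeholders):

```json
{
  "name": "SqlServerSource",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "MyKeyVault",
        "type": "LinkedServiceReference"
      },
      "secretName": "sql-connection-string"
    }
  }
}
```

The secret is fetched at runtime, so rotating the credential in Key Vault requires no change to the factory.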

2. Datasets

Schema definitions for your data:

JSON
{
  "name": "CustomerTable",
  "type": "AzureSqlTable",
  "linkedServiceName": "SqlServerSource",
  "typeProperties": {
    "tableName": "dbo.Customers"
  }
}

3. Pipelines

The orchestration logic, where things happen in order:

  • Copy Activity: Move data from A to B
  • Data Flow: Transform data with a visual editor
  • Stored Procedure: Run SQL transformations
  • Web Activity: Call REST APIs
  • If/ForEach: Control flow
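To show how these activities chain together, here's a minimal pipeline skeleton (activity names are illustrative, typeProperties elided) where a stored procedure runs only after the copy succeeds:

```json
{
  "name": "CustomerSyncPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyCustomers",
        "type": "Copy",
        "typeProperties": {}
      },
      {
        "name": "TransformInSql",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "CopyCustomers", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {}
      }
    ]
  }
}
```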

A Real Pipeline: Customer Data Sync

Let's build a pipeline that:

  1. Extracts customer data from a SQL database
  2. Transforms it (clean, deduplicate, enrich)
  3. Loads it into a Data Lake as Parquet files

Step 1: Extract

Use a Copy Activity to pull data:

JSON
{
  "name": "ExtractCustomers",
  "type": "Copy",
  "inputs": [{ "referenceName": "SqlCustomers" }],
  "outputs": [{ "referenceName": "RawCustomerParquet" }],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM Customers WHERE ModifiedDate > @{pipeline().parameters.lastRun}"
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}

Key pattern: Incremental loads. Don't extract everything every time; use a watermark column like ModifiedDate.
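One common way to implement the watermark (the table and activity names here are hypothetical) is a Lookup activity that reads the last successful run time before the copy:

```json
{
  "name": "LookupWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(LastRun) AS lastRun FROM etl.Watermarks WHERE TableName = 'Customers'"
    },
    "dataset": { "referenceName": "WatermarkTable", "type": "DatasetReference" }
  }
}
```

Downstream, the copy query references it as @activity('LookupWatermark').output.firstRow.lastRun, and a final activity writes the new watermark back after a successful load.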

Step 2: Transform

Use a Data Flow for transformations:

  1. Source: Read from raw Parquet files
  2. Filter: Remove invalid records (null email, test accounts)
  3. Derived Column: Normalize phone numbers, standardize names
  4. Aggregate: Deduplicate by email
  5. Sink: Write to curated Data Lake zone
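In the pipeline itself, the data flow is invoked with an Execute Data Flow activity. A sketch (the data flow name and compute sizing are illustrative):

```json
{
  "name": "TransformCustomers",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataFlow": {
      "referenceName": "CleanCustomersFlow",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": 8
    }
  }
}
```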

Step 3: Load

Write the clean data to your analytics layer:

Raw Zone (Bronze)  →  Curated Zone (Silver)  →  Analytics Zone (Gold)
   ↑                       ↑                        ↑
 As-is copy           Cleaned data           Business-ready

This is the Medallion Architecture, a proven pattern for data lakes.

Production Patterns

Error Handling

Every production pipeline needs error handling:

Pipeline Start
    │
    ├── Extract ──── On Failure ──▶ Send Alert
    │      │
    │   On Success
    │      │
    ├── Transform ── On Failure ──▶ Log Error + Retry
    │      │
    │   On Success
    │      │
    └── Load ─────── On Failure ──▶ Quarantine Bad Data
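In pipeline JSON, these failure branches are expressed with dependencyConditions, and retries with the activity's policy block. A sketch (activity names and the webhook URL are illustrative):

```json
{
  "name": "SendAlert",
  "type": "WebActivity",
  "dependsOn": [
    { "activity": "Extract", "dependencyConditions": ["Failed"] }
  ],
  "typeProperties": {
    "url": "https://example.com/alert-webhook",
    "method": "POST",
    "body": {
      "pipeline": "@pipeline().Pipeline",
      "runId": "@pipeline().RunId"
    }
  }
}
```

For the retry branch, set "policy": { "retry": 2, "retryIntervalInSeconds": 60 } on the Transform activity itself rather than wiring a manual loop.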

Parameterization

Make pipelines reusable:

JSON
{
  "parameters": {
    "sourceTable": { "type": "string" },
    "targetPath": { "type": "string" },
    "lastRun": { "type": "string" }
  }
}

Now one pipeline handles multiple tables.
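To drive that one pipeline across many tables, a common approach (parameter and activity names illustrative) is a ForEach over a table list passed in as a parameter:

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.tableList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM @{item().name}"
          },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```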

Monitoring

ADF has built-in monitoring, but also:

  • Log pipeline runs to a SQL table
  • Track row counts (source vs. destination)
  • Alert on anomalies (50% fewer rows than usual = something broke)
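One way to log runs to a SQL table (the procedure and activity names are hypothetical) is a Stored Procedure activity fed from the Copy activity's output, which includes row counts:

```json
{
  "name": "LogRun",
  "type": "SqlServerStoredProcedure",
  "dependsOn": [
    { "activity": "ExtractCustomers", "dependencyConditions": ["Completed"] }
  ],
  "typeProperties": {
    "storedProcedureName": "etl.LogPipelineRun",
    "storedProcedureParameters": {
      "runId": { "value": "@pipeline().RunId", "type": "String" },
      "rowsRead": { "value": "@activity('ExtractCustomers').output.rowsRead", "type": "Int64" },
      "rowsCopied": { "value": "@activity('ExtractCustomers').output.rowsCopied", "type": "Int64" }
    }
  }
}
```

Comparing rowsRead against rowsCopied per run gives you the source-vs-destination check almost for free.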

Common Mistakes

  1. Full loads every time: use incremental extraction with watermarks
  2. No error handling: silent failures are worse than loud ones
  3. Ignoring costs: Data Flows run on Spark clusters and are expensive. Use Copy activities when you can
  4. Hardcoded values: parameterize everything
  5. No testing: test with sample data before running on production

Cost Optimization

  • Use Self-hosted Integration Runtime for on-premises sources
  • Schedule pipelines during off-peak hours
  • Use Copy Activity instead of Data Flows for simple transformations
  • Set TTL on Data Flow clusters to avoid cold start costs
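The TTL setting lives on the Azure Integration Runtime your data flows run on; a sketch of that runtime definition (sizing and TTL values are illustrative):

```json
{
  "name": "DataFlowRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```

With a TTL (in minutes), consecutive data flows reuse the warm cluster instead of paying a cold start each time.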

Next Steps

In the next article, we'll cover:

  • Delta Lake integration with ADF
  • CI/CD for ADF pipelines with Azure DevOps
  • Monitoring and alerting patterns
  • Data quality validation

Enjoyed this article?

Explore the Data Engineering learning path for more.
