AI Systems Engineering · Lesson 2 of 2
Azure Data Factory ETL
Why Azure Data Factory?
When your data lives in 5 different places and needs to end up in one, you need an orchestration tool. Azure Data Factory (ADF) is Microsoft's cloud ETL/ELT service — and it's surprisingly powerful once you understand its patterns.
The Architecture
Source Systems           Azure Data Factory          Destinations
┌──────────┐             ┌──────────────┐            ┌──────────────┐
│ SQL DB   │──────────▶  │ Pipeline     │──────────▶ │ Data Lake    │
│ REST API │──────────▶  │ + Dataflow   │──────────▶ │ SQL DW       │
│ Files    │──────────▶  │              │──────────▶ │ Cosmos DB    │
└──────────┘             └──────────────┘            └──────────────┘

Core Concepts
1. Linked Services
Connections to your data sources and destinations:
```json
{
  "name": "SqlServerSource",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "@{linkedService().connectionString}"
  }
}
```

Best practice: Use Azure Key Vault for connection strings. Never hardcode credentials.
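As a concrete sketch of that best practice, a Key Vault-backed linked service replaces the inline connection string with a secret reference. The linked service name `AzureKeyVaultLS` and secret name `sql-connection-string` below are placeholders for your own:

```json
{
  "name": "SqlServerSource",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLS",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-connection-string"
      }
    }
  }
}
```

Rotating the credential now happens in Key Vault alone; the factory picks up the new value without any redeployment.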
2. Datasets
Schema definitions for your data:
```json
{
  "name": "CustomerTable",
  "type": "AzureSqlTable",
  "linkedServiceName": "SqlServerSource",
  "typeProperties": {
    "tableName": "dbo.Customers"
  }
}
```

3. Pipelines
The orchestration logic — where things happen in order:
- Copy Activity: Move data from A to B
- Data Flow: Transform data with a visual editor
- Stored Procedure: Run SQL transformations
- Web Activity: Call REST APIs
- If/ForEach: Control flow
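To make the orchestration concrete, here is a trimmed pipeline skeleton (activity names, the alert URL, and the trimmed typeProperties are illustrative) showing how dependsOn wires activities together, including a failure branch:

```json
{
  "name": "CustomerSyncPipeline",
  "properties": {
    "activities": [
      {
        "name": "ExtractCustomers",
        "type": "Copy",
        "policy": { "retry": 2, "retryIntervalInSeconds": 60 }
      },
      {
        "name": "TransformCustomers",
        "type": "ExecuteDataFlow",
        "dependsOn": [
          { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
        ]
      },
      {
        "name": "AlertOnFailure",
        "type": "WebActivity",
        "dependsOn": [
          { "activity": "ExtractCustomers", "dependencyConditions": ["Failed"] }
        ],
        "typeProperties": { "url": "https://example.com/alert", "method": "POST" }
      }
    ]
  }
}
```

The dependencyConditions values (Succeeded, Failed, Skipped, Completed) are what turn a flat activity list into a control-flow graph.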
A Real Pipeline: Customer Data Sync
Let's build a pipeline that:
- Extracts customer data from a SQL database
- Transforms it (clean, deduplicate, enrich)
- Loads it into a Data Lake as Parquet files
Step 1: Extract
Use a Copy Activity to pull data:
```json
{
  "name": "ExtractCustomers",
  "type": "Copy",
  "inputs": [{ "referenceName": "SqlCustomers" }],
  "outputs": [{ "referenceName": "RawCustomerParquet" }],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM Customers WHERE ModifiedDate > '@{pipeline().parameters.lastRun}'"
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}
```

Note the quotes around the interpolated `lastRun` value — ADF substitutes the parameter as plain text, so a datetime literal needs them to produce valid SQL.

Key pattern: Incremental loads. Don't extract everything every time — use a watermark column like ModifiedDate.
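A common way to supply that watermark is a Lookup activity that reads the last successful run time before the copy starts. A sketch, assuming a hypothetical `dbo.WatermarkTable` and `WatermarkDataset`:

```json
{
  "name": "GetLastWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(WatermarkValue) AS lastRun FROM dbo.WatermarkTable WHERE TableName = 'Customers'"
    },
    "dataset": { "referenceName": "WatermarkDataset", "type": "DatasetReference" }
  }
}
```

Downstream activities can then reference the value as `@activity('GetLastWatermark').output.firstRow.lastRun`, and a final Stored Procedure activity advances the watermark after a successful load.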
Step 2: Transform
Use a Data Flow for transformations:
- Source: Read from raw Parquet files
- Filter: Remove invalid records (null email, test accounts)
- Derived Column: Normalize phone numbers, standardize names
- Aggregate: Deduplicate by email
- Sink: Write to curated Data Lake zone
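Inside the pipeline, that Data Flow runs via an Execute Data Flow activity. A trimmed sketch, where the flow name `CleanCustomersFlow` and the compute sizing are assumptions:

```json
{
  "name": "TransformCustomers",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": { "referenceName": "CleanCustomersFlow", "type": "DataFlowReference" },
    "compute": { "computeType": "General", "coreCount": 8 }
  }
}
```

The compute block is where Data Flow costs live: each run spins up a Spark cluster of that size, which is why the cost section below favors Copy activities for simple moves.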
Step 3: Load
Write the clean data to your analytics layer:
Raw Zone (Bronze) → Curated Zone (Silver) → Analytics Zone (Gold)
        ↑                    ↑                      ↑
    As-is copy          Cleaned data          Business-ready

This is the Medallion Architecture — a proven pattern for data lakes.
Production Patterns
Error Handling
Every production pipeline needs error handling:
Pipeline Start
    │
    ├── Extract ──── On Failure ──▶ Send Alert
    │        │
    │    On Success
    │        │
    ├── Transform ── On Failure ──▶ Log Error + Retry
    │        │
    │    On Success
    │        │
    └── Load ─────── On Failure ──▶ Quarantine Bad Data

Parameterization
Make pipelines reusable:
```json
{
  "parameters": {
    "sourceTable": { "type": "string" },
    "targetPath": { "type": "string" },
    "lastRun": { "type": "string" }
  }
}
```

Now one pipeline handles multiple tables.
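One way to drive the parameterized pipeline is a ForEach activity that iterates over a table list and calls it once per item. A sketch, assuming a `tableList` array parameter whose items carry `tableName` and `targetPath` fields:

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "activities": [
      {
        "name": "RunSync",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "CustomerSyncPipeline", "type": "PipelineReference" },
          "parameters": {
            "sourceTable": "@item().tableName",
            "targetPath": "@item().targetPath"
          }
        }
      }
    ]
  }
}
```

Adding a new table to the sync then means adding one entry to the list, not cloning a pipeline.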
Monitoring
ADF has built-in monitoring, but also:
- Log pipeline runs to a SQL table
- Track row counts (source vs. destination)
- Alert on anomalies (50% fewer rows than usual = something broke)
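Logging those row counts can itself be an activity at the end of the pipeline. A sketch using a Stored Procedure activity and a hypothetical `dbo.LogPipelineRun` procedure; `rowsRead` and `rowsCopied` come from the Copy activity's output:

```json
{
  "name": "LogRunMetrics",
  "type": "SqlServerStoredProcedure",
  "typeProperties": {
    "storedProcedureName": "dbo.LogPipelineRun",
    "storedProcedureParameters": {
      "pipelineName": { "value": "@{pipeline().Pipeline}", "type": "String" },
      "rowsRead": { "value": "@{activity('ExtractCustomers').output.rowsRead}", "type": "Int64" },
      "rowsCopied": { "value": "@{activity('ExtractCustomers').output.rowsCopied}", "type": "Int64" }
    }
  }
}
```

With a few weeks of history in that table, the "50% fewer rows than usual" alert becomes a simple query.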
Common Mistakes
- Full loads every time — Use incremental extraction with watermarks
- No error handling — Silent failures are worse than loud ones
- Ignoring costs — Data flows (Spark clusters) are expensive. Use copy activities when you can
- Hardcoded values — Parameterize everything
- No testing — Test with sample data before running on production
Cost Optimization
- Use Self-hosted Integration Runtime for on-premises sources
- Schedule pipelines during off-peak hours
- Use Copy Activity instead of Data Flows for simple transformations
- Set TTL on Data Flow clusters to avoid cold start costs
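The TTL setting lives on an Azure Integration Runtime configured for Data Flows. A sketch of such a runtime definition (name and sizing are assumptions; timeToLive is in minutes):

```json
{
  "name": "DataFlowIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```

With a TTL set, back-to-back Data Flow runs reuse the warm cluster instead of paying the several-minute Spark startup each time.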
Next Steps
In the next article, we'll cover:
- Delta Lake integration with ADF
- CI/CD for ADF pipelines with Azure DevOps
- Monitoring and alerting patterns
- Data quality validation