Building ETL Pipelines with Azure Data Factory
Design and build production data pipelines using Azure Data Factory, from source extraction to data lake storage, with real-world patterns.
Why Azure Data Factory?
When your data lives in 5 different places and needs to end up in one, you need an orchestration tool. Azure Data Factory (ADF) is Microsoft's cloud ETL/ELT service, and it's surprisingly powerful once you understand its patterns.
The Architecture
Source Systems          Azure Data Factory          Destinations
┌────────────┐          ┌────────────────┐          ┌────────────────┐
│  SQL DB    │─────────▶│   Pipeline     │─────────▶│  Data Lake     │
│  REST API  │─────────▶│   + Dataflow   │─────────▶│  SQL DW        │
│  Files     │─────────▶│                │─────────▶│  Cosmos DB     │
└────────────┘          └────────────────┘          └────────────────┘

Core Concepts
1. Linked Services
Connections to your data sources and destinations:
```json
{
  "name": "SqlServerSource",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "@{linkedService().connectionString}"
  }
}
```

Best practice: Use Azure Key Vault for connection strings. Never hardcode credentials.
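For reference, a linked service that pulls its connection string from Key Vault looks like this. The Key Vault linked service name (`AzureKeyVaultLS`) and secret name are placeholders for your own:

```json
{
  "name": "SqlServerSource",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "AzureKeyVaultLS",
        "type": "LinkedServiceReference"
      },
      "secretName": "sql-connection-string"
    }
  }
}
```

The secret is resolved at runtime, so credentials never appear in the factory definition or source control.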
2. Datasets
Schema definitions for your data:
```json
{
  "name": "CustomerTable",
  "type": "AzureSqlTable",
  "linkedServiceName": "SqlServerSource",
  "typeProperties": {
    "tableName": "dbo.Customers"
  }
}
```

3. Pipelines
The orchestration logic, where activities run in order:
- Copy Activity: Move data from A to B
- Data Flow: Transform data with a visual editor
- Stored Procedure: Run SQL transformations
- Web Activity: Call REST APIs
- If/ForEach: Control flow
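Putting two of those activities together, a pipeline chains them with `dependsOn`. This sketch is trimmed to the ordering logic; the activity and stored procedure names are illustrative:

```json
{
  "name": "CustomerSync",
  "properties": {
    "activities": [
      {
        "name": "ExtractCustomers",
        "type": "Copy"
      },
      {
        "name": "CleanCustomers",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "storedProcedureName": "dbo.usp_CleanCustomers"
        }
      }
    ]
  }
}
```

Each activity lists its upstream dependencies, so ADF builds the execution graph for you.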
A Real Pipeline: Customer Data Sync
Let's build a pipeline that:
- Extracts customer data from a SQL database
- Transforms it (clean, deduplicate, enrich)
- Loads it into a Data Lake as Parquet files
Step 1: Extract
Use a Copy Activity to pull data:
```json
{
  "name": "ExtractCustomers",
  "type": "Copy",
  "inputs": [{ "referenceName": "SqlCustomers", "type": "DatasetReference" }],
  "outputs": [{ "referenceName": "RawCustomerParquet", "type": "DatasetReference" }],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM Customers WHERE ModifiedDate > '@{pipeline().parameters.lastRun}'"
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}
```

Key pattern: incremental loads. Don't extract everything every time; use a watermark column like ModifiedDate.
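One common way to feed that watermark is a Lookup activity that reads the last successful run time from a small control table you maintain yourself (`etl.Watermarks` here is illustrative):

```json
{
  "name": "LookupWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(LastRun) AS LastRun FROM etl.Watermarks WHERE TableName = 'Customers'"
    }
  }
}
```

Downstream activities then reference it as `@{activity('LookupWatermark').output.firstRow.LastRun}`, and a final activity updates the control table after a successful load.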
Step 2: Transform
Use a Data Flow for transformations:
- Source: Read from raw Parquet files
- Filter: Remove invalid records (null email, test accounts)
- Derived Column: Normalize phone numbers, standardize names
- Aggregate: Deduplicate by email
- Sink: Write to curated Data Lake zone
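For reference, the middle steps above expressed in data flow script (the DSL behind the visual editor) look roughly like this; the column names and the `test` filter are illustrative, not part of the original pipeline:

```
RawCustomers filter(!isNull(email) && !like(email, '%test%')) ~> FilterInvalid
FilterInvalid derive(phone = regexReplace(phone, '[^0-9]', '')) ~> NormalizePhone
NormalizePhone aggregate(groupBy(email), modifiedDate = max(modifiedDate)) ~> DedupeByEmail
```

The visual editor generates this script for you, but it's worth reading when reviewing data flows in source control.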
Step 3: Load
Write the clean data to your analytics layer:
Raw Zone (Bronze) ──▶ Curated Zone (Silver) ──▶ Analytics Zone (Gold)
        │                      │                        │
    As-is copy            Cleaned data            Business-ready

This is the Medallion Architecture, a proven pattern for data lakes.
Production Patterns
Error Handling
Every production pipeline needs error handling:
Pipeline Start
     │
     ├── Extract ───── On Failure ───▶ Send Alert
     │       │
     │   On Success
     │       │
     ├── Transform ─── On Failure ───▶ Log Error + Retry
     │       │
     │   On Success
     │       │
     └── Load ──────── On Failure ───▶ Quarantine Bad Data

Parameterization
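The failure branches in the diagram are expressed with dependency conditions. For example, an alert activity that runs only when the extract fails might look like this (the webhook URL and activity names are illustrative):

```json
{
  "name": "SendAlert",
  "type": "WebActivity",
  "dependsOn": [
    { "activity": "ExtractCustomers", "dependencyConditions": ["Failed"] }
  ],
  "typeProperties": {
    "url": "https://example.com/alert-webhook",
    "method": "POST",
    "body": { "message": "ExtractCustomers failed" }
  }
}
```

`Succeeded`, `Failed`, `Skipped`, and `Completed` conditions let you branch on any outcome without a separate If activity.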
Make pipelines reusable:
```json
{
  "parameters": {
    "sourceTable": { "type": "string" },
    "targetPath": { "type": "string" },
    "lastRun": { "type": "string" }
  }
}
```

Now one pipeline handles multiple tables.
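Inside the pipeline, those parameters are referenced through expressions. A sketch of a Copy Activity source using them:

```json
{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM @{pipeline().parameters.sourceTable} WHERE ModifiedDate > '@{pipeline().parameters.lastRun}'"
  }
}
```

A ForEach activity over a list of table names can then invoke this one pipeline per table via an Execute Pipeline activity.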
Monitoring
ADF has built-in monitoring, but also:
- Log pipeline runs to a SQL table
- Track row counts (source vs. destination)
- Alert on anomalies (50% fewer rows than usual = something broke)
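One lightweight way to log runs, assuming a logging stored procedure (`etl.usp_LogPipelineRun` here is your own, not built-in), is a Stored Procedure activity that captures the Copy Activity's output:

```json
{
  "name": "LogRun",
  "type": "SqlServerStoredProcedure",
  "dependsOn": [
    { "activity": "ExtractCustomers", "dependencyConditions": ["Succeeded"] }
  ],
  "typeProperties": {
    "storedProcedureName": "etl.usp_LogPipelineRun",
    "storedProcedureParameters": {
      "PipelineName": { "value": "@{pipeline().Pipeline}", "type": "String" },
      "RowsCopied": { "value": "@{activity('ExtractCustomers').output.rowsCopied}", "type": "Int64" }
    }
  }
}
```

With row counts in a SQL table, the anomaly check above becomes a simple query comparing today's count against a trailing average.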
Common Mistakes
- Full loads every time → Use incremental extraction with watermarks
- No error handling → Silent failures are worse than loud ones
- Ignoring costs → Data flows (Spark clusters) are expensive. Use copy activities when you can
- Hardcoded values → Parameterize everything
- No testing → Test with sample data before running on production
Cost Optimization
- Use Self-hosted Integration Runtime for on-premises sources
- Schedule pipelines during off-peak hours
- Use Copy Activity instead of Data Flows for simple transformations
- Set TTL on Data Flow clusters to avoid cold start costs
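The TTL is set on the Azure Integration Runtime that runs your data flows. A rough sketch of the definition (core count and TTL values are examples to tune, not recommendations):

```json
{
  "name": "DataFlowIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```

With a TTL, consecutive data flow runs reuse the warm cluster instead of paying the several-minute Spark startup each time.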
Next Steps
In the next article, we'll cover:
- Delta Lake integration with ADF
- CI/CD for ADF pipelines with Azure DevOps
- Monitoring and alerting patterns
- Data quality validation