This project outlines a data pipeline using Azure Data Factory (ADF) and Databricks for data ingestion, transformation, and processing across different storage layers: Raw, Bronze, Silver, and Gold.
Create a Resource Group:
- Navigate to the Azure Portal.
- Create a new resource group for organizing resources.
Create a Storage Account:
- Under the resource group, create an Azure Data Lake Storage (ADLS) account.
- Enable the Hierarchical Namespace to support directories.
Create Storage Containers:
- Create the following containers: `raw`, `bronze`, `silver`, `gold` (a scripted alternative is sketched below).
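If you prefer to script this step instead of clicking through the portal, a minimal sketch using the `azure-storage-file-datalake` SDK could look like the following; the storage account name (`datalakedev`) is a placeholder, not a value from this project.

```python
# Sketch only: create the four layer containers programmatically.
# The storage account name is an assumption; authentication uses your Azure login.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://datalakedev.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

for container in ["raw", "bronze", "silver", "gold"]:
    # In ADLS Gen2 a container is a "file system"
    service.create_file_system(file_system=container)
```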
Create an Azure Data Factory Resource:
- Set up an ADF instance within the resource group.
Set Up Linked Services:
- HTTP Linked Service:
  - Configure it to extract raw data from an external source.
- Data Lake Linked Service:
  - Configure it to store the extracted raw data.
Create an ADF Pipeline:
- Copy Activity:
  - Define the source and sink connections.
  - Use the HTTP linked service for data extraction.
  - Use the Data Lake linked service to store the data in the Bronze layer.
  - Configure a dynamic parameter for the file name in the relative URL.
  - Configure dynamic parameters (`folder_name` and `file_name`) for Data Lake storage (see the sketch below).
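To make the parameterization concrete, the fragment below is a minimal, illustrative sketch of the two parameterized datasets, written as Python dictionaries that mirror the JSON ADF generates; the dataset and parameter names are assumptions rather than exports from this pipeline.

```python
# Illustrative only: simplified fragments of the two parameterized datasets.
# Dataset and parameter names (http_source_ds, bronze_sink_ds, file_name,
# folder_name) are assumptions for this sketch.

http_source_ds = {
    "type": "DelimitedText",
    "typeProperties": {
        "location": {
            "type": "HttpServerLocation",
            # file_name is supplied at run time and forms the relative URL
            "relativeUrl": {"value": "@dataset().file_name", "type": "Expression"},
        }
    },
    "parameters": {"file_name": {"type": "string"}},
}

bronze_sink_ds = {
    "type": "DelimitedText",
    "typeProperties": {
        "location": {
            "type": "AzureBlobFSLocation",
            "fileSystem": "bronze",
            # folder_name/file_name decide where the copied file lands
            "folderPath": {"value": "@dataset().folder_name", "type": "Expression"},
            "fileName": {"value": "@dataset().file_name", "type": "Expression"},
        }
    },
    "parameters": {"folder_name": {"type": "string"}, "file_name": {"type": "string"}},
}
```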
Implement Iteration and Validation:
- ForEach Activity:
  - Create an array of folder and file names as pipeline parameters.
  - Move the `Copy Activity` inside the `ForEach` loop.
  - Retrieve folder and file names dynamically from the loop (see the trigger sketch below).
- Validation and Metadata Extraction:
  - Add a `Validation Activity`, a `Web Activity` to fetch metadata, and a `Set Variable Activity` to store the metadata.
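For testing, the pipeline and its array parameter can be triggered from Python with `azure-identity` and `azure-mgmt-datafactory`, as in the sketch below; the resource group, factory, pipeline, parameter, and folder names are placeholders.

```python
# Minimal sketch: trigger the ADF pipeline with an array of folder/file pairs.
# All names below (rg-data-pipeline, adf-netflix, pl_copy_bronze, p_files) are
# assumptions for illustration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# The ForEach activity iterates over this array; each item exposes
# @item().folder_name and @item().file_name to the inner Copy Activity.
files = [
    {"folder_name": "netflix_cast", "file_name": "netflix_cast.csv"},
    {"folder_name": "netflix_category", "file_name": "netflix_category.csv"},
]

run = adf_client.pipelines.create_run(
    resource_group_name="rg-data-pipeline",
    factory_name="adf-netflix",
    pipeline_name="pl_copy_bronze",
    parameters={"p_files": files},
)
print("Started pipeline run:", run.run_id)
```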
Create a Databricks Account:
- Use a Premium trial account for the setup.
Connect to Admin Console:
- Provide the Microsoft Entra ID for authentication.
Create a Metastore:
- Only one Metastore per region is allowed.
Set Up Access Connector:
- Connect Databricks with Data Lake storage.
- Assign an IAM role (typically Storage Blob Data Contributor) on the Storage Account.
- Add the access connector as a member of that role assignment.
Connect to Azure Databricks Workspace.
Create a Catalog, Schemas, and External Storage for all the layers:
- Attach credentials using the Access Connector Resource ID (a minimal setup sketch follows).
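A minimal sketch of this step from a Databricks notebook, assuming a storage credential named `adls_cred` has already been created from the Access Connector Resource ID and that the storage account is called `datalakedev`; the catalog, schema, and location names are also placeholders.

```python
# Sketch only: catalog, schema, and external location creation via SQL from a
# notebook (spark is predefined in Databricks). Names (netflix_catalog,
# adls_cred, datalakedev) are illustrative placeholders.
layers = ["raw", "bronze", "silver", "gold"]

spark.sql("CREATE CATALOG IF NOT EXISTS netflix_catalog")

for layer in layers:
    # One schema per layer inside the catalog
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS netflix_catalog.{layer}")
    # One external location per container, secured by the storage credential
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext
        URL 'abfss://{layer}@datalakedev.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL adls_cred)
    """)
```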
Create an Autoloader Notebook:
- Configure it to incrementally read the `netflix_titles` data from the Raw layer as a cloud file stream and write it to the Bronze layer (a minimal sketch follows).
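A minimal Auto Loader sketch for that notebook, assuming CSV input and the placeholder paths and table names shown; adjust them to the actual account and dataset.

```python
# Sketch only: Auto Loader stream from the raw container to a bronze Delta table.
# Paths, table names, and the CSV format are assumptions for illustration.
raw_path = "abfss://raw@datalakedev.dfs.core.windows.net/netflix_titles"
checkpoint = "abfss://bronze@datalakedev.dfs.core.windows.net/_checkpoints/netflix_titles"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # Infer and persist the schema; new columns are picked up on evolution
    .option("cloudFiles.schemaLocation", checkpoint)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("header", "true")
    .load(raw_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)  # process what is available, then stop
    .toTable("netflix_catalog.bronze.netflix_titles")
)
```

Using `availableNow` keeps the read incremental while letting the job finish instead of running a continuous stream.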
Implement Silver Layer Transformation:
- Create a Silver notebook to transform the other datasets (those landed in the Bronze layer by the ADF ForEach pipeline) and write them to the Silver layer.
- Create a Lookup Array Notebook to dynamically fetch source and target folder names using `dbutils.widgets` and `taskValues` (see the sketch below).
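A rough sketch of how the lookup and Silver notebooks might exchange values; the widget names, task-value key, folder names, and CSV input format are all assumptions.

```python
# --- Lookup Array Notebook (sketch) -----------------------------------------
# Publishes the folder list so downstream tasks can iterate over it.
folders = [
    {"src_folder": "netflix_cast", "tgt_folder": "netflix_cast"},
    {"src_folder": "netflix_category", "tgt_folder": "netflix_category"},
]
dbutils.jobs.taskValues.set(key="folders", value=folders)

# --- Silver notebook (sketch) ------------------------------------------------
# Receives one folder pair per iteration via job parameters / widgets.
dbutils.widgets.text("src_folder", "")
dbutils.widgets.text("tgt_folder", "")
src = dbutils.widgets.get("src_folder")
tgt = dbutils.widgets.get("tgt_folder")

df = (
    spark.read.option("header", "true")
    .csv(f"abfss://bronze@datalakedev.dfs.core.windows.net/{src}")
)
# ...apply cleansing/transformations here...
df.write.format("delta").mode("overwrite").save(
    f"abfss://silver@datalakedev.dfs.core.windows.net/{tgt}"
)
```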
Set Up Workflows in Databricks:
- Create an Iterative Task for the Silver Notebook.
- Use `taskValues` for iteration and parameter passing.
- Implement a Lookup Task to pass folder names to the `silver_dim` notebooks.
- Define a workflow to move `netflix_titles` from the Bronze layer to the Silver layer, with a condition that executes only if `workDay == 7` (see the sketch below).
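One way to drive that `workDay == 7` condition is a small lookup task that publishes the weekday as a task value, which the workflow's If/else condition then compares against 7; the task and key names below are assumptions.

```python
# Sketch only: a tiny "weekday lookup" task feeding the If/else condition.
# The condition in the job would then reference something like
# {{tasks.weekday_lookup.values.workDay}} == 7 (names are assumptions).
from datetime import datetime

# isoweekday(): Monday = 1 ... Sunday = 7
work_day = datetime.now().isoweekday()
dbutils.jobs.taskValues.set(key="workDay", value=work_day)
```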
Create a Delta Live Tables (DLT) Pipeline:
- Use Job Clusters instead of All-Purpose Clusters.
- Define conditions and rules (expectations) for Gold layer processing using Delta Live Tables features (a minimal sketch follows).
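A minimal Delta Live Tables sketch for a Gold-layer table with one expectation, assuming the Silver table created earlier; the table, column, and rule names are placeholders.

```python
# Sketch only: a DLT definition for a Gold-layer table with a data-quality rule.
# Table and column names (netflix_catalog.silver.netflix_titles, show_id) are
# assumptions for illustration.
import dlt


@dlt.table(name="gold_netflix_titles", comment="Curated titles for the Gold layer")
@dlt.expect_or_drop("valid_show_id", "show_id IS NOT NULL")
def gold_netflix_titles():
    # Read the Silver table produced earlier and publish it to the Gold layer,
    # dropping rows that fail the expectation above.
    return spark.read.table("netflix_catalog.silver.netflix_titles")
```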
Additional recommendations:
- Study Autoloader schema evolution to handle schema changes dynamically.
- Ensure access permissions are correctly assigned in IAM for Data Lake Storage.
- Optimize workflows to handle large-scale data efficiently.
This project establishes an end-to-end pipeline for data ingestion, processing, and transformation using Azure Data Factory and Databricks. By leveraging dynamic parameterization, automated workflows, and Delta Live Tables, it ensures efficient data processing across different storage layers.