Data-Filtering-Pipeline-ETL

This project extracts supply chain data from a CSV file (180k records, more than 40 columns) stored in an Azure Data Lake Storage Gen2 account, analyzes it with Python (pandas) to find the top 3 countries, filters the data for those countries, and writes the result back to the data lake as 3 files using an ETL pipeline built in Azure Data Factory (ADF).

Primary language: Jupyter Notebook


Problem Statement

I was given a CSV file with 180k rows. First I had to find the top 3 countries by the Order Country column, which I did with Python pandas. Estados Unidos, Francia and Mexico came out as the 3 countries with the most records. My task was then to extract the data from the data lake, filter it by those countries, and send it back to the data lake as 3 separate files, one per country. The problem I faced in the pipeline was that the data was raw and contained special characters, so the filter was not working correctly. Changing the encoding from UTF-8 to ISO-8859-1 in Azure Data Factory resolved the problem.
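The analysis step described above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `top_countries` and `split_by_country`, the `Order Id` column, and the small in-memory sample (including the accented spelling "México") are assumptions made for the example. Only the `Order Country` column name and the ISO-8859-1 encoding come from the description above.

```python
import io

import pandas as pd


def top_countries(df, column="Order Country", n=3):
    """Return the n most frequent values in `column`, most frequent first."""
    return df[column].value_counts().head(n).index.tolist()


def split_by_country(df, countries, column="Order Country"):
    """Map each country to the rows whose `column` equals that country."""
    return {country: df[df[column] == country] for country in countries}


# Hypothetical stand-in for the real 180k-row file, encoded as ISO-8859-1
# so that the accented character is stored the way the raw data stores it.
sample_bytes = (
    "Order Id,Order Country\n"
    "1,Estados Unidos\n2,Estados Unidos\n3,Estados Unidos\n4,Estados Unidos\n"
    "5,Francia\n6,Francia\n7,Francia\n"
    "8,México\n9,México\n"
    "10,Alemania\n"
).encode("iso-8859-1")

# Reading with the matching encoding keeps special characters intact.
df = pd.read_csv(io.BytesIO(sample_bytes), encoding="ISO-8859-1")

top3 = top_countries(df)
parts = split_by_country(df, top3)

for country, subset in parts.items():
    print(country, len(subset))
```

Each subset could then be written out with `subset.to_csv(...)` using whatever file names suit the target folder. In this project the actual filtering and writing is done by the ADF pipeline, so the sketch only mirrors the logic locally.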

The files

The Second Portfolio Project.json file contains information about the ADF pipeline, including the pipeline name, description, and the resources that make up the pipeline. The manifest.json file describes the dependencies and structure of the pipeline's ARM template in Azure Data Factory. The Jupyter notebook file contains the data analysis in Python.

Workflow

[Image: workflow diagram]

Pipeline Structure

[Image: screenshot of the pipeline structure]

Final Filtered Data

[Image: screenshot of the final filtered data]