BestBuy

Primary Language: Python

Best Buy - For Jorge Vasquez


Getting Started

The purpose of this mini project was to demonstrate integration of multiple data sources into an Azure Data Lake pipeline, and to visualize the result in Power BI.

Why use a Data Lake over a Data Warehouse?

  • Data lakes handle unstructured and semi-structured data types very well.
  • Data does not need to conform to a set schema, so it can be ingested quickly and easily.
  • Because data is only loosely structured in a lake, work can be very agile and changes can be made on the fly.
  • Integrating new data sources is easy; data does not need to be transformed prior to entering the lake.

First Step - Create Dummy Data Sources

  • Created two dummy data sources:
  • AWS S3 (Simple Storage Service) - holds dummy Adwords data
  • Local Python API - connects my Adwords account to the Azure Data Lake
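A rough sketch of how dummy Adwords-style TSV files could be generated before uploading to S3. The column names, value ranges, and file/directory names here are assumptions for illustration, not the project's actual schema:

```python
import csv
import random
from pathlib import Path

# Hypothetical column set -- the real dummy files' schema isn't shown in this README.
COLUMNS = ["campaign", "clicks", "impressions", "cost"]

def write_dummy_tsv(path, rows=100):
    """Write one TSV file of fake Adwords-style metrics."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(COLUMNS)
        for i in range(rows):
            clicks = random.randint(0, 500)
            writer.writerow([
                f"campaign_{i % 5}",
                clicks,
                clicks * random.randint(2, 20),   # impressions >= clicks
                round(clicks * random.uniform(0.1, 2.5), 2),
            ])

out_dir = Path("dummy_adwords")
out_dir.mkdir(exist_ok=True)
for n in range(75):  # 75 files, matching the batch copied to ADLS later
    write_dummy_tsv(out_dir / f"adwords_{n:02d}.tsv")
```

Each file could then be uploaded to an S3 bucket (e.g. with the AWS CLI or boto3) to act as the dummy source.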


Second Step - Create Azure Data Lake Store

-Create an Azure Data Lake Store (when you first create it, there's no data).

Third Step - Create an Azure Data Factory Pipeline between S3 and ADLS

-Create a data pipeline and a job process to run batch integration jobs from S3 to ADLS.

  • Here I'm transferring all 75 TSV files containing dummy Adwords data into our ADLS store.
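The copy step can be expressed as a Data Factory copy-activity definition along these lines. This is a hedged sketch of the general JSON shape: the pipeline and dataset names (`S3AdwordsTsv`, `AdlsAdwordsTsv`) are placeholders, and the real pipeline's settings aren't shown in this README:

```json
{
  "name": "CopyS3ToAdls",
  "properties": {
    "activities": [
      {
        "name": "CopyAdwordsTsv",
        "type": "Copy",
        "inputs": [ { "referenceName": "S3AdwordsTsv", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AdlsAdwordsTsv", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AmazonS3Source", "recursive": true },
          "sink": { "type": "AzureDataLakeStoreSink" }
        }
      }
    ]
  }
}
```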


  • What the dataset looks like (screenshot omitted)

Fourth Step - Create an Azure Data Lake Analytics Account

-Once the data has been loaded into ADLS, it has to be transformed before it can be visualized or used by data scientists and analysts.
-We need to re-compile all 75 TSV files into 1 TSV in order to continue with our work.
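The re-compilation itself happened in Azure Data Lake Analytics, but the idea can be sketched locally in Python; directory and file names here are illustrative:

```python
from pathlib import Path

def combine_tsv_files(src_dir, dest_path):
    """Concatenate every TSV in src_dir into one file, keeping a single header."""
    tsv_files = sorted(Path(src_dir).glob("*.tsv"))
    with open(dest_path, "w", encoding="utf-8") as out:
        for i, tsv in enumerate(tsv_files):
            with open(tsv, encoding="utf-8") as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # write the header from the first file only
                out.write(f.read())    # body rows from every file
    return len(tsv_files)
```

Calling `combine_tsv_files("adwords_parts", "adwords_all.tsv")` would merge the 75 part-files into one dataset ready for further processing.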

Writing a U-SQL Wrangling/Processing Script

-In order to transform the data at hand, we need to write a U-SQL script (a combination of SQL and .NET code).
-This helps clear out some of the missing/null values.
-It also sets the correct encoding (UTF-8).
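The actual U-SQL script isn't reproduced here, but as a rough Python equivalent of what it does (drop rows with missing values and rewrite the output as UTF-8). The specific null markers are assumptions:

```python
def clean_tsv(src_path, dest_path, null_markers=("", "NULL", "N/A")):
    """Drop rows containing missing values and rewrite the file as UTF-8."""
    kept = dropped = 0
    # errors="replace" guards against bytes that aren't valid UTF-8 in the source
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        header = src.readline()
        dest.write(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if any(f.strip() in null_markers for f in fields):
                dropped += 1          # row has a missing/null field -- skip it
            else:
                dest.write(line)
                kept += 1
    return kept, dropped
```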


Submit the job, and visualize the processing/computation.


Power BI - Take the transformed/compiled dataset and load it into Power BI


Final Thoughts

  • We weren't allowed to use our Developer Token until it was approved by Google, so we couldn't pull data from our production account.
  • Some of the data is not transformed 100% correctly, but for the purposes of this demonstration, this is a quick way to create a data pipeline that lets analysts and data scientists visualize new data sources quickly and responsively.