This week we learned more about Data Warehouses & BigQuery.
- A Data Warehouse is a central repository that stores data from various sources.
- Data might be transformed before it gets loaded into the warehouse (ETL).
- In other cases, the data is loaded first without being transformed and kept inside the Data Warehouse (in a staging area). The transformation then also happens inside the Data Warehouse, after which the data is loaded into its final tables (ELT).
- BigQuery (BQ) is a serverless Data Warehouse, managed by Google.
- BQ supports all data types and works across clouds.
- BQ also supports standard SQL queries, which is what we will be working with today.
We will put knowledge from previous weeks into action.
Tasks of week 3 are:
- Collect data from sources
- Store data in Google Cloud Storage
- Work with data in BigQuery (create an external table & a native table, compare partitioned vs non-partitioned data, and observe query performance)
- Query data with SQL
This week we are free to use any tools/methods we like.
Since I've been working with the Prefect orchestration tool in the past weeks, I chose to continue with Prefect because it's simple yet effective.
By using Prefect, we can create a workflow that completes 2 tasks in one run:
- Data collection
- Upload data into Google Cloud
Steps:
- Create a Python file, I named it "data_to_gcs.py", containing the following tasks and flows:
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3)
def fetch(dataset_url: str) -> pd.DataFrame:
    """Read taxi data from the web into a pandas DataFrame"""
    df = pd.read_csv(dataset_url)
    return df


@task()
def write_local(df: pd.DataFrame, dataset_file: str) -> Path:
    """Write the DataFrame out locally as a csv file"""
    path = Path(f"data/fhv/{dataset_file}.csv")
    path.parent.mkdir(parents=True, exist_ok=True)  # make sure data/fhv exists
    df.to_csv(path, index=False)  # plain csv so the content matches the .csv extension
    return path


@task()
def write_gcs(path: Path) -> None:
    """Upload the local csv file to GCS"""
    gcs_block = GcsBucket.load("zoom-gcs")
    gcs_block.upload_from_path(from_path=path, to_path=path, timeout=120)


@flow()
def web_to_gcs(year: int, month: int) -> None:
    """Fetch one month of FHV data, save it locally, and upload it to GCS"""
    dataset_file = f"fhv_tripdata_{year}-{month:02}"
    dataset_url = f"https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/{dataset_file}.csv.gz"

    df = fetch(dataset_url)
    path = write_local(df, dataset_file)
    write_gcs(path)


@flow()
def etl_parent_flow(months: list[int] = [1, 2], year: int = 2019):
    """Run web_to_gcs once per requested month"""
    for month in months:
        web_to_gcs(year, month)


if __name__ == "__main__":
    months = list(range(1, 13))
    year = 2019
    etl_parent_flow(months, year)
- Run data_to_gcs.py (python data_to_gcs.py from the terminal)
- Now all the data is stored locally & on GCS
After the data is stored in GCS, you can work with it using BigQuery.
Steps:
- Create dataset
- Create table (2 options: Native table and External table)
- Load data into the table
- Instead of creating a table and storing all the data inside BigQuery, we can create an external table.
- An external table is only a table definition (which you use to query the data); the data itself is stored outside BigQuery, in this case in GCS.
- Even though the data is not inside BigQuery, we can still query it with the usual tools, e.g. SQL. This concept is called a "federated query".
- A native table is a table that stores all its data inside BigQuery.
- It is more expensive than an external table (you pay for storage inside BigQuery), but queries against it generally perform better.
- Check the region you've set on your bucket. In the terminal, run:
gsutil ls -Lb gs://<your bucket name>
Back in the Google Cloud console, go to BigQuery.
- Click ⋮ (the three dots next to your project name) and select Create dataset
- Set the dataset name & region
- Click Create Dataset
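If you prefer SQL over the UI, the dataset can also be created with a DDL statement in the query editor. A minimal sketch, where the project/dataset names are placeholders and the location should match your bucket's region:
-- Create the dataset with DDL instead of the UI
CREATE SCHEMA IF NOT EXISTS `<project name>.<dataset name>`
OPTIONS (location = '<your region>');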
In the BigQuery console:
- Click the + sign ("Compose new query")
- Run the query below to create an external table
-- Creating external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `<project name>.<dataset name>.<table name>`
OPTIONS (
format = 'CSV',
uris = ['gs://<bucket name>/path/to/csv file/<file name>.csv']
);
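Even though the files still live in GCS, you can now run a federated query against this table definition. A quick sanity check, using the same placeholder table name as above:
-- Quick sanity check: count rows read straight from the files in GCS
SELECT COUNT(*) AS row_count
FROM `<project name>.<dataset name>.<table name>`;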
To create a native table via the UI (an SQL alternative is sketched after these steps):
- Click + ADD DATA
- Select Google Cloud Storage as the source
- Browse for the file you want to work on, or write a path with a wildcard as follows to gather multiple files into one table:
<your bucket name>/path/to/file/fhv_tripdata_2019-*.csv
- Define the dataset name and table name
- Select table type: Native table
- Click Create table
Now you have your tables & can proceed to answer the questions in the homework!
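As the SQL alternative mentioned above, a native table can also be materialized directly from the external table with a CREATE TABLE ... AS SELECT statement. A sketch, assuming the external table from the previous step; the native table name here is just an example:
-- Materialize the external table into a native (BigQuery-managed) table
CREATE OR REPLACE TABLE `<project name>.<dataset name>.fhv_tripdata_native` AS
SELECT * FROM `<project name>.<dataset name>.<table name>`;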
Once you have your dataset & tables, you can use SQL to query the data.
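For the partitioned vs non-partitioned comparison listed in the week 3 tasks, here is a rough sketch. It assumes the native table from the sketch above and a pickup_datetime column (check the actual column name in your table): it creates a copy partitioned by day, and the same filter can then be run against both tables to compare the bytes processed.
-- Create a partitioned copy of the native table (assumes a pickup_datetime column)
CREATE OR REPLACE TABLE `<project name>.<dataset name>.fhv_tripdata_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `<project name>.<dataset name>.fhv_tripdata_native`;

-- Run the same query against the partitioned and the non-partitioned table and compare bytes processed
SELECT COUNT(*) AS trips
FROM `<project name>.<dataset name>.fhv_tripdata_partitioned`
WHERE DATE(pickup_datetime) BETWEEN '2019-01-01' AND '2019-01-31';
BigQuery shows the estimated bytes in the query validator before you run the query; for a single-month filter, the partitioned table should scan noticeably less data.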
Referring to Homework Week 3, here are links to some useful queries:
==========Thank You==========