/data-engineering-zoomcamp-2023

This Repo contains weekly learnings for data-engineering-zoomcamp 2023 course.

Primary LanguageJupyter Notebook

data-engineering-zoomcamp-2023

This Repo contains weekly learnings for data-engineering-zoomcamp 2023 course.

Week 1

Day 1

  • Introduction to the course
  • Introduction to the Docker
  • How to use Docker ENTRYPOINT for command line arguments
docker build -t test .
docker run -it test 2023-01-01

Day 2

  • setting up the docker service for postgres
  • Ruuning the postgres service locally using docker
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-p 5432:5432 \
-v /Users/mahesh.sinha/Desktop/personal/Repos/data-engineering-zoomcamp-2023/week-1-env-setup/2_docker_sql/ny_taxi_pgadmin_data:/var/lib/postgresql/data \
postgres:13.3
  • Connecting to the postgres service using pgcli
pgcli -h localhost -U root -p 5432 -d ny_taxi
  • Listing the tables in the database
\dt
  • Count the number of rows in the table
SELECT COUNT(*) FROM yellow_taxi_data;
  • Created upload_data.py script to download from web and upload the data to the postgres database.

Day 3

  • Coverted the upload_data.ipynb to injest_data.py with command line arguments
python ingest_data.py \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
  • Created a dockerfile to build the ingest-data.py script
docker build -t ingest-data:v001 .
  • Ran the injest_data.py script using docker
docker run -it ingest_data:v001 \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
  • Created docker network to connect the postgres and injest_data containers
docker network create ny_taxi_network
  • Ran the postgres container in the network
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-p 5432:5432 \
-v ny_taxi_postgres_data:/var/lib/postgresql/data \
--network ny_taxi_network \
--name postgres-db \
postgres:13.3
  • Ran the injest_data container in the network
docker run -it \
--network ny_taxi_network \
ingest-data:v001 \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="postgres-db" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
  • Created a docker-compose.yml file to run the postgres and pgadmin containers
docker-compose up
  • Run the injest_data container:
python ingest_data.py \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
  • Run the pgadmin container:
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 5050:80 \
dpage/pgadmin4

Day 4

  • Created taxi_zones table in the ny_taxi database from upload_data.ipynb notebook
  • Ran some queries on the yellow_taxi_trips and taxi_zones tables
SELECT COUNT(*) FROM yellow_taxi_trips;

SELECT COUNT(*) FROM taxi_zones;

SELECT 
tpep_pickup_datetime,
tpep_dropoff_datetime,
fare_amount,
CONCAT(zpl_id."Borough", ' / ', zpl_id."Zone") as PickupLocation,
CONCAT(zdl_id."Borough", ' / ', zdl_id."Zone") as DropoffLocation
FROM 
yellow_taxi_trips as t,
taxi_zones as zpl_id,
taxi_zones as zdl_id
WHERE 
t."PULocationID" = zpl_id."LocationID" AND
t."DOLocationID" = zdl_id."LocationID"
LIMIT 100;

SELECT 
CAST(t."tpep_pickup_datetime" as DATE) AS PickupDate,
COUNT(1) as count
FROM 
yellow_taxi_trips as t
GROUP BY (CAST(t."tpep_pickup_datetime" as DATE))
ORDER BY count DESC
LIMIT 100
  • Created GCP free tier account and created a new project.
  • Created a new service account and downloaded the json key file.
  • Edit the service account permissions to give it access to the GCP resources like BigQuery-Admin, Storage-Admin, StorageObject-Admin.
  • Download the gcloud sdk and installed it.
brew install --cask google-cloud-sdk
  • Setup environment variables for the service account
export GOOGLE_APPLICATION_CREDENTIALS="<path-to-service-account-json-file>"
  • Authenticated the gcloud sdk
gcloud auth application-default login
  • Download terraform and installed it.
brew install terraform
  • Created a terraform.tfvars file.
  • Created a main.tf file to create the GCP resources.
  • CReated a variables.tf file to define the variables.
  • Ran the terraform init command to initialize the terraform.
terraform init
  • Ran the terraform plan command to see the changes that will be made.
terraform plan
  • Ran the terraform apply command to create the resources.
terraform apply

Day 5

  • Created gcp-de-2023 ssh key pair
ssh-keygen -t rsa -f ~/.ssh/gcp-de-2023 -C i.mahesh.sinha -b 2048
  • Added the public key to the GCP Compute Engine Metadata
  • Created a new GCP Compute Engine instance from the GCP console.
  • ssh into the instance
ssh -i ~/.ssh/gcp-de-2023 i.mahesh.sinha@<instance-ip>
  • Download the Anaconda installer
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
  • Updated the .ssh/config file to add the GCP instance
Host de-zoomcamp-2023
  HostName <INSTANCE-IP>
  User <USERNAME>
  IdentityFile <PATH-TO-SSH-KEY>
  • Install docker
sudo apt-get update
sudo apt-get install docker.io
  • Enable docker without sudo
sudo groupadd docker
sudo gpasswd -a $USER docker
<REBOOT THE INSTANCE>
  • Install docker-compose
mkdir bin
cd bin
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose  # make it executable
  • Add below line in .bashrc file, so that we can run docker-compose without specifying the full path
export PATH="${HOME}/bin:${PATH}"
  • Clone the repo
git clone <REPO URL>
  • Run the docker-compose file
docker-compose up
  • Install pgcli
conda install -c conda-forge pgcli
pip install -U mycli

Day 6

  • Download terraform on the GCP instance
wget https://releases.hashicorp.com/terraform/1.3.7/terraform_1.3.7_linux_amd64.zip
unzip terraform_1.3.7_linux_amd64.zip
terraform --version
  • Copy the GCP service account json file to the GCP instance using sftp
sftp de-zoomcamp-2023
mkdir .gc
put <LOCAL-PATH-TO-SERVICE-ACCOUNT-JSON-FILE> .gc
  • Configure the GCP credentials using the service account json file
export GOOGLE_APPLICATION_CREDENTIALS="<PATH-TO-SERVICE-ACCOUNT-JSON-FILE>"
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
  • Copied the terraform state file from local to GCP instance
sftp de-zoomcamp-2023
put <LOCAL-PATH-TO-STATE-FILE> terraform.tfstate
  • Ran the terraform init command to initialize the terraform.
terraform init
  • Ran the terraform plan command to see the changes that will be made.
terraform plan
  • Home work Week-1 NOTEBOOK URL: week-1-env-setup/Homework-Solutions Week-1.ipynb