data-engineering-zoomcamp-2023
This Repo contains weekly learnings for data-engineering-zoomcamp 2023 course.
Week 1
Day 1
- Introduction to the course
- Introduction to the Docker
- How to use Docker ENTRYPOINT for command line arguments
docker build -t test .
docker run -it test 2023-01-01
Day 2
- setting up the docker service for postgres
- Ruuning the postgres service locally using docker
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-p 5432:5432 \
-v /Users/mahesh.sinha/Desktop/personal/Repos/data-engineering-zoomcamp-2023/week-1-env-setup/2_docker_sql/ny_taxi_pgadmin_data:/var/lib/postgresql/data \
postgres:13.3
- Connecting to the postgres service using pgcli
pgcli -h localhost -U root -p 5432 -d ny_taxi
- Listing the tables in the database
\dt
- Count the number of rows in the table
SELECT COUNT(*) FROM yellow_taxi_data;
- Created upload_data.py script to download from web and upload the data to the postgres database.
Day 3
- Coverted the upload_data.ipynb to injest_data.py with command line arguments
python ingest_data.py \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
- Created a dockerfile to build the ingest-data.py script
docker build -t ingest-data:v001 .
- Ran the injest_data.py script using docker
docker run -it ingest_data:v001 \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
- Created docker network to connect the postgres and injest_data containers
docker network create ny_taxi_network
- Ran the postgres container in the network
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-p 5432:5432 \
-v ny_taxi_postgres_data:/var/lib/postgresql/data \
--network ny_taxi_network \
--name postgres-db \
postgres:13.3
- Ran the injest_data container in the network
docker run -it \
--network ny_taxi_network \
ingest-data:v001 \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="postgres-db" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
- Created a docker-compose.yml file to run the postgres and pgadmin containers
docker-compose up
- Run the injest_data container:
python ingest_data.py \
--postgres_user="root" \
--postgres_password="root" \
--postgres_host="localhost" \
--postgres_port="5432" \
--postgres_db="ny_taxi" \
--postgres_table="yellow_taxi_trips"
- Run the pgadmin container:
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 5050:80 \
dpage/pgadmin4
Day 4
- Created taxi_zones table in the ny_taxi database from upload_data.ipynb notebook
- Ran some queries on the yellow_taxi_trips and taxi_zones tables
SELECT COUNT(*) FROM yellow_taxi_trips;
SELECT COUNT(*) FROM taxi_zones;
SELECT
tpep_pickup_datetime,
tpep_dropoff_datetime,
fare_amount,
CONCAT(zpl_id."Borough", ' / ', zpl_id."Zone") as PickupLocation,
CONCAT(zdl_id."Borough", ' / ', zdl_id."Zone") as DropoffLocation
FROM
yellow_taxi_trips as t,
taxi_zones as zpl_id,
taxi_zones as zdl_id
WHERE
t."PULocationID" = zpl_id."LocationID" AND
t."DOLocationID" = zdl_id."LocationID"
LIMIT 100;
SELECT
CAST(t."tpep_pickup_datetime" as DATE) AS PickupDate,
COUNT(1) as count
FROM
yellow_taxi_trips as t
GROUP BY (CAST(t."tpep_pickup_datetime" as DATE))
ORDER BY count DESC
LIMIT 100
- Created GCP free tier account and created a new project.
- Created a new service account and downloaded the json key file.
- Edit the service account permissions to give it access to the GCP resources like BigQuery-Admin, Storage-Admin, StorageObject-Admin.
- Download the gcloud sdk and installed it.
brew install --cask google-cloud-sdk
- Setup environment variables for the service account
export GOOGLE_APPLICATION_CREDENTIALS="<path-to-service-account-json-file>"
- Authenticated the gcloud sdk
gcloud auth application-default login
- Download terraform and installed it.
brew install terraform
- Created a terraform.tfvars file.
- Created a main.tf file to create the GCP resources.
- CReated a variables.tf file to define the variables.
- Ran the terraform init command to initialize the terraform.
terraform init
- Ran the terraform plan command to see the changes that will be made.
terraform plan
- Ran the terraform apply command to create the resources.
terraform apply
Day 5
- Created gcp-de-2023 ssh key pair
ssh-keygen -t rsa -f ~/.ssh/gcp-de-2023 -C i.mahesh.sinha -b 2048
- Added the public key to the GCP Compute Engine Metadata
- Created a new GCP Compute Engine instance from the GCP console.
- ssh into the instance
ssh -i ~/.ssh/gcp-de-2023 i.mahesh.sinha@<instance-ip>
- Download the Anaconda installer
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
- Updated the .ssh/config file to add the GCP instance
Host de-zoomcamp-2023
HostName <INSTANCE-IP>
User <USERNAME>
IdentityFile <PATH-TO-SSH-KEY>
- Install docker
sudo apt-get update
sudo apt-get install docker.io
- Enable docker without sudo
sudo groupadd docker
sudo gpasswd -a $USER docker
<REBOOT THE INSTANCE>
- Install docker-compose
mkdir bin
cd bin
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose # make it executable
- Add below line in
.bashrc
file, so that we can run docker-compose without specifying the full path
export PATH="${HOME}/bin:${PATH}"
- Clone the repo
git clone <REPO URL>
- Run the docker-compose file
docker-compose up
- Install pgcli
conda install -c conda-forge pgcli
pip install -U mycli
Day 6
- Download terraform on the GCP instance
wget https://releases.hashicorp.com/terraform/1.3.7/terraform_1.3.7_linux_amd64.zip
unzip terraform_1.3.7_linux_amd64.zip
terraform --version
- Copy the GCP service account json file to the GCP instance using sftp
sftp de-zoomcamp-2023
mkdir .gc
put <LOCAL-PATH-TO-SERVICE-ACCOUNT-JSON-FILE> .gc
- Configure the GCP credentials using the service account json file
export GOOGLE_APPLICATION_CREDENTIALS="<PATH-TO-SERVICE-ACCOUNT-JSON-FILE>"
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
- Copied the terraform state file from local to GCP instance
sftp de-zoomcamp-2023
put <LOCAL-PATH-TO-STATE-FILE> terraform.tfstate
- Ran the terraform init command to initialize the terraform.
terraform init
- Ran the terraform plan command to see the changes that will be made.
terraform plan
- Home work Week-1 NOTEBOOK URL:
week-1-env-setup/Homework-Solutions Week-1.ipynb