Lake-cabin-project

This is a repository to hold the final project of team lake-cabin. The team members are Anthony Wong, Samuel Fowler, Shehryar Mughal, Vasile Condrea and Hana Mohamed.

This project aims to showcase skills and knowledge in the field of data engineering by creating applications that extract, transform, and load data from a prepared source into a data lake and warehouse hosted on AWS. The focus is on creating reliable, resilient solutions that are deployed and managed in code.

The project requires the use of Python, SQL, database modeling, AWS, and agile working practices.

The project includes the following components:

  • Two S3 buckets: one for ingested data and one for processed data, both structured and organized for easy data access.
  • A Python application that extracts data from the prepared source and stores it in the "ingested" S3 bucket at defined intervals. The application is logged and monitored.
  • A Python application that remodels the data into a predefined schema for a data warehouse and stores the data in parquet format in the "processed" S3 bucket.
  • A Python application that loads the data into a prepared data warehouse at defined intervals.
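One way to keep both buckets organized is to give every object a predictable key. The sketch below is only an illustration: the layout (one prefix per table, followed by the extraction date), the function name object_key, and the table name used in the example are assumptions, not the naming scheme this project actually uses.

```python
from datetime import datetime, timezone


def object_key(table, extracted_at, extension):
    """Build an S3 object key from a table name and an extraction time.

    Hypothetical layout (an assumption, not taken from this project):
    <table>/<YYYY>/<MM>/<DD>/<table>_<HHMMSS>.<extension>
    """
    # Normalize to UTC so keys sort consistently regardless of the caller's time zone.
    stamp = extracted_at.astimezone(timezone.utc)
    return f"{table}/{stamp:%Y/%m/%d}/{table}_{stamp:%H%M%S}.{extension}"


# Example with a hypothetical table name:
# object_key("sales", datetime(2023, 1, 31, 12, 0, 0, tzinfo=timezone.utc), "parquet")
```

With a layout like this, listing a single prefix returns everything extracted for one table on one day, which keeps lookups simple in both the ingested and the processed bucket.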

All of the Python code has been tested, and as much of the project as possible has been deployed automatically using CI/CD techniques.

DEPENDENCY REQUIREMENTS

Before running the deployment script, install the dependencies for the transformation script and for the deployment script as described below.

TRANSFORMATION SCRIPT

There are two ways to get the necessary packages for the transformation script:

Method 1

Download the following wheel files from the Python Package Index (https://pypi.org/):

  • numpy-1.24.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • pandas-1.5.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • pyarrow-10.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • pytz-2022.7-py2.py3-none-any.whl

To install these packages:

  • Create a package directory in the path Data_Manipulation/src/data_transformation_code/.
  • Unzip the .whl files and place the resulting folders in the package directory created in the previous step.
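The two steps above can also be scripted. Because a .whl file is a zip archive, Python's standard zipfile module can unpack it. This is a minimal sketch, not part of the repository: the function name unpack_wheels and the "downloads" directory in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack_wheels(wheel_dir, package_dir):
    """Extract every .whl file found in wheel_dir into package_dir.

    Returns the file names of the wheels that were extracted, in sorted order.
    """
    wheel_dir = Path(wheel_dir)
    package_dir = Path(package_dir)
    # Step 1: create the package directory if it does not exist yet.
    package_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    for wheel in sorted(wheel_dir.glob("*.whl")):
        # Step 2: a wheel is a zip archive, so unzip it into the package directory.
        with zipfile.ZipFile(wheel) as archive:
            archive.extractall(package_dir)
        extracted.append(wheel.name)
    return extracted


# Example, assuming the wheels were downloaded into a hypothetical "downloads" folder:
# unpack_wheels("downloads", "Data_Manipulation/src/data_transformation_code/package")
```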

Method 2

To install the packages using this method, follow these steps:

  • cd into the Data_Manipulation directory.
  • Run the command pip install -r package-requirements.txt -t src/data_transformation/package.

DEPLOYMENT SCRIPT

The deployment script requires the pg8000 package. To install it:

  • cd into the deployment directory.
  • Run the command pip install pg8000 -t src/ingestion-folder/package.

RUNNING THE SCRIPT

Before running the deployment script, make sure to do the following:

  • Set up AWS credentials with awsume: define the alias with alias awsume=". awsume", then run awsume [profile].
  • Confirm that your credentials are valid and you're connected to AWS by running aws sts get-caller-identity in your terminal.
  • Add your database credentials to the deployment/db_creds_source.json and deployment/db_creds_destination.json files.
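A quick check of the credentials files before deploying can catch an empty or incomplete file early. The sketch below is an assumption-heavy illustration: this README does not document the schema of the JSON files, so the key names in ASSUMED_KEYS are hypothetical, as is the helper missing_credentials. Adjust the keys to whatever the deployment code actually reads.

```python
import json
from pathlib import Path

# Hypothetical key names: the real schema of the credentials files is not
# documented here, so treat this tuple as a placeholder.
ASSUMED_KEYS = ("host", "port", "database", "user", "password")


def missing_credentials(path, required=ASSUMED_KEYS):
    """Return the required keys that are absent or empty in a JSON credentials file."""
    creds = json.loads(Path(path).read_text())
    return [key for key in required if creds.get(key) in (None, "")]


# Example, run from the repository root:
# for name in ("deployment/db_creds_source.json", "deployment/db_creds_destination.json"):
#     print(name, missing_credentials(name))
```

An empty list means every expected key has a value; any names returned still need to be filled in.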

To run the script, cd into the deployment directory and run the command bash deploy_ingestion.sh.