This is a simple project that showcase one of the most predominant ETL/ELT jobs of a Data Engineer -- DATA REPLICATION
. Data replication should not be confused with DATA MIGRATION
, which in most case is carried out once and the source database system is abandoned after the task is finished.
Data Replication often refers to the periodic copying of data
from a data source on one platform to a destination on another one without discarding the data source
.
Both process does advocate for the moving data from a source(s) platform to other destination(s) platform.
ETL -- Extraction-Transform-Load, a classical approach.
EtLT -- Extraction-t(minor transformation e.g validation checks, reduplication)-Load-Transformation. Cases including EtLT arises if the Data Analysts is charged instead with implementing/Transforming the business logic into the data schema not the Data Engineer.
Early-Intermediate Friendly 😎 😎
-
Use python, to Populate (using the
faker library
) our source relational database (MySQL) -
Use python, to Extract the data into Apache Spark, as our transformation engine
-
Use python, to Transform the data using Pyspark dataframe and spark SQL
-
Use python, to Load the data into a Postgres database
-
Use python, to Orchestrate the workflow with Apache Airflow
- Relational Database -- Postgres, MySQL
- Big Data Processing Tool -- Apache Spark
- Workflow and Orchestration -- Apache Airflow
- Languages -- Python and SQL
A good understanding of these syntax style in python and SQL would help:
- python try/exception block
- SQL DDL syntax
-
Install and setup Apache Airflow, Do check the internet for more information, but I'll recommend this Medium article for Ubuntu (all Linux actually, I'll assume 🤓 🤓). Note, no need to install extra dependencies with airflow,
pyspark
andsklearn
are not needed. -
Inside the cloned repo, create your virtual env, and run
pip install -r dependencies.txt
in the activated virtual env to set up required libraries and packages. -
Edit the
/login_credentials/connection_profile.conf
for applicable information. This is important. -
Edit line 7 of
sourcing_and_setting_up_db.py
to include the absolute path of the cloned repo. -
Move only
dag.py
to~/airflow/dags/
folder to be detectable by your Airflow Scheduler. -
If just want to read through, I'll recommend using the jupyter note
.ipynb
but if you are interested in the workflow, follow the.py
and do tigger the dag manually in your Airflow web UI or editdag.py
to schedule your runs. -
To run your jupyter note in either Jupyter or VS code with your virtual env, you need to run this two lines on your bash while your virtual env is activated:
$ pip install ipykernel
$ ipython kernel install --user --name=<any_name>.
Remember to select the installed kernel in the Anaconda Jupyter Notebook
-
We can improve the project by expanding our source dataset and bringing Spark to it power with more complex transformation.
-
We can include cloud services (e.g AWS EMR, AWS AURORA) at any stage of our ETL/EtLT
Now go get your cup of coffee/tea and enjoy a good code-read and criticism. 👍 👍