One of the main obstacles in Data Engineering is the large and varied set of technical skills required on a day-to-day basis.
*** Note - If you email a link to your GitHub repo with all the completed exercises, I will send you back a free copy of my ebook Introduction to Data Engineering. ***
The aim of this repository is to help you learn and develop those skills. Here are the high-level topics that these practice problems cover.
- Python data processing.
- csv, flat-file, parquet, json, etc.
- SQL database table design.
- Python + Postgres, data ingestion and retrieval.
- PySpark
- Data cleansing / dirty data.
You will need two things to work effectively on most of these problems.
- Docker
- docker-compose
All the tools and technologies you need will be packaged into the Dockerfile for each exercise.
For each exercise you will need to cd into that folder and run the docker build command. That command is listed in the README for each exercise; follow those instructions.
The first exercise tests your ability to download a number of files from an HTTP source and unzip them, storing them locally with Python.
cd into Exercises/Exercise-1 and see the README in that location for instructions.
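
As a rough illustration of the download-and-unzip step, here is a minimal sketch using only the Python standard library. The function name `download_and_unzip` and the destination folder are assumptions made for this example; the actual URLs and requirements are in the exercise README.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def download_and_unzip(url: str, dest_dir: Path) -> list[str]:
    """Fetch one zip archive and extract it into dest_dir.

    Returns the names of the archive members that were extracted.
    (Hypothetical helper, not part of the exercise code.)
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Read the whole response into memory; acceptable for small archives.
    with urlopen(url, timeout=30) as response:
        payload = response.read()
    # Open the downloaded bytes as a zip archive and unpack every member.
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling this once per URL in a loop covers the "number of files" case. For large archives it would likely be better to stream the response to disk instead of holding it in memory.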
The second exercise tests your ability to perform web scraping, build URIs, download files, and use Pandas to do some simple cumulative actions.
cd into Exercises/Exercise-2 and see the README in that location for instructions.
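
The scraping and URI-building part might look something like the sketch below, which uses only the standard library. `LinkCollector`, `build_file_uris`, and the `.csv` suffix filter are assumptions for illustration; the real page and the file to look for are described in the exercise README, and the Pandas step is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_file_uris(page_url: str, html: str, suffix: str = ".csv") -> list[str]:
    """Return absolute URIs for every link on the page that ends with suffix."""
    parser = LinkCollector()
    parser.feed(html)
    # urljoin resolves relative and root-relative hrefs against the page URL.
    return [urljoin(page_url, href) for href in parser.hrefs if href.endswith(suffix)]
```

Resolving links with `urljoin` rather than string concatenation avoids mistakes when a page mixes relative and absolute links.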
The third exercise tests a few skills. This time we will be using a popular AWS package called boto3 to perform a multi-step action that downloads some open source S3 data files.
cd into Exercises/Exercise-3 and see the README in that location for instructions.
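
To give a feel for what a multi-step S3 download can look like, here is a sketch under a loudly stated assumption: that the first object is a gzip-compressed text file listing other object keys, and that the second step downloads the first key listed. That flow, the function name, and the bucket and key values are all hypothetical; the actual steps are in the exercise README. The only boto3 behavior relied on is that an S3 client's `get_object(Bucket=..., Key=...)` returns a dict whose `"Body"` can be read as bytes.

```python
import gzip


def fetch_first_listed_object(s3_client, bucket: str, manifest_key: str) -> tuple[str, bytes]:
    """Download a gzip manifest, then download the first object it lists.

    The manifest format (one key per line) is an assumption for this sketch.
    """
    # Step 1: download the compressed manifest and decompress it in memory.
    manifest = s3_client.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
    lines = gzip.decompress(manifest).decode("utf-8").splitlines()
    # Step 2: take the first non-empty line as the key of the data file.
    data_key = next(line.strip() for line in lines if line.strip())
    # Step 3: download that object.
    data = s3_client.get_object(Bucket=bucket, Key=data_key)["Body"].read()
    return data_key, data
```

With boto3 installed and credentials configured, the client would typically be created with `boto3.client("s3")` and passed in as `s3_client`. Taking the client as a parameter keeps the function easy to test with a stand-in object.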
The fourth exercise focuses more on file types, json and csv, and working with them in Python. You will have to traverse a ragged directory structure, finding any json files and converting them to csv.
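
One way to approach the traversal and conversion is sketched below with the standard library. The helper names `flatten` and `convert_json_tree`, the underscore-joined column names, and the choice to write each csv next to its json file are assumptions for this example, not requirements taken from the exercise.

```python
import csv
import json
from pathlib import Path


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def convert_json_tree(root: Path) -> list[Path]:
    """Write a .csv next to every .json file found anywhere under root."""
    written = []
    # rglob walks every subdirectory, however unevenly nested.
    for json_path in sorted(root.rglob("*.json")):
        data = json.loads(json_path.read_text(encoding="utf-8"))
        records = data if isinstance(data, list) else [data]
        rows = [flatten(record) for record in records]
        # Union of keys in first-seen order, so records with differing keys still fit.
        fieldnames = list(dict.fromkeys(key for row in rows for key in row))
        csv_path = json_path.with_suffix(".csv")
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(csv_path)
    return written
```

Values that are lists are written using their Python string form in this sketch; a real solution may want to handle them more deliberately.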
The fifth exercise is going to be a little different from the rest. In this problem you will be given a number of csv files. You must create a data model / schema to hold these data sets, including indexes, then create all the tables inside Postgres by connecting to the database with Python.
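
The general shape of "define a schema, then create it from Python" could be sketched as follows. The table names, columns, and index here are invented for illustration and are not the exercise's data model. The function accepts any DB-API connection, which is an assumption that lets the sketch run against an in-memory SQLite database as a stand-in; for the exercise itself the connection would come from a Postgres driver such as psycopg2.

```python
import sqlite3  # stand-in only, so the sketch runs without a Postgres server

# Hypothetical schema: two tables linked by a foreign key, plus one index.
DDL_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS accounts (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        join_date   DATE
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS transactions (
        transaction_id TEXT PRIMARY KEY,
        customer_id    INTEGER NOT NULL REFERENCES accounts (customer_id),
        amount         NUMERIC(10, 2)
    )
    """,
    "CREATE INDEX IF NOT EXISTS idx_transactions_customer ON transactions (customer_id)",
]


def create_schema(connection) -> None:
    """Run every DDL statement on a DB-API connection and commit."""
    cursor = connection.cursor()
    try:
        # Parent tables come first so foreign keys can reference them.
        for statement in DDL_STATEMENTS:
            cursor.execute(statement)
        connection.commit()
    finally:
        cursor.close()
```

For a quick local check, `create_schema(sqlite3.connect(":memory:"))` runs the statements without any server. Against Postgres, the same function would be called with the connection returned by `psycopg2.connect(...)`, using whatever host, database, and credentials the exercise's docker-compose setup provides.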
The sixth exercise is going to step it up a little and move on to more popular tools. In this exercise you will load some files using PySpark and then do some basic aggregation.
Best of luck!
*** IN PROGRESS ***
The seventh exercise tries a project with another popular Big Data tool, namely ElasticSearch. It is very different from the last project with PySpark, and it will require more attention to detail and fine-tuning. You will ingest a .txt file into a locally running ElasticSearch instance and then retrieve some information from what you just stored.
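
As one possible starting point for the ingestion step, the sketch below turns the lines of a text file into a request body in the newline-delimited JSON format that the ElasticSearch bulk endpoint accepts. The function name, the index name, and the `line_number` and `text` field names are assumptions for illustration; since the exercise is still in progress, its actual requirements may differ. Sending the body to a running instance is not shown.

```python
import json


def build_bulk_body(lines, index_name: str) -> str:
    """Build a bulk-API body that indexes each non-empty line as one document."""
    parts = []
    for line_number, text in enumerate(lines, start=1):
        text = text.strip()
        if not text:
            continue  # skip blank lines
        # Action line: which index and document id the next line belongs to.
        parts.append(json.dumps({"index": {"_index": index_name, "_id": str(line_number)}}))
        # Source line: the document itself (hypothetical field names).
        parts.append(json.dumps({"line_number": line_number, "text": text}))
    if not parts:
        return ""
    # The bulk format requires the body to end with a newline.
    return "\n".join(parts) + "\n"
```

The resulting string would be sent to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Indexing one document per line is a design choice made for this sketch; a real solution might group the text differently.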