The application code resides in `src/`, the tests are in `tests/`, and the data is located in `data/`.
In the root code folder `src/` we mainly have:

- `read_write` holds code to read/write data
- `computations` contains all the data pipelines
- `utils` holds a set of utilities used across the code base
- `main.py` defines a single entrypoint to the data pipeline. It contains a `main` function.
Here we maintain all the code needed for data-related tasks, e.g. computing metrics. In the sections below, we detail each submodule further.
- `schema.py` defines all the schemas for all the tables we have. For example, `tweets` would define all the column names and column fields of the tweets table; similarly, project metrics defines all the columns we generate in the metrics computations, etc. (see the sketch after this list).
- `task.py` is where we define the abstract `DataTask` class, which holds common information used downstream in each corresponding submodule.
- `tweets_counts.py` is a data task which computes basic metrics on tweets.
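To make the schema module concrete, here is a minimal, purely illustrative sketch; the table and column names below are assumptions used only to show the pattern of one class per table listing its columns.

```python
# schema.py (illustrative sketch): one class per table, listing its column names.
# The concrete column names below are hypothetical.


class Tweets:
    """Columns of the tweets table (hypothetical names)."""

    TWEET_ID = "tweet_id"
    USER_ID = "user_id"
    TEXT = "text"
    CREATED_AT = "created_at"


class ProjectMetrics:
    """Columns generated by the metrics computations (hypothetical names)."""

    METRIC_NAME = "metric_name"
    METRIC_VALUE = "metric_value"
    COMPUTED_AT = "computed_at"
```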
Note: the graph below is expected to render nicely on GitHub as well by the end of Q1 2022. To enable this in PyCharm, go to Settings -> Languages & Frameworks -> Markdown, then install and enable the Mermaid extension (reference).
```mermaid
flowchart TD
    DataTask --> TweetsCount;
```
The root element is the `DataTask` abstract class (in its own file), which mainly defines an interface to be implemented by the inheriting concrete classes. Mainly, it covers the following:

- initialization of the `JobContext` containing a Spark session
- definition of the functions that any implementing class shall provide or inherit.

In short, the functions are meant to do the following: `read` gets the data; `transform` processes the data by performing the needed computation and returns a Spark DataFrame; `preview` is implemented in the `Task` class, where we first `read`, then `transform`, and return the resulting DataFrame (see the sketch below).
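For illustration, the interface described above could look roughly like the sketch below. Only `DataTask`, `JobContext`, `read`, `transform`, and `preview` are named in this document; the signatures and everything else here are assumptions, and `preview` is shown on the abstract class for simplicity.

```python
from abc import ABC, abstractmethod

from pyspark.sql import DataFrame, SparkSession


class JobContext:
    """Holds state shared by all tasks; here, just the Spark session (simplified)."""

    def __init__(self, app_name: str = "data-pipeline"):
        self.spark = SparkSession.builder.appName(app_name).getOrCreate()


class DataTask(ABC):
    """Abstract interface that every data task implements or inherits from."""

    def __init__(self, context: JobContext):
        self.context = context

    @abstractmethod
    def read(self) -> DataFrame:
        """Get the input data as a Spark DataFrame."""

    @abstractmethod
    def transform(self, df: DataFrame) -> DataFrame:
        """Perform the needed computation and return a Spark DataFrame."""

    def preview(self) -> DataFrame:
        """First read, then transform, and return the resulting DataFrame."""
        return self.transform(self.read())
```

Under these assumptions, a concrete task such as `TweetsCount` would only implement `read` and `transform` and inherit `preview`.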
- install Spark (see the installation guide)
- install the requirements: `pip install -r requirements.txt`
- run the entrypoint (an illustrative sketch follows below)
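The exact run command is not spelled out here. As a purely hypothetical sketch, the `main` function in `src/main.py` might wire the pieces together along these lines (the import paths and the `TweetsCount` constructor are assumptions):

```python
# main.py (hypothetical sketch): build a JobContext, run one task, show the result.
from task import JobContext            # assumed import path
from tweets_counts import TweetsCount  # assumed class name for tweets_counts.py


def main():
    context = JobContext()
    task = TweetsCount(context)
    df = task.preview()  # read, transform, and return the resulting DataFrame
    df.show()


if __name__ == "__main__":
    main()
```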