Realest Data
To facilitate the modernization of Realest Estate Corp., we present this project: Realest Data, a modern data platform built on top of Spark and Jupyter.
Platform
The platform is composed of two main components, a set of ETL jobs and a set of analytics notebooks.
Extract
- Spark is used for the heavy lifting, performing extractions and transformations on data.
Explore
- Jupyter is used to enable the more technically savvy to get their hands dirty with the data, either from the raw logs or from a database, in a reproducible and shareable manner, using PySpark as an interface to Spark and Seaborn for visualization.
Setup
To work on or set up Realest Data, install the following tools according to their respective websites:
There may be some other things that require installation to get everything working on your own machine. Some googling will resolve the majority of these issues; if that doesn't work, feel free to create an issue so that this installation guide can be made more universal.
You can check if you have all the tools with:
$ make check
Explore Setup
First, we need to install the Python dependencies, preferably within a virtualenv.
$ make setup_pyspark
For more information, see the Seaborn and Jupyter documentation.
NOTE: In order to export notebooks to PDFs, you'll need to install pandoc and TeX as detailed here.
Use
Extract
jobs is an SBT project that contains Spark applications written in Scala, which operate on data from one location and save the results of those transformations to another.
Submitting a job is as easy as exporting the Master node to send the job to and using submit_job, specifying the job (package.class format) you want to submit.
For example, to submit the TestJob to a local cluster (after setting one up, of course, with make start_local):
$ export SPARK_MASTER=spark://$(hostname):7077
$ make submit_job job=com.realest_estate.TestJob
Remember to tear down the cluster when you're done using it with make kill_local.
Explore
notebooks is the place to hold Jupyter notebooks, providing an easy yet powerful environment for manipulating data at any level in an ad-hoc, yet documentable, fashion. Investigations, prototypes, and research can all be performed in Python on top of PySpark and Seaborn. More languages and frameworks can be supported in the future.
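For a sense of what that looks like, here is a minimal sketch of an exploratory cell. The file path and column names are illustrative assumptions, and spark is assumed to be the notebook's SparkSession:

```python
import seaborn as sns

# Load a raw CSV into a Spark DataFrame (path and columns are illustrative)
properties = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("fixtures/daily_properties")
)

# Pull a manageable sample down to pandas and visualize it with Seaborn
sample = properties.select("SalePrice", "YearBuilt").sample(fraction=0.1).toPandas()
sns.scatterplot(data=sample, x="YearBuilt", y="SalePrice")
```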
Run which python and ensure that it's pointing to the Python interpreter with the dependencies you need. If it isn't:
$ export PYSPARK_PYTHON=<path_to_python_exe>
To run a notebook server in a Spark cluster:
$ # export the url of the master node to `SPARK_MASTER`
$ export SPARK_MASTER=spark://<master_hostname>:<master_port>
$ make pyspark
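Depending on how the notebook is launched, a SparkSession may or may not already be available as spark. If it isn't, a cell along these lines (a sketch, not part of the repo) should create one pointed at the exported master:

```python
import os
from pyspark.sql import SparkSession

# Build (or reuse) a session pointed at the cluster from SPARK_MASTER
spark = (
    SparkSession.builder
    .master(os.environ["SPARK_MASTER"])
    .appName("realest-data-notebook")
    .getOrCreate()
)
```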
Testing
Jobs
Once everything is installed, to start a local cluster, run a test job, and verify your local setup:
$ make test_local
NOTE: If this is your first time with sbt or Spark, this might take a little while, as sbt has to download the right versions of itself, Scala, and the project's dependencies. Also, when building the local cluster and tearing it down, you may be asked for your password for the local ssh connection.
Notebooks
If a Spark cluster isn't running, you can start one locally with make start_local. Export the location of the Master node to SPARK_MASTER and run the Jupyter notebook:
$ export SPARK_MASTER=spark://<master_hostname>:<master_port>
$ make pyspark
A Jupyter notebook will open in your browser, pointed at the notebooks folder. Select the Test notebook and run it. If everything is cool, no error will be thrown, and you can go about your day.
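If you want to sanity-check the cluster connection yourself, a cell like the following should do it (a rough sketch; not necessarily what the Test notebook actually contains):

```python
# Confirm the session is attached to the expected master
print(spark.sparkContext.master)  # e.g. spark://<master_hostname>:7077

# Run a trivial distributed computation to exercise the workers
assert spark.sparkContext.parallelize(range(100)).sum() == 4950
```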
Example
In fixtures/ you'll find a CSV file called daily_properties. This is the training data from the Housing Prices Kaggle Challenge. In this example, we'll run a job to transform this data into something that we'll use in a notebook.
First, set up according to the steps listed above, including the section about "Explore". Once everything is configured, start a local cluster:
$ make start_local
Once the cluster is up, check out its status at http://localhost:8080/.
You should have one worker and a Master node at port 7077 on your local machine.
Run the TransformDailyProperties job:
$ make submit_job job=com.realest_estate.TransformDailyProperties
Once that completes, if everything went well, the following command shouldn't error:
$ cat fixtures/transformed_daily_properties.csv/_SUCCESS
Start the Jupyter notebook server with:
$ make pyspark
Check out the Questions Investigation notebook; it has answers to a variety of questions about the data. Run it and you should see how you're able to read the data returned by the job and perform queries on it.
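As a hedged illustration of the kind of query such a notebook can run against the job's output (the column names here are assumptions, not the job's actual schema):

```python
# Read the job's output back into the notebook
transformed = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("fixtures/transformed_daily_properties.csv")
)

# Ask a question of the data with Spark SQL
transformed.createOrReplaceTempView("daily_properties")
spark.sql("""
    SELECT Neighborhood, ROUND(AVG(SalePrice), 2) AS avg_sale_price
    FROM daily_properties
    GROUP BY Neighborhood
    ORDER BY avg_sale_price DESC
""").show()
```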
Once you're done, do your computer a favor and tear down the cluster:
$ make kill_local
TODO
Right now, while this system is decoupled and organized, the lack of structure makes it rather haphazard to use the pieces in conjunction, due to implicit timing dependencies and inferred schemas. Moreover, the platform currently lacks a concrete database layer, which will make informal analysis by non-technical users (among other tasks) difficult.
Various things that can be done to productionize this platform:
- Provision and orchestrate a Spark cluster using Terraform and Kubernetes
- Provision and configure a Jenkins service to be able to structure pipelines from jobs with Terraform, Ansible, and Docker
- Establish a schema registry so that jobs and databases can work in sync with different data shapes.
- Design codepaths to provision and/or configure databases to be populated by Lambda functions that run when jobs deposit their results in S3 (see the sketch after this list)
- Adjust SBT so that only the job being submitted (and its dependencies) gets packaged
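To make the Lambda idea above concrete, here is a rough, hypothetical sketch of what such a loader could look like. The trigger wiring, table name, columns, and DATABASE_URL environment variable are all assumptions, not existing code:

```python
import csv
import io
import os

import boto3
import psycopg2

def handler(event, context):
    """Hypothetical S3-triggered loader: copy a job's CSV output into a database."""
    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    # Fetch the part file the job just deposited and parse it as CSV
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))

    # Load the rows into a (hypothetical) daily_properties table
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO daily_properties VALUES (%s, %s, %s)",  # columns are illustrative
            [tuple(row[:3]) for row in rows[1:]],                # skip the header row
        )
    conn.close()
```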
These are only some of the possible directions this project can go; it really depends on the business needs.