This is the code repository for the big data quality management paper titled 'SparkDQ: Efficient Generic Big Data Quality Management on Distributed Data-Parallel Computing'.
The project is based on Apache Spark, a unified analytics engine for big data. The Scala library for SparkDQ can be assemblied by maven.
The detailed setup and running methods are the same as an Apache Spark application.
- Git
- Java 1.8
- Hadoop 2.7.4
- Spark 2.2.0
- Python 3 with pip
- Anaconda 3
- Maven 3.5.4
When installing hadoop and spark, make sure that you have basic environment variables for them, such as HADOOP_HOME
, SPARK_HOME
and PATH
.
To install SparkDQ, all nodes of your distributed cluster should be have an Anaconda environment.
The Scala library can be packaged with the command mvn assembly:assembly
. Packaged JARs should be put in a Python application directory.
- Clone source code
git clone https://github.com/PasaLab/SparkDQ
- Generate SparkDQ installer
python setup.py sdist
An distributed installer could be found in
dist
directory.
- Deploy SparkDQ in all nodes' anaconda environment
pip install sparkdq-1.0.tar.gz
- Prepare SparkDQ applications
SparkDQ presents a number of test applications in
test
directory. You can use them directly, or try to write your own application accordingly.
- Test SparkDQ
SparkDQ presents an interactive webpage to import SparkDQ applications and run it. Launch Jupyter notebook like this:
nohup $JUPYTER_PATH$ notebook > log.out 2>&1 &
You can see detailed log in the directory where you run this command.
After that, you can visit the webpage, like
slave001:8888
, and then import packages for tests. The default port is8888
. You can change it according to your requirements.