Spark Modularized View (SMV) enables users to build enterprise-scale applications on the Apache Spark platform.
- Scales with DATA size
- Scales with CODE size
- Scales with TEAM size
In addition to the data scalability inherited from Spark, SMV also provides code and team scalability through the following features:
- Multi-level modular design that allows developers to work on large-scale projects and enables easy code and data reuse
- Multi-grain traceability to support full-scope knowledge transparency for developers and data users
- Interfaces to multiple languages (Scala and R for now) for easy integration with existing code and to leverage existing developer experience
- Pure text code, so modern CM (Configuration Management) tools can be used to track and merge changes among team members
- Automatic data and code version synchronization to enable coordination at both the code and data levels
- Data publishing mechanism to support inter-team coordination
- Built-in data quality management to ensure data quality on a continuous basis
- High-level helper functions and tools for quick data application development
Please refer to the User Guide and API docs for details.
Note: The sections below were extracted from the User Guide, which should be consulted for more detailed instructions.
Install Docker. An installation guide for your machine may be found here.
The first time you run the smv Docker image, Docker will download it for you automatically. You need to tell Docker where to find your projects directory and your data directory. You will enter a shell with all SMV tools installed. Find your projects in /projects and your data in /data. Note that both the projects and data directories must already exist on the host system.
$ docker run -it -v /path/to/projects:/projects -v /path/to/data:/data tresamigos/smv
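Since both host directories must exist before the container starts, a small wrapper can create them first. The following is a minimal sketch, not part of SMV: `smv_docker_cmd` is a hypothetical helper name, and it only prints the `docker run` command shown above so that it can be reviewed before being executed. It assumes absolute host paths, which is what Docker bind mounts with `-v` expect.

```shell
# Hypothetical helper (not shipped with SMV): create the host directories
# if they are missing, then print the docker command from this section.
smv_docker_cmd() {
  projects_dir=$1
  data_dir=$2

  # Bind mounts given with -v need absolute host paths.
  for d in "$projects_dir" "$data_dir"; do
    case $d in
      /*) ;;
      *) echo "not an absolute path: $d" >&2; return 1 ;;
    esac
  done

  # Both directories must exist on the host before the container starts.
  mkdir -p "$projects_dir" "$data_dir" || return 1

  printf 'docker run -it -v %s:/projects -v %s:/data tresamigos/smv\n' \
    "$projects_dir" "$data_dir"
}

# Example (prints the command; pipe it to sh to actually run it):
#   smv_docker_cmd /path/to/projects /path/to/data
```

Printing rather than running the command keeps the sketch free of side effects beyond creating the two directories.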
The smv-core image contains only the tools needed for SMV development, so you may build SMV from source with mvn or sbt. Find your SMV source in /smv and your projects in /projects in the container.
Run smv-core.sh from _SMV_HOME_/docker/smv-core:
_SMV_HOME_/docker/smv-core/$ ./smv-core.sh /path/to/projects
or from any other directory, also passing the path to your SMV source:
/any/other/directory$ _SMV_HOME_/docker/smv-core/smv-core.sh /path/to/projects /path/to/smv
SMV provides a shell script to easily create an example application. The example app can be used for exploring SMV and it can also be used as an initialization script for a new project.
$ _SMV_HOME_/tools/smv-init MyApp com.mycompany.myapp
$ mvn clean install
$ _SMV_HOME_/tools/smv-run --run-app
The output CSV file and schema can be found in the data/output directory (as configured in the conf/smv-user-conf.props file).
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"32",981295
"33",508120
"34",3324188
"35",579916
"36",7279345
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.schema/part-*
FIRST('ST): String
EMP: Long
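As the commands above show, each output is a directory of `part-*` files rather than a single file. When a single CSV is more convenient, the parts can be concatenated. This is a minimal sketch under that assumption; `merge_parts` is a hypothetical helper name, not an SMV tool, and the `XXXXXXXX` suffix in the usage comment stands for whatever version suffix appears in your output directory.

```shell
# Hypothetical helper (not shipped with SMV): concatenate all part-* files
# of one output directory into a single file.
merge_parts() {
  out_dir=$1
  merged=$2

  # Expand the glob into the positional parameters (sorted by name).
  set -- "$out_dir"/part-*
  if [ ! -e "$1" ]; then
    echo "no part-* files in $out_dir" >&2
    return 1
  fi

  cat "$@" > "$merged"
}

# Example:
#   merge_parts data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.csv merged.csv
```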
See the Getting Started section of the User Guide for further details.
If smv-run is given the -g flag, instead of running and persisting the module, the module dependency graph will be created as a dot file. It can be converted to png using the dot command.
$ _SMV_HOME_/tools/smv-run -g -m com.mycompany.myapp.stage1.EmploymentByState
$ dot -Tpng com.mycompany.MyApp.stage1.EmploymentByState.dot -o graph.png
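If several graphs have been generated, the conversion step can be applied to every dot file in a directory. This is a minimal sketch, assuming Graphviz is installed and that the dot files sit in one directory; `dots_to_png` is a hypothetical helper name and is not part of SMV.

```shell
# Hypothetical helper (not shipped with SMV): convert every .dot file in a
# directory to a .png with the same base name.
dots_to_png() {
  dir=${1:-.}

  # Fail early if Graphviz is not available.
  if ! command -v dot >/dev/null 2>&1; then
    echo "Graphviz 'dot' command not found" >&2
    return 1
  fi

  for f in "$dir"/*.dot; do
    [ -e "$f" ] || continue   # the glob matched nothing
    dot -Tpng "$f" -o "${f%.dot}.png" || return 1
  done
}

# Example:
#   dots_to_png .
```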
See Run SMV Application for further details.