SparkBWT

About the project

SparkBWT is a tool for calculating the Burrows-Wheeler transform (BWT) on Apache Spark Framework.

Build with

The application has been developed using maven, the main languages are java and scala. To improve the performance it has been used C/C++ languages too, integrated through JNI. In the development of the application was used the framework Apache Spark.

Structure

The source code is the src folder. Inside there are the main folder that contains the code for the application and the test folder that contains the code for testing the classes in main folder.

In src/main we can find:

java contains the JNI glue code and the code for CLI.
native contains the native code, that is the c++ procedure for sorting based on Radix-Sort algorithm
scala contains the implementation of the algorithm in Apache Spark..

Getting started

Prerequisites

The building of the project can be made automatically with maven, but this requires that the following tools are installed in the system:

make
g++

For building in Windows environment you have to use MinGW and CMake.

Build

To build the project from command line:

git clone https://github.com/MR6996/spark-bwt
cd spark-bwt
mvn package -P [profile]

The profiles are window and linux depending on your operating system.

In the created /target folder, we can find the jar file needed to run the application (Should be named as spark-bwt.jar).

Usage

The tool can be launched using the tool provided by default by Apache Spark spark-submit. Can be used a YARN cluster and can be used the option parameters for configuration. A typical usage is:

spark-submit [options] spark-bwt.jar <filename>

for help:

spark-submit spark-bwt.jar -h

License

The project is distributed under GPL v.3 License More info

References

[1] Mario Randazzo, Simona E. Rombo A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform