
spark-blast

Running parallel BLAST on Apache Spark.

To run parallel BLAST on Apache Spark you need a cluster running Apache Hadoop and Apache Spark. Install BLAST on every node, then configure the script "executablast.sh" and distribute it to all nodes.
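The exact contents of "executablast.sh" depend on your installation, but a minimal sketch of such a wrapper is shown below. The blastn program, the "nt" database, and the tabular output format are assumptions for illustration, not requirements of this repository:

#!/bin/bash
# executablast.sh (sketch only; adapt to your own BLAST setup).
# Spark's pipe() writes each RDD element as one line on stdin; here every
# line is treated as the name of a query chunk file present on the node.
# Assumes BLAST+ (blastn) is installed on every node and a database named
# "nt" was formatted with makeblastdb; both names are placeholders.
while read -r chunk; do
  blastn -query "$chunk" -db nt -outfmt 6
done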

Follow these steps:

1 - Split the file that contains the queries (a sketch of a possible splitter follows the command):

$ ./dividi-query
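The dividi-query script ships with this repository; its contents are not reproduced here. As a rough sketch of what such a splitter can do, assuming one chunk per FASTA record and a query_N.fa naming scheme chosen purely for illustration:

#!/bin/bash
# dividi-query (sketch only): split query.fa into one chunk per FASTA record.
# Chunk granularity and the query_N.fa names are assumptions, not the
# repository's actual scheme.
awk '/^>/ { n++; close(out); out = "query_" n ".fa" } { print > out }' query.fa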

2 - Launch spark-shell in one terminal window:

$ spark-shell --executor-memory 5999m --driver-memory 5999m --num-executors 128 --executor-cores 1 --driver-cores 128   # you can adjust these attributes; run spark-shell --help for the full list

3 - In another window, run the script that distributes the query chunks produced in step 1 (a sketch follows the command):

$ ./distribui
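The distribui script also ships with the repository. A plausible sketch, assuming a nodes.txt file listing one worker hostname per line and /home/hadoop as the working directory on every node (both are placeholders):

#!/bin/bash
# distribui (sketch only): copy every query chunk to the same local path on
# each worker node so executablast.sh can open it there.
# nodes.txt and /home/hadoop/ are assumptions; adjust to your cluster.
while read -r node; do
  scp query_*.fa "${node}:/home/hadoop/"
done < nodes.txt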

4 - In spark-shell, execute the following steps:

scala> val data = Seq("query.fa")   // makeRDD needs a collection, not a bare String; list one entry per query chunk

scala> val dataRDD = sc.makeRDD(data)

scala> val script = "/home/hadoop/spark-install/bin/executablast.sh"

scala> val pipeRDD = dataRDD.pipe(script)   // each element is written to the script's stdin

scala> pipeRDD.saveAsTextFile("output")

5 - At this point you have run Spark BLAST. Now merge the part files that Spark generated:

$ hadoop fs -getmerge gs://hadoop-spark-fiocruz/output output.blast

6 - Finally, copy the merged file to your storage repository; here we use a Google Cloud Storage bucket:

$ hadoop fs -copyFromLocal output.blast gs://hadoop-spark-fiocruz/