spark-genome-alignment-demo

An example of bioinformatics and big data tools playing nicely together.

You can copy and paste the relevant section below (currently Mac OS X only) to see how the Bowtie aligner can be integrated into an interactive Spark program for bioinformatics work in a big data environment.

Specifically, the steps below do the following:

Build and install prerequisites:

  Java 1.6+
  Apache Maven
  perl JSON (sudo cpan JSON)
  package manager (as needed)
  Apache Spark
  Scala
  Bowtie
  Big Data Genomics ADAM

Index the E. coli genome (NC_008253) that ships with Bowtie
Generate a set of positive-control FASTQ reads from NC_008253
Launch spark-shell, the interactive interface to Spark
Align the control reads with Bowtie from spark-shell
Write the aligned reads out in SAM format
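
Before starting, it can help to confirm which of the command-line prerequisites are already installed. The following is a minimal sketch, not part of the demo itself; the function name check_prereqs and the command names passed to it (java, mvn, perl, git) are assumptions based on the prerequisite list above.

```shell
# Report which of the given commands are on PATH.
# check_prereqs is a hypothetical helper, not something shipped with the demo.
check_prereqs() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Return non-zero if at least one tool was not found.
  return $missing
}

check_prereqs java mvn perl git || echo "install the missing tools before continuing"

# The perl JSON module is a library rather than a command, so it is checked separately.
perl -MJSON -e 1 2>/dev/null || echo "missing: perl JSON (sudo cpan JSON)"
```

Spark, Scala, Bowtie and ADAM are installed by the setup commands below, so they are left out of this check.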

Set up the environment

Mac OS X

If you haven't already, install Homebrew:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Now we're ready to get to work:

brew install apache-spark
brew install scala
git clone https://github.com/allenday/spark-genome-alignment-demo.git
cd spark-genome-alignment-demo
#we'll assume that wherever you are now is where you want to work
export DEMO=`pwd`
mkdir -p build/data
cd $DEMO/build

#save time on Mac: use the pre-built bowtie from Homebrew
brew install homebrew/science/bowtie
#adjust the version in the Cellar path if Homebrew installed a different one
bowtie-build /usr/local/Cellar/bowtie/1.1.2_1/share/bowtie/genomes/NC_008253.fna $DEMO/build/data/NC_008253
#take 50 sequence lines from the genome and wrap each one as a FASTQ read with constant quality 'B'
sort /usr/local/Cellar/bowtie/1.1.2_1/share/bowtie/genomes/NC_008253.fna | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

#or do it from source...
#git clone https://github.com/BenLangmead/bowtie.git
#cd $DEMO/build/bowtie
#make
#./bowtie-build genomes/NC_008253.fna $DEMO/build/data/NC_008253
#cat genomes/NC_008253.fna | sort | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

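#verify bowtie functions as expected
#(if you built from source and are still in $DEMO/build/bowtie, run ./bowtie instead)
#(if md5sum is not available on your Mac, the built-in md5 command computes the same digest)
cat $DEMO/build/data/reads.fq | bowtie $DEMO/build/data/NC_008253 - | md5sum
#should yield ecd5e41dea9692446fa4ae4170d6a1e1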
cd $DEMO/build
git clone https://github.com/bigdatagenomics/adam.git
#adjust the version to match the apache-spark release Homebrew installed
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.4.1
cd $DEMO/build/adam
mvn package install
export ADAM_HOME=`pwd`

Run the demo

#run the alignment script in adam-shell
cat $DEMO/bin/bowtie_pipe_single.scala | $ADAM_HOME/bin/adam-shell
#reset the terminal after the shell exits
reset
cat $DEMO/build/data/reads.sam | md5sum
#should yield 6eebbde8d7818136e9ab924d57af8005

#examine the outputs
head $DEMO/build/data/reads.sam
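
Beyond looking at the first few lines, a quick tally of mapped versus unmapped reads can serve as a sanity check. The sketch below relies on two facts from the SAM format: header lines start with "@", and column 2 is the FLAG field, where bit 0x4 marks an unmapped read. The function name sam_mapping_summary and the toy records are assumptions for illustration; they are not output from the demo.

```shell
# Count mapped and unmapped records in SAM text read from stdin.
# sam_mapping_summary is a hypothetical helper, not part of the demo.
sam_mapping_summary() {
  awk -F '\t' '
    /^@/ { next }                              # skip header lines
    int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4 set: read is unmapped
    { mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
  '
}

# Toy records for illustration only.
printf '@HD\tVN:1.0\nread1\t0\tref\t1\t255\t4M\t*\t0\t0\tACGT\tBBBB\nread2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\tBBBB\n' | sam_mapping_summary
# prints: mapped=1 unmapped=1
```

To apply it to the demo output, run sam_mapping_summary < $DEMO/build/data/reads.sam. Since the reads are positive controls taken from the reference, most or all of them would be expected to show up as mapped.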
