citiususc/SparkBWA

Error when using Azure blob storage paths for FASTQ files and SAM output file


I was trying to run SparkBWA on Azure's HDInsight with blob storage as the HDFS.

My two FASTQ files and the JAR file are stored in blob storage.

I invoke spark-submit as follows:

spark-submit --class com.github.sparkbwa.SparkBWA --master yarn-cluster --verbose wasb://container-name@account-name.blob.core.windows.net/folder/SparkBWA-0.2.jar -a mem -p -w "-R @RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" -i wasb://container-name@account-name.blob.core.windows.net/folder/hg38.fa wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read1.fastq wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read2.fastq wasb://container-name@account-name.blob.core.windows.net/folder/Output_LP6005083-DNA_B03
My cluster has 3 worker nodes, each with 16 cores and 112 GB of memory.

When I submit this job, the program fails, reporting that no input or output has been specified, even though I have specified both. The only unusual thing is that they are Azure blob storage paths.

Below are the error messages, along with some context. Do I need to specify the blob storage paths differently? Does SparkBWA work on public cloud clusters that use AWS S3 or Azure Blob storage?

17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -a
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: mem
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -p
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -w
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -R @RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -i
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/hg38.fa
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read1.fastq
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read2.fastq
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/Output_LP6005083-DNA_B03
17/04/10 15:43:48 ERROR BwaOptions: [com.github.sparkbwa.BwaOptions] No input and output has been found. Aborting.

SparkBWA was developed to use HDFS, and access to this filesystem is hardcoded. To use a different filesystem, you would need to change every part of the source code that accesses HDFS.
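
Since the reply says only HDFS is supported, it can help to check which filesystem scheme each path uses before submitting a job. The following is a minimal sketch of such a check in Java, using only the standard library. The class and method names (PathSchemeCheck, schemeOf, targetsHdfs) are hypothetical and are not part of SparkBWA, and treating a scheme-less path as HDFS is an assumption that does not hold on clusters whose default filesystem is blob storage.

```java
import java.net.URI;

// Hypothetical helper, NOT part of SparkBWA: reports the filesystem scheme of a path.
final class PathSchemeCheck {

    // Returns the lower-cased URI scheme ("wasb", "hdfs", ...) or "" if the path has none.
    // URI.create throws IllegalArgumentException for strings that are not valid URIs
    // (for example, paths containing spaces).
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "" : scheme.toLowerCase();
    }

    // Assumption: a path targets HDFS if its scheme is "hdfs" or if it has no scheme
    // and the cluster's default filesystem is HDFS.
    static boolean targetsHdfs(String path) {
        String scheme = schemeOf(path);
        return scheme.isEmpty() || scheme.equals("hdfs");
    }

    // Prints one line per argument: scheme, whether it targets HDFS, and the path.
    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(schemeOf(path) + "\t" + targetsHdfs(path) + "\t" + path);
        }
    }
}
```

For the paths in this issue, schemeOf returns "wasb", so targetsHdfs is false. In Hadoop's own API, Path.getFileSystem(Configuration) selects the filesystem implementation from the path's scheme, whereas FileSystem.get(Configuration) returns the cluster's default filesystem. Which calls SparkBWA actually uses is not stated in this thread.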