Error when using Azure blob storage paths for FASTQ files and SAM output file

Question

Closed this issue 8 years ago · 1 comments

I was trying to run SparkBWA on Azure HDInsight, with Azure Blob storage as the HDFS-compatible filesystem.

My two FASTQ files and the JAR file are in blob storage.

I specify spark-submit as follows:

spark-submit --class com.github.sparkbwa.SparkBWA --master yarn-cluster --verbose wasb://container-name@account-name.blob.core.windows.net/folder/SparkBWA-0.2.jar -a mem -p -w "-R @RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" -i wasb://container-name@account-name.blob.core.windows.net/folder/hg38.fa wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read1.fastq wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read2.fastq wasb://container-name@account-name.blob.core.windows.net/folder/Output_LP6005083-DNA_B03

My cluster has 3 worker nodes, each with 16 cores and 112 GB of memory.

When I submit this job, it fails with an error saying that no input and output were found, even though I specified both as valid Azure Blob storage paths.

Below are the error messages, along with some context. Do I need to specify the blob storage paths differently? Does SparkBWA work on public cloud clusters backed by AWS S3 or Azure Blob storage?

17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -a
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: mem
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -p
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -w
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -R @RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: -i
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/hg38.fa
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read1.fastq
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/LP6005083-DNA_B03-read2.fastq
17/04/10 15:43:48 INFO BwaOptions: [com.github.sparkbwa.BwaOptions] :: Received argument: wasb://container-name@account-name.blob.core.windows.net/folder/Output_LP6005083-DNA_B03
17/04/10 15:43:48 ERROR BwaOptions: [com.github.sparkbwa.BwaOptions] No input and output has been found. Aborting.

Answer 1 · 2017-04-24T07:02:31.000Z

SparkBWA was developed to use HDFS, and access to that filesystem is hardcoded. If you want to use a different filesystem, you need to change every part of the source code that accesses HDFS.
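
As a rough illustration of the point above, here is a minimal sketch of a pre-flight check that flags arguments naming a filesystem other than HDFS before a job is submitted. The class PathSchemeCheck and its methods are hypothetical helpers written for this example; they are not part of SparkBWA and do not reflect how SparkBWA parses its arguments. The sketch only inspects the URI scheme of each path with the standard java.net.URI class.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight helper (an assumption for illustration, not SparkBWA code).
public class PathSchemeCheck {

    // Returns the URI scheme of a path ("wasb", "hdfs", ...), or "" when none is given.
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "" : scheme.toLowerCase();
    }

    // Collects the paths that explicitly name a filesystem other than HDFS.
    // Paths without a scheme are left alone: they resolve against the cluster's
    // default filesystem, which this check cannot know.
    static List<String> nonHdfsPaths(List<String> paths) {
        List<String> flagged = new ArrayList<>();
        for (String p : paths) {
            String scheme = schemeOf(p);
            if (!scheme.isEmpty() && !scheme.equals("hdfs")) {
                flagged.add(p);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Pass the index, FASTQ, and output paths as command-line arguments.
        for (String p : nonHdfsPaths(Arrays.asList(args))) {
            System.out.println("Not an HDFS path: " + p);
        }
    }
}
```

Run against the index, FASTQ, and output paths from the question, this would flag every wasb:// argument. It does not make those paths usable; that would still require copying the data into HDFS or changing the SparkBWA source as described above.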