bigdatagenomics/cannoli

Cannoli on Spark

tahashmi opened this issue · 3 comments

Hi,
I have a question about running Cannoli on a Spark cluster.

When I use the following command on a 15-node cluster:
/home/tahmad/tahmad/bd/cannoli/bin/cannoli-shell --master spark://tcn663:7077 --deploy-mode client --driver-memory 5g --executor-memory 5g --num-executors 15 --executor-cores 20 -i cannoli/run.scala

cannoli/run.scala contains:

import org.bdgenomics.cannoli.Cannoli._ // implicit alignWithBwaMem on fragments; may already be provided by cannoli-shell
val fragments = sc.loadPairedFastqAsFragments("/scratch-shared/tahmad/bio_data/SRR010942/SRR010942_1.filt.fastq.gz", "/scratch-shared/tahmad/bio_data/SRR010942/SRR010942_2.filt.fastq.gz")
val args = new BwaMemArgs()
args.sampleId = "sample"
args.indexPath = "/home/tahmad/mcx/bio_data/ucsc.hg19.fasta"
args.executable="/home/tahmad/tahmad/BWA/new/bwa/bwa"
val alignments = fragments.alignWithBwaMem(args)
alignments.saveAsParquet("/scratch-shared/tahmad/bio_data/SRR010942/SRR010942.alignments.adam")

I get these warnings:
WARN NewHadoopRDD: Loading one large file file:/scratch-shared/tahmad/bio_data/SRR010942/SRR010942_1.filt.fastq.gz with only one partition, we can increase partition numbers for improving performance.
WARN NewHadoopRDD: Loading one large file file:/scratch-shared/tahmad/bio_data/SRR010942/SRR010942_2.filt.fastq.gz with only one partition, we can increase partition numbers for improving performance.

How can I increase the number of partitions?

It is also taking too much time! (The FASTQ data is 6 GB, and it takes more than an hour on 16- and 32-node clusters.)

Hi @heuermh, a kind reminder!

Hello @tahashmi, sorry for the delay.

There are three considerations. First, the FASTQ format is in itself rather inefficient. If your upstream processes could produce unaligned BAM (uBAM) format, that would be preferable.
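
For example, once you have a uBAM, ADAM can load it directly as fragments. A minimal sketch (the path below is hypothetical, and this assumes the uBAM is queryname grouped so read pairs stay together):

// Hedged sketch: load an unaligned, queryname-grouped BAM as fragments
// instead of paired FASTQ. The path here is hypothetical.
val fragments = sc.loadFragments("/scratch-shared/tahmad/bio_data/SRR010942/SRR010942.unaligned.bam")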

Second, as the Hadoop warning describes, GZIP-formatted files cannot be safely partitioned and must be read serially. Block-gzipped (BGZF) files can be partitioned and read concurrently. There are a few tools that can do this conversion; please let me know if you need suggestions.
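
As a stopgap while the input is still plain gzip, you could also repartition the fragments right after the (serial) load so that the alignment step still uses all executors. A rough sketch, assuming ADAM's transform hook; the partition count is only illustrative:

// Hedged sketch: the gzip read itself stays single-partition, but repartitioning
// afterwards lets bwa run across all executors. 300 is only an example value
// (roughly 15 nodes x 20 cores); tune it for your cluster.
val repartitioned = fragments.transform(rdd => rdd.repartition(300))
val alignments = repartitioned.alignWithBwaMem(args)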

Third, Apache Spark and bwa both want as much RAM as possible. In production mode on our internal cluster we typically use --driver-memory 16G --executor-memory 32G and leave a significant amount of RAM unallocated by Spark for use by bwa. The mechanisms for doing this vary depending on how your cluster is managed. If you are seeing poor performance, you may want to monitor your nodes to see whether garbage collection is a significant issue.
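
For illustration only (not a recommendation; the right values depend on your node sizes and on leaving headroom for bwa), your invocation above with those memory settings might look like:

/home/tahmad/tahmad/bd/cannoli/bin/cannoli-shell --master spark://tcn663:7077 --deploy-mode client --driver-memory 16g --executor-memory 32g --num-executors 15 --executor-cores 20 -i cannoli/run.scala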

Hope this helps!

Thanks for the reply.

For the second point, could you please give some suggestions? I can try them.

Regarding the third point, I am using the Cartesius cluster with thin nodes (in 2-, 4-, 8-, 16-, or 32-node configurations) managed by Slurm; each node has 64 GB of memory. If I want to process 30x [1] or 40x [2] data, can you please suggest some high-performance Spark options (like --driver-memory 16G --executor-memory 32G) for Cannoli BWA?

Could you please also share your cluster job script (e.g., your Slurm script) and your Spark command with options? That would be very helpful.

I also noticed that multiple BWA instances run on each worker node, each with only one thread. Is this the most efficient approach (i.e., running multiple single-threaded BWA instances per worker node)?

Thanks!

[1] ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/
[2] ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/