
chunk HiC mapping step

iggyB opened this issue · 3 comments

iggyB commented


Nice job on Falcon-Phase!

Mapping HiC reads is a step that takes quite some time. For those who run the pipeline on a cluster, this could be improved rather easily by splitting the HiC reads into chunks. There are a number of tools that can do it, but even a simple split does the job.

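I did it by counting the read pairs, dividing that by njobs (or the number of available nodes) to get an approximate chunk size, and using it to split the reads. Here each chunk is 40,000,000 lines, i.e. 10M reads, since a FASTQ record spans four lines: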
zcat reads.R1.fastq.gz | split -l 40000000 - reads.R1.10M
zcat reads.R2.fastq.gz | split -l 40000000 - reads.R2.10M
Then it was just a matter of submitting separate mapping jobs and merging the BAM files to get "aln.unfiltered.bam".
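The chunk-size arithmetic described above can be sketched as a small script. This is only a minimal illustration, not part of Falcon-Phase: the function name `chunk_lines`, the output prefixes, and the three-argument calling convention are all hypothetical, and `gzip -dc` is used as a portable stand-in for `zcat`. The key point is that the line count passed to `split -l` must be a multiple of four and identical for R1 and R2, so that mates stay in the same chunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Falcon-Phase): compute how many FASTQ
# lines go into each chunk so that read pairs are spread over NJOBS chunks.
# Usage: chunk_lines READ_PAIRS NJOBS
chunk_lines() {
    local pairs=$1 njobs=$2
    # Ceiling division, so the last chunk absorbs the remainder instead of
    # producing an extra, tiny chunk.
    local reads_per_chunk=$(( (pairs + njobs - 1) / njobs ))
    # Each FASTQ record is four lines; keeping a multiple of four ensures
    # no record is cut in half.
    echo $(( reads_per_chunk * 4 ))
}

# Only run the split when called as: script.sh R1.fastq.gz R2.fastq.gz NJOBS
if [ "$#" -eq 3 ]; then
    r1=$1
    r2=$2
    njobs=$3
    pairs=$(( $(gzip -dc "$r1" | wc -l) / 4 ))
    lines=$(chunk_lines "$pairs" "$njobs")
    # Same line count for both mates keeps R1 and R2 chunks in sync.
    # Output prefixes are illustrative; split appends aa, ab, ac, ...
    gzip -dc "$r1" | split -l "$lines" - reads.R1.chunk.
    gzip -dc "$r2" | split -l "$lines" - reads.R2.chunk.
fi
```

For example, 100M read pairs over 10 jobs gives 40,000,000 lines per chunk, matching the value used in the commands above. Each R1/R2 chunk pair would then be mapped in its own job, and the per-chunk BAM files combined afterwards, for instance with `samtools merge`.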

This can be easily integrated (or done in a neater way) into your workflow, especially the pb-assembly branch. The machinery is there, and you even use it when generating the "haplotig.placement" file by running multiple mummer jobs.


zeeev commented

Hi @iggyB,

Thank you for the kind words, we are happy people are using the code! First, let me offer up the simplest solution to long mapping times: increase the number of cores available to bwa-mem (in the config.json). That being said, your solution is optimal. I've tagged this issue as an enhancement. My development time is limited, but I'll see what I can do to get this implemented. If you end up coding the method, I'd be happy to take a pull request.



iggyB commented

Hej @zeeev,

It's a big step towards (nearly) fully phased polyploid assemblies. Perhaps not that many projects possess the required data sets, but with the technology and methods spreading, I'm sure more and more people will be interested in running Falcon-Phase.

I did change the core count, which was of course the obvious thing to do :) But then I decided to create the alignment outside the pipeline anyway.
One more suggestion: the different steps in the pipeline should be configurable to use different amounts of resources (like in Falcon/Falcon-Unzip).

Time is an issue. I'll be happy to share the code if I get it done before you do :)


We haven't gotten to this and it hasn't come up again, so we will close this, as we do not currently plan to implement chunking natively (which would be required in some fashion to parallelize the other major time sink, the phase command). If there are any other requests for this feature, please reactivate this issue.