assemblerflow/flowcraft

Output assembly graph in GFA format [feature request]

sjackman opened this issue · 14 comments

I'd love for Flowcraft to output an assembly graph in GFA format from the assemblers that are capable of it, SPAdes in particular.

This can be easily added, I think. In the case of SPAdes, both the GFA and FASTG files are already generated, so it's just a matter of adding them to the publishDir. Of the other assemblers currently supported in flowcraft, I think only megahit is able to produce the assembly graph data, though it requires some processing (described in https://github.com/voutcn/megahit/wiki/Visualizing-MEGAHIT's-contig-graph). We could create a small process that receives the contigs for each kmer size and adds the FASTG output to the publish dir.

In either case, the final output would be published as:

publishDir "reports/assembly/spades_{{ pid }}/${sample_id}"
# Or, if we rename the GFA/FASTG files
publishDir "reports/assembly/spades_{{ pid }}/"
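
The small megahit process mentioned above could be sketched roughly as follows. This is only an illustration, not the code in flowcraft: the helper name `fastg_commands` is hypothetical, and it assumes that megahit names its per-kmer contig files `k<kmer>.contigs.fa` and that the `megahit_toolkit contig2fastg <kmer> <contigs>` call from the linked wiki page writes the FASTG to stdout.

```python
import re
from pathlib import PurePosixPath

# Assumption: megahit names per-kmer contig files "k<kmer>.contigs.fa".
KMER_FILE = re.compile(r"^k(\d+)\.contigs\.fa$")


def fastg_commands(contig_files):
    """Return (argv, output_name) pairs, one per per-kmer contig file.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    jobs = []
    for name in contig_files:
        path = PurePosixPath(name)
        match = KMER_FILE.match(path.name)
        if match is None:
            continue  # skip anything that is not a per-kmer contig file
        kmer = match.group(1)
        # Command form taken from the megahit wiki page linked above.
        argv = ["megahit_toolkit", "contig2fastg", kmer, str(path)]
        jobs.append((argv, f"k{kmer}.fastg"))
    # Order the jobs by increasing kmer size.
    return sorted(jobs, key=lambda job: int(job[0][2]))
```

Each returned `argv` could then be run with `subprocess.run`, redirecting stdout to the matching `.fastg` name, before the resulting files are picked up by the publishDir pattern.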

Excellent! Would you like to point me in the right direction (file and line) to open a PR to add this feature?

I've found the line that needs to be modified:

publishDir 'results/assembly/spades_{{ pid }}/', pattern: '*_spades*.fasta', mode: 'copy'

Can you elaborate on what you mean by "Or, if we rename the GFA/FASTG files"?

Yes, that is exactly the place. We've started a PR for that (#139), and it already contains the spades modification. It's only missing megahit, where we'll need to add a small process to generate the fastg file (hopefully during the weekend).

What I meant is that, by default, files generated by spades (or by nextflow components in general) don't need to have a unique name. That only becomes necessary when you need to merge files from different samples or publish them, which is the case here. In the PR, I've kept the original file names, but published the results to the results/assembly/spades_<pid>/<sample_id>/ folder.
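
To illustrate the two naming options, here is a minimal sketch of how the published paths would differ. The function names are hypothetical and are not part of flowcraft; the file name `assembly_graph.fastg` is used as an example of a SPAdes output that is identical for every sample.

```python
from pathlib import PurePosixPath


def per_sample_path(pid, sample_id, filename, root="results/assembly"):
    """Keep the original file name and separate samples by folder."""
    return PurePosixPath(root) / f"spades_{pid}" / sample_id / filename


def renamed_path(pid, sample_id, filename, root="results/assembly"):
    """Publish to one shared folder and prefix the file name with the sample id."""
    return PurePosixPath(root) / f"spades_{pid}" / f"{sample_id}_{filename}"


# Two samples produce a file with the same name, yet neither option collides.
a = per_sample_path(1, "sampleA", "assembly_graph.fastg")
b = per_sample_path(1, "sampleB", "assembly_graph.fastg")
print(a)  # results/assembly/spades_1/sampleA/assembly_graph.fastg
print(b)  # results/assembly/spades_1/sampleB/assembly_graph.fastg
```

The first option leaves the assembler's own file names untouched, which is the approach described for the PR; the second keeps everything in a single folder at the cost of renaming.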

Excellent! I'm planning on creating a module for Unicycler (short read only for now) and model it after the SPAdes module.

Oh, and ABySS too, which can also output GFA!

Awesome! Looking forward to those additions!

I've added the option to convert and retrieve the fastg files from megahit. As far as I'm aware, there's no option to convert the fastg files to GFA with the megahit toolkit, but if we stumble upon a way to do it, it can easily be added to the process. As a warning, I managed to fill 1.7 TB of disk space with the fastg files while testing, so use this option with caution! :P

The request is now live in dev and should head to master in the next release 🎉

Excellent! There's this tool from @lh3 to convert FASTG to GFA. I haven't used it myself.
https://github.com/lh3/gfa1/blob/master/misc/fastg2gfa.c

Wow, 1.7 TB! I managed to create a >1 TB PAF.gz file the other day when aligning 18 flow cells of Nanopore reads to themselves with minimap2. I believe that was the first time that I broke a terabyte with a single (compressed!) file.

I tested out the GFA output of SPAdes, and it works! Thank you!
I expected to find the GFA file in results, and I was surprised to instead find it in reports. Is there a reason for it to be in reports rather than results, and could it be moved to results?

That is indeed a bug! Fixing now! 😺

GFA files are now stored under results/ as intended.

Thanks, Inês!