peterjc/galaxy_mira

MIRA v4.0 de novo assembler does not output a collection for collection input

tshtatland opened this issue · 8 comments

When used in a workflow, the MIRA v4.0 de novo assembler does not output a collection, as I expected it would when the input is a collection. I am attaching a workflow screenshot with red and green arrows that highlight the issue. I am using the latest version of the tool on a local Galaxy installation (v17.05):
MIRA v4.0 de novo assember Takes Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads (Galaxy Version 0.0.11)
Galaxy Tool Shed - https://toolshed.g2.bx.psu.edu/repository?repository_id=efe8c48b382cb9cc&changeset_revision=1713289d9908
I expected (perhaps incorrectly) that most Galaxy tools, such as the MIRA assembler, are designed so that a collection input (N fastq/fasta files, each with millions of reads) produces a collection output (N fasta files, each with a small number of contigs), as shown in the screenshot. N is the number of biologically distinct samples/libraries. From my point of view, most Galaxy tools should not "reduce" their inputs. The reducing step should probably be done later by a simple dedicated reducing tool; otherwise the combined tool is not collection-friendly. I wonder if this naive view makes sense... Thank you!
[Screenshot: mira_collection_in_single_out]

MIRA will happily take multiple FASTQ inputs, e.g. an organism sequenced with multiple libraries or runs, so yes, the tool does deliberately cater to mapping N input files to one assembly (i.e. "reduce" mode).

I'd hope the Galaxy GUI would allow you to deliberately choose to run N copies of MIRA instead (i.e. "map" mode), giving N assemblies.

Paging @jmchilton as the Galaxy collections expert.

Thank you for the quick response! Let me also use this opportunity to thank you for writing and maintaining MIRA tools, we use them frequently in our Galaxy instance!
I asked about this issue yesterday on the Galaxy gitter channel: https://gitter.im/Galaxy-Training-Network/Lobby
Perhaps I do not understand something fundamental here, but many Galaxy tools, such as spades (also an assembler) and cat (which concatenates M input files into 1 output file), are collection-friendly. An example workflow with spades (compared to MIRA as is) is shown in the screenshot attached above. Below I am attaching a screenshot of a similar workflow, where I also added the "cat" tool for comparison and a version of MIRA with hacked XML. Everything seems collection-friendly now. As you can see, multiple inputs go into 1 output, but the tool can still be used on a collection. This is the desired goal for me. However, the hack is not production-ready. It is based on the suggestion from Bjoern:

Remove the multiple=True
and the for loop in the command section

[Screenshot: mira_collection_in_single_out_spades_cat]

The MIRA wrapper is collection aware (via multiple="true" on the input parameter which @bgruening mentioned). I would infer that the Spades wrapper is not collection aware, which would explain why you (only) get the default N inputs to N jobs behaviour from Galaxy (useful and practical for single input tools).

I've not had a chance to play with the interface to confirm how you'd run MIRA to get N jobs from N inputs, but I would expect the collections input control to allow that.

I think two different concepts are clashing here.
@peterjc every tool is collection aware in a sense: if you don't do any magic in your tool description, Galaxy will simply start X jobs for X datasets. This is probably what @tshtatland wants/expects.

By "collection aware" you are probably referring to multiple=True. That is true, but it means that a single job from this tool will consume X datasets at once. As a result, the current Mira wrapper cannot iterate over a collection.
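
The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustrative model, not Galaxy code: `plan_jobs` and its `multiple` flag are hypothetical names standing in for how an input with or without multiple=True ends up grouping datasets into jobs.

```python
from typing import List


def plan_jobs(datasets: List[str], multiple: bool) -> List[List[str]]:
    """Return one list of input datasets per job (hypothetical model, not Galaxy's API)."""
    if multiple:
        # "Reduce" behaviour: a single job consumes every dataset at once.
        return [list(datasets)]
    # "Map" behaviour: one job per dataset, so N inputs give N outputs.
    return [[d] for d in datasets]


samples = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
print(len(plan_jobs(samples, multiple=False)))  # 3 jobs, 3 assemblies
print(len(plan_jobs(samples, multiple=True)))   # 1 job, 1 assembly
```

In this model the choice is fixed by the tool description, which is why the user cannot pick "map" mode for a multiple=True input.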

So both approaches are correct, but they imply different UX. One solution would be to remove multiple=True from Mira and rely on the fact that people can merge FASTA files before they start the assembly, if they really need to provide multiple FASTA files to the assembly.
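
Merging the inputs beforehand could be as simple as concatenating the files, which is what a "cat"-style step does. A minimal Python sketch, assuming uncompressed FASTA or FASTQ text files that each end with a newline; the `merge_reads` helper is hypothetical and not part of any wrapper:

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_reads(inputs: Iterable[Path], output: Path) -> Path:
    """Concatenate uncompressed FASTA/FASTQ files into one file (hypothetical helper)."""
    with output.open("wb") as out_handle:
        for path in inputs:
            with path.open("rb") as in_handle:
                # Stream in chunks so large read files are not loaded into memory.
                # Assumption: every input ends with a newline, so records stay separate.
                shutil.copyfileobj(in_handle, out_handle)
    return output
```

For example, `merge_reads([Path("lib1.fasta"), Path("lib2.fasta")], Path("merged.fasta"))` would write both libraries into a single file that a one-input tool could then consume. The file names here are made up.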

I think this is a Galaxy UI limitation for tools with multiple="true", and I would welcome comment from @jmchilton - can Galaxy really not iterate over a collection in this situation?

I'd like to try this locally with Spades - @tshtatland which Spades wrapper are you using? Can you tell me the Tool Shed URL (as there are at least two different wrappers available)?

can Galaxy really not iterate over a collection in this situation

Correct - it cannot currently. There was a Trello card for this, but I cannot find a corresponding issue on GitHub, so I've created galaxyproject/galaxy#4623. I included workarounds you can add to Mira if you want the tool to support both modes of operation. Certainly some Galaxy developers would discourage those workarounds - but I tend to be a bit more pragmatic, I think.

Thanks John - this is a very difficult set of concepts to convey to the user, so I can understand why it isn't in the current Galaxy UI.

The suggestions you've given for workarounds make sense, but I think they would break backward compatibility with the current versions of the MIRA wrapper. Given that, it would be nice for me to take more direct advantage of the paired collection infrastructure as part of any changes to the input handling.

This is the spades tool that corresponds to the screenshot above:
spades SPAdes genome assembler for regular and single-cell projects (Galaxy Version 1.0)
https://toolshed.g2.bx.psu.edu/repository?repository_id=6a122c80d3c9733e
https://toolshed.g2.bx.psu.edu/view/lionelguy/spades/21734680d921