merenlab/anvio

[FEATURE REQUEST] adding contigs database names to deflines of exported genes/proteins fasta

Closed this issue · 3 comments

The need

Currently, when exporting gene/protein fasta files from genomes using anvi-get-sequences-for-gene-calls, the identifiers are just numbers, whereas in the GFF export the IDs are in the format <contigs_db_name>___. I would love for these to be consistent, and strongly prefer the second format with the three underscores.

The solution

Either change the default of how locus tags are exported or, perhaps more elegantly, add a --name-in-defline (or something like that) option to anvi-get-sequences-for-gene-calls in non-GFF mode.

Beneficiaries

Those who export sequences from many files, and don't want all of them to start at "0"

Thanks for this, @dspeth. I think it is time we think about a flexible way to export defline information for all FASTA files. In an ideal world, the user should be able to specify exactly how they would like the defline of their FASTA file to look. For instance, if the programs that export FASTA files had a flag like --defline, we could use it the following way:

(...) --defline '{genome_name}_{gene_caller_id} {gene_function}'

And it would give a FASTA file that looks like this:

>HTCC1060_3 COG:XX;KOfam:YY;PFam:ZZ
(...)

Versus,

(...) --defline '{gene_caller_id} {genome_name}'

Would yield something like,

>3 genome_name:HTCC1060
(...)

Versus,

(...) --defline '{gene_caller_id}'

would yield,

>3
(...)

The technical problem here is that deflines are defined on the fly in many places in the code. We could implement a global flag, --list-defline-options, that is caught wherever the code crafts deflines and reports the 'keys' that can be used in that specific context. The user would then pass those keys to a global --defline flag, captured in the same context, to override that context's defaults.
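To make the idea concrete, here is a minimal sketch of how a context could render a user-supplied defline template and report its available keys. Everything in it is hypothetical: the function names, the default template, and the example values are illustrations only, not existing anvi'o code.

```python
import string

# Hypothetical default for one context; each place that crafts deflines
# would carry its own default and its own set of available keys.
DEFAULT_DEFLINE = "{gene_caller_id}"


def template_keys(template):
    """Return the field names used in a str.format-style template."""
    return [field for _, field, _, _ in string.Formatter().parse(template) if field]


def list_defline_options(context_values):
    """What a --list-defline-options handler could report for one context."""
    return sorted(context_values)


def render_defline(context_values, template=DEFAULT_DEFLINE):
    """Fill the template with this context's values, rejecting unknown keys."""
    unknown = [key for key in template_keys(template) if key not in context_values]
    if unknown:
        raise ValueError(
            f"Unknown defline key(s) {unknown}; "
            f"available keys: {list_defline_options(context_values)}"
        )
    return ">" + template.format(**context_values)


if __name__ == "__main__":
    # Illustrative values for a single gene call.
    gene = {
        "genome_name": "HTCC1060",
        "gene_caller_id": 3,
        "gene_function": "COG:XX;KOfam:YY;PFam:ZZ",
    }
    print(list_defline_options(gene))
    print(render_defline(gene, "{genome_name}_{gene_caller_id} {gene_function}"))
    print(render_defline(gene))
```

Rejecting unknown keys up front, rather than letting the formatting call fail, lets the error message point the user at the keys that are valid in that context.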

I'll try to think about this more once I have time, but if someone wants to take a stab, they should feel free to do so in the meantime :)

Hi Meren,
your proposed general solution is much more expansive than what I suggested. If that's doable, I'd of course be happy with that. Otherwise, having the option to treat the defline the same as in the GFF output, so that there's internal consistency within anvi-get-sequences-for-gene-calls, would also already be a small step forward.

That said, I get the desire to do this once, and do it right, rather than hacking a patch together. It's also not urgent from my side. So far I've just postprocessed the fasta headers, but that seems like an unnecessary step.
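For reference, that kind of postprocessing can be sketched in a few lines. This assumes the desired header is the contigs database name, then three underscores, then the identifier already present in the file. The function name and the example name are hypothetical.

```python
def prefix_deflines(lines, name, sep="___"):
    """Yield FASTA lines with `name` + `sep` prepended to every header."""
    for line in lines:
        if line.startswith(">"):
            # Keep whatever followed '>' and put the name in front of it.
            yield f">{name}{sep}{line[1:]}"
        else:
            # Sequence lines pass through unchanged.
            yield line


if __name__ == "__main__":
    fasta = [">0\n", "ATGAAA\n", ">1\n", "ATGCCC\n"]
    print("".join(prefix_deflines(fasta, "HTCC1060")), end="")
```

Because the function takes any iterable of lines, an open file handle can be passed in directly and the result written out line by line.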

This branch is finally merged. Thank you again for your input and guidance, @dspeth.