adding lineage manipulation & taxonomy reporting in more places in sourmash?
ctb opened this issue · 22 comments
We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).
The basic idea here is that we combine taxonomy with search and gather on signatures/SBTs/LCAs, by connecting identifiers with lineages -- do dynamically what sourmash lca index
does once.
This falls under use cases of taxonomic filtering and reporting.
What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search
and gather
and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.
I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.
It might be possible to add taxonomy/lineage-aware selector functionality on signatures and databases, e.g. "run a gather on this database, but ignore all results that are not Archaea."
Should also look at sourmash lca gather to see about collapsing results taxonomically...
Thoughts welcome!
related conversations
- the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.
- related conversation in this comment in #581
(Sorry, wrong button)
We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).
What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by
sourmash search
andgather
and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.
Another example: this is the script I'm using for thesis work and 2020-cami: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py
It's using pandas
(for summarizing at different taxonomic levels) and the taxonomy
package for parsing taxonomies, and I avoided bringing this into sourmash because they are heavy-ish deps, but having a sourmash taxonomy
subcommand for working with similar tasks (and bringing the lineage parsers necessary for LCA databases, for example) seems a good idea.
(Converting from an SBT to an LCA index like in https://github.com/luizirber/2020-cami/blob/f7f34d3903cd7d2b6bc2ac0471f7d53a42aa86b2/rules/build_indices.smk#L64L66 without depending on external code would be pretty cool. I guess the revindex
part is already doable today?)
I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.
Totally agree, and I think it also hurts functionality. Deposited genomes in genbank or refseq don't change (but they might get new versions), but the accession to taxid mapping CAN change (see: recent Lactobacillus genera being split).
It is so hard to deal with the NCBI taxonomy because it changes and doesn't have stable downloads for a specific date, so while generating taxinfo and putting it into SBTs has the benefit of capturing the taxonomy at one point in time (and being very convenient to use), but the drawback of potentially being outdated.
the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.
Yup, but would also like to see some option to override it somehow (re: outdated taxonomies)
OK, a few more thoughts and some summarization --
- splitting LCA/SBT hash storage from lineage storage as per #581 is a good idea. this would let us separate issues of hash storage (DNA vs protein vs dayhoff vs hp, skip-mers, abundances, and so on) out from lineage manipulation.
- adding taxonomy utilities as a separate submodule == good. functionality could include:
- importing NCBI and Greengenes/GTDB taxonomies; ref NCBI import utilities, #397. Maybe also LIN / genomeRxiv relevant.
- taxonomic database exploration (basic keyword search, "how many lineages", etc)
- taxonomic summarization of search and gather output
- taxonomic filtering/collapse of search and gather output (discussed in #583 and #817)
- wherever we include taxonomy, allow overriding it to provided updated identifier -> lineage functionality. this would be a benefit of the separated lineage/taxonomy database proposed in #581
- sourmash taxonomy code could support common functionality for contamination detection/removal research projects such as sourmash-oddify, charcoal, and long read assembly decontam. Basically, we're talking about making nicer public APIs per
sourmash.lca
current functionality, e.g. #458. Seems on the edge of scope but our genomeRxiv funding will need this, perhaps. - more/better "human-focused output" per #265 is a nice goal, hard to know how to interact with NCBI and GTDB unless we can find a perma-URL identifier scheme. Presumably we'd want to add lots of live web page tests or something.
Out of scope --
- identifier/source schemas, per #230
ETE3 is a great way for filtering/manipulating/etc. NCBI taxonomy that I've used FWIW
random thoughts --
- there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
- check to make sure what output krona and other programs (metacoder, etc.) take in, make sure we output that.
there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)
yep - we have plenty of examples of this (sourmash lca summarize, for example). What I'm suggesting here most specifically is that we provide scripts that do this after gather is run, and separate out the "search for genomes" from the "summarize taxonomy" functionality in a way that currently is only done in ad hoc scripts in other projects. I have a prototype of this myself, over in charcoal.
But... that's what I described 🤣
This is exactly what happens in 2020-cami: I run gather first, and then use the gather output CSV to summarize taxonomy (taking a pre-calculated accession-to-taxid file for a specific database, because it takes a long time to process the full acc2taxid
every time).
Here is the order they are called in the sourmash biobox:
https://github.com/sourmash-bio/bioboxes/blob/bf27b93d86fccc34540bf709996abc3076c00122/task_sourmash.pl#L21
I've started playing around with separating revindex and lineage information, over in dib-lab/charcoal, just_taxonomy.py.
I created a LineageDB
class that just manages the lineages, and then refactored a few functions to use that instead of the internal lineage handling in LCA_Database
.
A few observations from this strategy -
- we need some new functions on
LCA_Database
- in particular,get_idents_for_hashval
- to provide intermediaries. That can be added without a new major version of sourmash. - it's not immediately clear to me how to handle loading/saving of lineage DBs to JSON, and if/how to encapsulate LineageDB within LCA_Database, but I have some thoughts. Will play.
here, from silva, is a nice example of multiple taxonomies being available in a database.
With SILVA release 138 the Opens external link in new windowGenome Taxonomy Database (GTDB) has been adopted (Parks 2018). As a consequence of our efforts the following groups were prone to significant adaptations: Archaea, Enterobacterales, Deltaproteobacteria, Firmicutes, Clostridia. Betaproteobacteriales (formerly known as Betaproteobacteria) is now Burkholderiales, an order of Gammaproteobacteria. Epsilonproteobacteria vanishes within a new phylum Campilobacterota. Tenericutes are gone, they are now all part of Bacilli, inside Firmicutes.
Additionally, every sequence in the SILVA datasets carries the EMBL-EBI/ENA taxonomy assignment. Where available, the RDP and GTDB taxonomies are added for comparison. In releases <138 also the greengenes taxonomy was added.
mmseqs2 does nice stuff with multiple taxonomies, it looks like - search for "taxonomy" on https://github.com/soedinglab/mmseqs2/wiki
note gather-to-tax.py
I was talking with @bluegenes and after checking @erikyoung85 work in #1131 and the LineageDB
from charcoal that @ctb mentioned, I would like to propose a few things:
- Create
TaxInfo
, which will be pretty close to whatLineageDB
is doing in charcoal. - I think we need to keep
lid_to_lineage
andident_to_lid
in the saved files, and reconstruct/cache other useful derivations (lineage_to_lid
, the inverse oflid_to_lineage
, is one example). - SBT and LCA indices grab info from
TaxInfo
byident
. They are free to use other mappings they need (ident_to_name
,idx_to_lid
in the LCA case, for example). - Query
TaxInfo
byident
or bylid
. Ideally just byident
, to avoid consistency problems if anident -> lid
mapping changes. TaxInfo
objects should be optional in an index.
On the Rust side it would be something like
type LineagePairs = BTreeMap<String, String>;
pub struct TaxInfo {
lid_to_lineage: HashMap<u32, LineagePairs>
ident_to_lid: HashMap<String, u32>
}
impl TaxInfo {
fn get_lineage(&self, ident: &str) -> Option<LineagePairs> {}
}
And I think it can be serialized/deserialized with some processing to the current format.
On the Python side it will be similar to LineageDB
.
Feature-wise, I think it is also important to be able to override a TaxInfo in the command line (or provide a different one for Index in Python/Rust). Especially since we "precompute" some desired ranks for LCA today, and they might not match other use cases (the CAMI profile format allows providing your own rank, for example). Critically, the important info in the TaxInfo
class is ident
, and as long as the ident
s are the same then two different TaxInfo
objects can be interchangeable. This is a very lax definition, but I think it also fits well with other taxonomies (since, fundamentally, LineagePairs
can be any list of (rank, value)
that the user desires.
I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json
broken? I was planning to use ident_to_idx
and idx_to_lid
to generate ident_to_lid
, but idx_to_lid
seems to be missing the mapping for idx=1
?
$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
"TARA_ASE_MAG_00031": 0,
"TARA_PSW_MAG_00136": 1
}
$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
"0": 0
}
Or should lineage be optional? That is OK too, I guess. The ident
would not be included in any taxonomic commands (classify
, lca summarize
and so on)
re #969 (comment), I am ok with doing the Rust API so that that is easy, but I would suggest getting the current functionality oxidized and merged first before a split. Otherwise you end up with too many moving parts.
I was trying to process an existing LCA index with this proposed format, and... is
tests/test-data/lca/delmont-1.lca.json
broken? I was planning to useident_to_idx
andidx_to_lid
to generateident_to_lid
, butidx_to_lid
seems to be missing the mapping foridx=1
?$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json { "TARA_ASE_MAG_00031": 0, "TARA_PSW_MAG_00136": 1 } $ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json { "0": 0 }
Or should lineage be optional? That is OK too, I guess. The
ident
would not be included in any taxonomic commands (classify
,lca summarize
and so on)
yes, lineage is optional in LCA databases.
random aside - splitting revindex and taxonomy would mean that we can update taxonomy without updating revindex, which currently is impossible in LCA databases.
(an interesting short research paper might be to compare and contrast gather and LCA approaches on the same database directly)
(and also GTDB and NCBI taxonomy result comparisons by finding the same k-mers, then applying different taxonomies)