sourmash-bio/sourmash

adding lineage manipulation & taxonomy reporting in more places in sourmash?

ctb opened this issue · 22 comments

ctb commented

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

The basic idea here is that we combine taxonomy with search and gather on signatures/SBTs/LCAs, by connecting identifiers with lineages -- do dynamically what sourmash lca index does once.

This falls under use cases of taxonomic filtering and reporting.

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

It might be possible to add taxonomy/lineage-aware selector functionality on signatures and databases, e.g. "run a gather on this database, but ignore all results that are not Archaea."

Should also look at sourmash lca gather to see about collapsing results taxonomically...

Thoughts welcome!

related conversations

  • the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.
  • related conversation in this comment in #581

(Sorry, wrong button)

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

Another example: this is the script I'm using for thesis work and 2020-cami: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py

It's using pandas (for summarizing at different taxonomic levels) and the taxonomy package for parsing taxonomies, and I avoided bringing this into sourmash because they are heavy-ish deps, but having a sourmash taxonomy subcommand for working with similar tasks (and bringing the lineage parsers necessary for LCA databases, for example) seems a good idea.

(Converting from an SBT to an LCA index like in https://github.com/luizirber/2020-cami/blob/f7f34d3903cd7d2b6bc2ac0471f7d53a42aa86b2/rules/build_indices.smk#L64L66 without depending on external code would be pretty cool. I guess the revindex part is already doable today?)

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

Totally agree, and I think it also hurts functionality. Deposited genomes in genbank or refseq don't change (but they might get new versions), but the accession to taxid mapping CAN change (see: recent Lactobacillus genera being split).

It is so hard to deal with the NCBI taxonomy because it changes and doesn't have stable downloads for a specific date, so while generating taxinfo and putting it into SBTs has the benefit of capturing the taxonomy at one point in time (and being very convenient to use), but the drawback of potentially being outdated.

the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.

Yup, but would also like to see some option to override it somehow (re: outdated taxonomies)

ctb commented

OK, a few more thoughts and some summarization --

  • splitting LCA/SBT hash storage from lineage storage as per #581 is a good idea. this would let us separate issues of hash storage (DNA vs protein vs dayhoff vs hp, skip-mers, abundances, and so on) out from lineage manipulation.
  • adding taxonomy utilities as a separate submodule == good. functionality could include:
    • importing NCBI and Greengenes/GTDB taxonomies; ref NCBI import utilities, #397. Maybe also LIN / genomeRxiv relevant.
    • taxonomic database exploration (basic keyword search, "how many lineages", etc)
    • taxonomic summarization of search and gather output
    • taxonomic filtering/collapse of search and gather output (discussed in #583 and #817)
  • wherever we include taxonomy, allow overriding it to provided updated identifier -> lineage functionality. this would be a benefit of the separated lineage/taxonomy database proposed in #581
  • sourmash taxonomy code could support common functionality for contamination detection/removal research projects such as sourmash-oddify, charcoal, and long read assembly decontam. Basically, we're talking about making nicer public APIs per sourmash.lca current functionality, e.g. #458. Seems on the edge of scope but our genomeRxiv funding will need this, perhaps.
  • more/better "human-focused output" per #265 is a nice goal, hard to know how to interact with NCBI and GTDB unless we can find a perma-URL identifier scheme. Presumably we'd want to add lots of live web page tests or something.

Out of scope --

  • identifier/source schemas, per #230

ETE3 is a great way for filtering/manipulating/etc. NCBI taxonomy that I've used FWIW

ctb commented

another piece of taxonomy related functionality - newick output, ref #915

ctb commented

random thoughts --

  • there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
  • check to make sure what output krona and other programs (metacoder, etc.) take in, make sure we output that.

there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.

That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)

ctb commented

yep - we have plenty of examples of this (sourmash lca summarize, for example). What I'm suggesting here most specifically is that we provide scripts that do this after gather is run, and separate out the "search for genomes" from the "summarize taxonomy" functionality in a way that currently is only done in ad hoc scripts in other projects. I have a prototype of this myself, over in charcoal.

But... that's what I described 🤣

This is exactly what happens in 2020-cami: I run gather first, and then use the gather output CSV to summarize taxonomy (taking a pre-calculated accession-to-taxid file for a specific database, because it takes a long time to process the full acc2taxid every time).
Here is the order they are called in the sourmash biobox:
https://github.com/sourmash-bio/bioboxes/blob/bf27b93d86fccc34540bf709996abc3076c00122/task_sourmash.pl#L21

ctb commented
ctb commented

I've started playing around with separating revindex and lineage information, over in dib-lab/charcoal, just_taxonomy.py.

I created a LineageDB class that just manages the lineages, and then refactored a few functions to use that instead of the internal lineage handling in LCA_Database.

A few observations from this strategy -

  • we need some new functions on LCA_Database - in particular, get_idents_for_hashval - to provide intermediaries. That can be added without a new major version of sourmash.
  • it's not immediately clear to me how to handle loading/saving of lineage DBs to JSON, and if/how to encapsulate LineageDB within LCA_Database, but I have some thoughts. Will play.
ctb commented

here, from silva, is a nice example of multiple taxonomies being available in a database.

With SILVA release 138 the Opens external link in new windowGenome Taxonomy Database (GTDB) has been adopted (Parks 2018). As a consequence of our efforts the following groups were prone to significant adaptations: Archaea, Enterobacterales, Deltaproteobacteria, Firmicutes, Clostridia. Betaproteobacteriales (formerly known as Betaproteobacteria) is now Burkholderiales, an order of Gammaproteobacteria. Epsilonproteobacteria vanishes within a new phylum Campilobacterota. Tenericutes are gone, they are now all part of Bacilli, inside Firmicutes.

Additionally, every sequence in the SILVA datasets carries the EMBL-EBI/ENA taxonomy assignment. Where available, the RDP and GTDB taxonomies are added for comparison. In releases <138 also the greengenes taxonomy was added.

ctb commented

mmseqs2 does nice stuff with multiple taxonomies, it looks like - search for "taxonomy" on https://github.com/soedinglab/mmseqs2/wiki

I was talking with @bluegenes and after checking @erikyoung85 work in #1131 and the LineageDB from charcoal that @ctb mentioned, I would like to propose a few things:

  • Create TaxInfo, which will be pretty close to what LineageDB is doing in charcoal.
  • I think we need to keep lid_to_lineage and ident_to_lid in the saved files, and reconstruct/cache other useful derivations (lineage_to_lid, the inverse of lid_to_lineage, is one example).
  • SBT and LCA indices grab info from TaxInfo by ident. They are free to use other mappings they need (ident_to_name, idx_to_lid in the LCA case, for example).
  • Query TaxInfo by ident or by lid. Ideally just by ident, to avoid consistency problems if an ident -> lid mapping changes.
  • TaxInfo objects should be optional in an index.

On the Rust side it would be something like

type LineagePairs = BTreeMap<String, String>;

pub struct TaxInfo {
    lid_to_lineage: HashMap<u32, LineagePairs>
    ident_to_lid: HashMap<String, u32>
}

impl TaxInfo {
  fn get_lineage(&self, ident: &str) -> Option<LineagePairs> {}
}

And I think it can be serialized/deserialized with some processing to the current format.

On the Python side it will be similar to LineageDB.


Feature-wise, I think it is also important to be able to override a TaxInfo in the command line (or provide a different one for Index in Python/Rust). Especially since we "precompute" some desired ranks for LCA today, and they might not match other use cases (the CAMI profile format allows providing your own rank, for example). Critically, the important info in the TaxInfo class is ident, and as long as the idents are the same then two different TaxInfo objects can be interchangeable. This is a very lax definition, but I think it also fits well with other taxonomies (since, fundamentally, LineagePairs can be any list of (rank, value) that the user desires.

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?

$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}

Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

ctb commented

re #969 (comment), I am ok with doing the Rust API so that that is easy, but I would suggest getting the current functionality oxidized and merged first before a split. Otherwise you end up with too many moving parts.

ctb commented

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?

$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}

Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

yes, lineage is optional in LCA databases.

ctb commented

random aside - splitting revindex and taxonomy would mean that we can update taxonomy without updating revindex, which currently is impossible in LCA databases.

ctb commented

(an interesting short research paper might be to compare and contrast gather and LCA approaches on the same database directly)

(and also GTDB and NCBI taxonomy result comparisons by finding the same k-mers, then applying different taxonomies)

ctb commented

In #1808, I'm adding SqliteIndex, which could also support LCA-style functionality as well as storing taxonomy information more generally, all in the same file; revisiting #1651, this is just the taxonomy table defined in tax/tax_utils.py.