open2c/bioframe

Local sources of truth for common genome assembly metadata

Closed this issue · 0 comments

Problem

The problem with chromsizes files in the wild

  • Genome assemblies include multiple assembly units, and different sequence roles or types besides fully assembled molecules (unlocalized, unplaced, alt, patch, decoy). For many users, most of these types are not relevant for downstream analysis.
  • Additionally, UCSC, Ensembl and other providers derive slightly different flavors of a common reference assembly (e.g. GRCh37 vs hg19, GRCh38 vs hg38 vs hg38Analysis).
  • Providers also use multiple aliases for scaffold names.
  • Providers also do not give a convention for scaffold ordering. UCSC tends to sort all scaffolds in descending order of size.

There is no centralized and standardized source of truth for any of these metadata. Some sources that come close are:

  • NCBI has various statistics files like this, with inconsistencies from assembly to assembly.
  • UCSC has pages with links to info for its assemblies in the hgGateway, such as chromInfo.txt and chromAlias.txt. Again, not all assemblies have the same files.

Solution

Seqinfo files

We have started to compile metadata files (currently *.seqinfo.tsv) for a handful of common genome assemblies: https://github.com/open2c/bioframe/tree/main/bioframe/io/data

These seqinfo files have the following columns:

  • A canonical scaffold name (normally UCSC-style)
  • Scaffold length
  • Scaffold type: assembled, unlocalized, unplaced, alt
  • Scaffold molecule: canonical name for the molecule (e.g. unlocalized scaffold chr1_KI270706v1_random comes from molecule chr1)
  • Assembly unit: primary, non-nuclear, decoy, etc.
  • A comma-separated list of aliases for the scaffold.

We also provide a canonical ordering of scaffolds. Currently, we use these rules:

  1. Fully-assembled molecules from the primary assembly in natural numeric order.
  2. Fully-assembled molecules from the non-nuclear assembly (mitochondria, plastids), also in natural order.
  3. Unlocalized scaffolds, sorted as above.
  4. Unplaced scaffolds, sorted as above.
  5. Decoy scaffolds, sorted as above.

Currently, we deliberately exclude alternate and patch sequences from seqinfo files. As such, we will not track assembly patch releases.

Open questions

How to expose seqinfo files as local database?

bioframe.read_table("path/to/assembly.seqinfo.tsv", schema='seqinfo')
bioframe.get_seqinfo("hg38")
bioframe.get_chromsizes("GRCm39" , unit=("primary", "non-nuclear"), type=("assembled",))
bioframe.get_aliasdict("hg38")  # maps various alias names to the canonical name

Handling quirks between provider's assembly versions

These functions should dispatch assembly aliases (e.g. mm10, GRCm38) to the same common seqinfo database.

Weird caveats, like the different mitochondrial sequences used in hg19 vs GRCh37 should be documented somewhere.

  • For example, right now, we are using ad hoc ways to harmonize hg19 and GRCh37 rather than treat them as different assemblies. That's mostly because UCSC recently added the revised mitochondrial sequence to hg19 as an additional sequence called chrMT. Accordingly, we included chrMT as part of an assembly unit called non-nuclear-revised.
  • Another example is UCSC's hg38Analysis set, which includes an Epstein-Barr virus sequence as decoy. At the moment, we have decided to include it in the hg38 seqinfo file as part of a "decoy" (pseudo) assembly unit.