Local sources of truth for common genome assembly metadata
Closed this issue · 0 comments
Problem
The problem with chromsizes
files in the wild
- Genome assemblies include multiple assembly units, and different sequence roles or types besides fully assembled molecules (unlocalized, unplaced, alt, patch, decoy). For many users, most of these types are not relevant for downstream analysis.
- Additionally, UCSC, Ensembl and other providers derive slightly different flavors of a common reference assembly (e.g. GRCh37 vs hg19, GRCh38 vs hg38 vs hg38Analysis).
- Providers also use multiple aliases for scaffold names.
- Providers also do not give a convention for scaffold ordering. UCSC tends to sort all scaffolds in descending order of size.
There is no centralized and standardized source of truth for any of these metadata. Some sources that come close are:
- NCBI has various statistics files like this, with inconsistencies from assembly to assembly.
- UCSC has pages with links to info for its assemblies in the hgGateway, such as
chromInfo.txt
andchromAlias.txt
. Again, not all assemblies have the same files.
Solution
Seqinfo files
We have started to compile metadata files (currently *.seqinfo.tsv
) for a handful of common genome assemblies: https://github.com/open2c/bioframe/tree/main/bioframe/io/data
These seqinfo files have the following columns:
- A canonical scaffold name (normally UCSC-style)
- Scaffold length
- Scaffold type: assembled, unlocalized, unplaced, alt
- Scaffold molecule: canonical name for the molecule (e.g. unlocalized scaffold
chr1_KI270706v1_random
comes from moleculechr1
) - Assembly unit: primary, non-nuclear, decoy, etc.
- A comma-separated list of aliases for the scaffold.
We also provide a canonical ordering of scaffolds. Currently, we use these rules:
- Fully-assembled molecules from the primary assembly in natural numeric order.
- Fully-assembled molecules from the non-nuclear assembly (mitochondria, plastids), also in natural order.
- Unlocalized scaffolds, sorted as above.
- Unplaced scaffolds, sorted as above.
- Decoy scaffolds, sorted as above.
Currently, we deliberately exclude alternate and patch sequences from seqinfo files. As such, we will not track assembly patch releases.
Open questions
How to expose seqinfo files as local database?
bioframe.read_table("path/to/assembly.seqinfo.tsv", schema='seqinfo')
bioframe.get_seqinfo("hg38")
bioframe.get_chromsizes("GRCm39" , unit=("primary", "non-nuclear"), type=("assembled",))
bioframe.get_aliasdict("hg38") # maps various alias names to the canonical name
Handling quirks between provider's assembly versions
These functions should dispatch assembly aliases (e.g. mm10, GRCm38) to the same common seqinfo database.
Weird caveats, like the different mitochondrial sequences used in hg19 vs GRCh37 should be documented somewhere.
- For example, right now, we are using ad hoc ways to harmonize hg19 and GRCh37 rather than treat them as different assemblies. That's mostly because UCSC recently added the revised mitochondrial sequence to hg19 as an additional sequence called
chrMT
. Accordingly, we includedchrMT
as part of an assembly unit callednon-nuclear-revised
. - Another example is UCSC's
hg38Analysis
set, which includes an Epstein-Barr virus sequence as decoy. At the moment, we have decided to include it in thehg38
seqinfo file as part of a "decoy" (pseudo) assembly unit.