OMG: An Open MetaGenomic Dataset

Usage | Format | Citing

The OMG is a 3.1T base pair metagenomic pretraining dataset, combining EMBL's MGnify and JGI's IMG databases. The combined data is pre-processed into a mixed-modality dataset, with translated amino acids for protein coding sequences, and nucleic acids for intergenic sequences.

We make three datasets available on the HuggingFace Hub:

OMG: A mixed-modality dataset containing 3.3B protein sequences and 2.8B intergenic sequences.
OG: A subset of OMG consisting of high quality genomes with taxonomic information. Contains 0.4B protein sequences and 0.3B intergenic sequences.
OMG_prot50: A protein-only dataset generated by clustering OMG at 50% sequence identity, resulting in 207M protein sequences.

Usage

Install the HuggingFace Datasets library:
pip install datasets

Download OMG from the HuggingFace Hub:

import datasets

ds = datasets.load_dataset('tattabio/OMG')

To preview the dataset without downloading, load in streaming mode:

import datasets

ds = datasets.load_dataset('tattabio/OMG', streaming=True)['train']
print(next(iter(ds)))
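
To look at more than one row in streaming mode, the iterator can be sliced lazily instead of calling `next` repeatedly. The sketch below uses `itertools.islice` from the standard library. The helper name `preview_rows` is hypothetical, and a small stand-in generator replaces the real stream so the snippet runs offline.

```python
from itertools import islice


def preview_rows(rows, n=3):
    """Return the first n rows of any iterable without consuming the rest."""
    return list(islice(rows, n))


# Stand-in for the streaming dataset so this runs without network access.
# With the real data, pass the `ds` object from the streaming example above.
fake_stream = ({"CDS_seqs": [f"M{i}"], "IGS_seqs": []} for i in range(10))

first_rows = preview_rows(fake_stream, n=3)
print(len(first_rows))
```

Because `islice` stops after `n` items, only the rows actually requested are pulled from the stream.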

An example of tokenizing the dataset is provided in scripts/tokenize.py.

cd scripts
python tokenize.py --dataset_name=tattabio/OMG

Format

Each row of the dataset represents a genomic scaffold, as an ordered list of amino acid coding sequences (CDS) and nucleotide intergenic sequences (IGS).

| Feature | Description | Example |
|---|---|---|
| `CDS_seqs` | A list of strings representing the amino acid CDS sequences. | `['MALTKVEKRNR...', 'MLGIDNIERVK...', 'MATIKVKQVR...', 'MNLSNIKPAS...']` |
| `IGS_seqs` | A list of strings representing the nucleotide IGS sequences. | `['AATTTAAGGAA', 'TTTTAAAAGTATCGAAAT', 'TTTTTAAAGAAAA']` |
| `CDS_position_ids` | A list of integers representing the position of each CDS element in the scaffold. | `[1, 3, 5, 6]` |
| `IGS_position_ids` | A list of integers representing the position of each IGS element in the scaffold. | `[0, 2, 4]` |
| `CDS_ids` | A list of string identifiers for each CDS element. | `['7000000126\|C1821366\|CDS\|gene_115413\|+\|84:437', '7000000126\|C1821366\|CDS\|gene_115414\|+\|456:977', '7000000126\|C1821366\|CDS\|gene_115415\|+\|991:1167', '7000000126\|C1821366\|CDS\|gene_115416\|+\|1168:1689']` |
| `IGS_ids` | A list of string identifiers for each IGS element. | `['7000000126\|C1821366\|IG\|IG_000001\|+\|73:83', '7000000126\|C1821366\|IG\|IG_000002\|+\|438:455', '7000000126\|C1821366\|IG\|IG_000003\|+\|978:990']` |
| `CDS_orientations` | A list of booleans indicating the orientation of each CDS. `True` represents the forward strand, and `False` represents the reverse strand. | `[True, True, True, False]` |
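
The position ids can be used to rebuild the scaffold order from the separate CDS and IGS lists. The following is a minimal sketch using the example values from the table; the helper names `interleave_scaffold` and `parse_element_id` are hypothetical. It assumes the backslashes in the table's ids are table escaping, so the real ids are `|`-delimited, and it reads the last two id fields as strand and `start:end` coordinates, which the text above does not document.

```python
def interleave_scaffold(row):
    """Merge CDS and IGS elements into one list ordered by position id."""
    elements = [
        (pos, "CDS", seq)
        for pos, seq in zip(row["CDS_position_ids"], row["CDS_seqs"])
    ] + [
        (pos, "IGS", seq)
        for pos, seq in zip(row["IGS_position_ids"], row["IGS_seqs"])
    ]
    elements.sort(key=lambda element: element[0])
    return [(kind, seq) for _, kind, seq in elements]


def parse_element_id(element_id):
    """Split an id on '|'.

    Assumption: the last two fields are the strand and 'start:end'
    coordinates. The remaining fields are returned unparsed.
    """
    *prefix, strand, span = element_id.split("|")
    start, end = (int(x) for x in span.split(":"))
    return {"prefix": prefix, "strand": strand, "start": start, "end": end}


# Example row copied from the table (CDS sequences are truncated there).
row = {
    "CDS_seqs": ["MALTKVEKRNR...", "MLGIDNIERVK...", "MATIKVKQVR...", "MNLSNIKPAS..."],
    "IGS_seqs": ["AATTTAAGGAA", "TTTTAAAAGTATCGAAAT", "TTTTTAAAGAAAA"],
    "CDS_position_ids": [1, 3, 5, 6],
    "IGS_position_ids": [0, 2, 4],
    "IGS_ids": [
        "7000000126|C1821366|IG|IG_000001|+|73:83",
        "7000000126|C1821366|IG|IG_000002|+|438:455",
        "7000000126|C1821366|IG|IG_000003|+|978:990",
    ],
}

ordered = interleave_scaffold(row)
kinds = [kind for kind, _ in ordered]
print(kinds)

parsed_igs = [parse_element_id(i) for i in row["IGS_ids"]]
print(parsed_igs[0]["start"], parsed_igs[0]["end"])
```

In this example row, each IGS sequence length equals `end - start + 1`, which is consistent with inclusive coordinates. That is an inference from a single row, not a documented guarantee.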

Citing

OMG was introduced in "The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling". Feel free to cite:

@article{Cornman2024,
  title = {The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling},
  url = {https://www.biorxiv.org/content/early/2024/08/17/2024.08.14.607850},
  DOI = {10.1101/2024.08.14.607850},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Cornman, Andre and West-Roberts, Jacob and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha},
  year = {2024},
}
