OMG is a 3.1T base pair metagenomic pretraining dataset that combines EMBL's MGnify and JGI's IMG databases. The combined data is pre-processed into a mixed-modality dataset, with translated amino acids for protein-coding sequences and nucleic acids for intergenic sequences.
We make three datasets available on the HuggingFace Hub:
- OMG: A mixed-modality dataset containing 3.3B protein sequences and 2.8B intergenic sequences.
- OG: A subset of OMG consisting of high-quality genomes with taxonomic information. Contains 0.4B protein sequences and 0.3B intergenic sequences.
- OMG_prot50: A protein-only dataset generated by clustering OMG at 50% sequence identity, resulting in 207M protein sequences.
Install the HuggingFace Datasets library:
pip install datasets
Download OMG from the HuggingFace Hub:
import datasets
ds = datasets.load_dataset('tattabio/OMG')
To preview the dataset without downloading, load in streaming mode:
import datasets
ds = datasets.load_dataset('tattabio/OMG', streaming=True)['train']
print(next(iter(ds)))
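To look at more than one row without pulling the whole stream, the first few rows can be taken with `itertools.islice`. The sketch below is a minimal illustration, not part of the repository: `summarize_rows` is a hypothetical helper, and the stand-in rows are made up so that the snippet runs offline. It assumes only that each row exposes the `CDS_seqs` and `IGS_seqs` columns.

```python
from itertools import islice


def summarize_rows(rows, n=3):
    """Return (number of CDS, number of IGS) for the first n rows.

    `rows` can be any iterable of row dicts, including a streaming split.
    Hypothetical helper for illustration only.
    """
    return [(len(r["CDS_seqs"]), len(r["IGS_seqs"])) for r in islice(rows, n)]


# Made-up stand-in rows so the example runs without network access.
fake_rows = [
    {"CDS_seqs": ["MAL", "MLG"], "IGS_seqs": ["AATT"]},
    {"CDS_seqs": ["MAT"], "IGS_seqs": []},
]

summary = summarize_rows(fake_rows)
print(summary)  # [(2, 1), (1, 0)]
```

With the streaming split loaded as `ds` above, `summarize_rows(ds, n=5)` would read only the first five scaffolds.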
An example of tokenizing the dataset is provided in scripts/tokenize.py.
cd scripts
python tokenize.py --dataset_name=tattabio/OMG
Each row of the dataset represents a genomic scaffold, as an ordered list of amino acid coding sequences (CDS) and nucleotide intergenic sequences (IGS).
Feature | Description | Example |
---|---|---|
CDS_seqs | A list of strings representing the amino acid CDS sequences. | ['MALTKVEKRNR...', 'MLGIDNIERVK...', 'MATIKVKQVR...', 'MNLSNIKPAS...'] |
IGS_seqs | A list of strings representing the nucleotide IGS sequences. | ['AATTTAAGGAA', 'TTTTAAAAGTATCGAAAT', 'TTTTTAAAGAAAA'] |
CDS_position_ids | A list of integers representing the position of each CDS element in the scaffold. | [1, 3, 5, 6] |
IGS_position_ids | A list of integers representing the position of each IGS element in the scaffold. | [0, 2, 4] |
CDS_ids | A list of string identifiers for each CDS element. | ['7000000126|C1821366|CDS|gene_115413|+|84:437', '7000000126|C1821366|CDS|gene_115414|+|456:977', '7000000126|C1821366|CDS|gene_115415|+|991:1167', '7000000126|C1821366|CDS|gene_115416|+|1168:1689'] |
IGS_ids | A list of string identifiers for each IGS element. | ['7000000126|C1821366|IG|IG_000001|+|73:83', '7000000126|C1821366|IG|IG_000002|+|438:455', '7000000126|C1821366|IG|IG_000003|+|978:990'] |
CDS_orientations | A list of booleans indicating the orientation of each CDS. True represents the forward strand, and False represents the reverse strand. | [True, True, True, False] |
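Because the CDS and IGS elements are stored in separate lists, recovering the scaffold's original order means merging them by position id. The following is a minimal sketch of that merge, assuming the position ids are unique across both lists; `interleave_scaffold` is a hypothetical helper, and the example row reuses the truncated sequence prefixes shown in the table.

```python
def interleave_scaffold(row):
    """Return scaffold elements in order as (position, kind, sequence) tuples.

    Hypothetical helper: merges the CDS and IGS lists by their position ids.
    """
    elements = [
        (pos, "CDS", seq)
        for pos, seq in zip(row["CDS_position_ids"], row["CDS_seqs"])
    ]
    elements += [
        (pos, "IGS", seq)
        for pos, seq in zip(row["IGS_position_ids"], row["IGS_seqs"])
    ]
    # Sort by position id to restore the order along the scaffold.
    return sorted(elements, key=lambda element: element[0])


# Example row using the (truncated) values from the feature table.
row = {
    "CDS_seqs": ["MALTKVEKRNR", "MLGIDNIERVK", "MATIKVKQVR", "MNLSNIKPAS"],
    "IGS_seqs": ["AATTTAAGGAA", "TTTTAAAAGTATCGAAAT", "TTTTTAAAGAAAA"],
    "CDS_position_ids": [1, 3, 5, 6],
    "IGS_position_ids": [0, 2, 4],
}

ordered = interleave_scaffold(row)
print([kind for _, kind, _ in ordered])
# ['IGS', 'CDS', 'IGS', 'CDS', 'IGS', 'CDS', 'CDS']
```

Note that two CDS elements can be adjacent (positions 5 and 6 here), so the order cannot be reconstructed by simply alternating the two lists.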
The format for the CDS and IGS id fields is: sample_accession|contig_id|feature_type|gene_id|strand|start:end
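An identifier in this format can be split into its named parts with plain string operations. The sketch below assumes that none of the fields contain a `|` character and that `start` and `end` are integers, as in the examples in the table; `FeatureId` and `parse_feature_id` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class FeatureId(NamedTuple):
    """Parsed form of a CDS or IGS identifier (hypothetical container)."""

    sample_accession: str
    contig_id: str
    feature_type: str
    gene_id: str
    strand: str
    start: int
    end: int


def parse_feature_id(feature_id: str) -> FeatureId:
    """Split 'sample|contig|type|gene|strand|start:end' into named fields."""
    sample, contig, feature_type, gene, strand, coords = feature_id.split("|")
    start, end = coords.split(":")
    return FeatureId(sample, contig, feature_type, gene, strand, int(start), int(end))


parsed = parse_feature_id("7000000126|C1821366|CDS|gene_115413|+|84:437")
print(parsed.gene_id, parsed.strand, parsed.start, parsed.end)
# gene_115413 + 84 437
```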
OMG was introduced in "The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling". Feel free to cite:
@article{Cornman2024,
title = {The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling},
url = {https://www.biorxiv.org/content/early/2024/08/17/2024.08.14.607850},
DOI = {10.1101/2024.08.14.607850},
publisher = {Cold Spring Harbor Laboratory},
author = {Cornman, Andre and West-Roberts, Jacob and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha},
year = {2024},
}