OMG: An Open MetaGenomic Dataset

OMG is a 3.1T base pair metagenomic pretraining dataset that combines EMBL's MGnify and JGI's IMG databases. The combined data is pre-processed into a mixed-modality dataset, in which protein coding sequences are represented as translated amino acids and intergenic sequences as nucleic acids.

We make three datasets available on the HuggingFace Hub:

  • OMG: A mixed-modality dataset containing 3.3B protein sequences and 2.8B intergenic sequences.

  • OG: A subset of OMG consisting of high-quality genomes with taxonomic information. It contains 0.4B protein sequences and 0.3B intergenic sequences.

  • OMG_prot50: A protein-only dataset generated by clustering OMG at 50% sequence identity, resulting in 207M protein sequences.

Usage

Install the HuggingFace Datasets library:
pip install datasets

Download OMG from the HuggingFace Hub:

import datasets

ds = datasets.load_dataset('tattabio/OMG')

To preview the dataset without downloading it in full, load it in streaming mode:

import datasets

ds = datasets.load_dataset('tattabio/OMG', streaming=True)['train']
print(next(iter(ds)))
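
To look at more than one row, the streamed dataset can be consumed like any other Python iterable. The following is a minimal sketch, not part of this repository: `summarize_rows` is a hypothetical helper, and it assumes each row is a dict with the `CDS_seqs` and `IGS_seqs` fields described under Format. The stand-in rows are made up so the snippet runs offline.

```python
from itertools import islice


def summarize_rows(rows, n=3):
    """Return element counts and total lengths for the first n rows.

    Hypothetical helper for illustration; assumes each row is a dict
    with 'CDS_seqs' and 'IGS_seqs' lists of strings.
    """
    summaries = []
    for row in islice(rows, n):
        summaries.append({
            "num_cds": len(row["CDS_seqs"]),
            "num_igs": len(row["IGS_seqs"]),
            "total_aa": sum(len(s) for s in row["CDS_seqs"]),
            "total_nt": sum(len(s) for s in row["IGS_seqs"]),
        })
    return summaries


# Stand-in rows (not real OMG data) so the sketch runs without a download.
example_rows = [
    {"CDS_seqs": ["MALTK", "MLG"], "IGS_seqs": ["AATT"]},
    {"CDS_seqs": ["MAT"], "IGS_seqs": []},
]
print(summarize_rows(example_rows, n=2))
```

With the real data, pass the streaming dataset `ds` from the snippet above in place of `example_rows`; `islice` stops after `n` rows, so only those rows are fetched.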

An example of tokenizing the dataset is provided in scripts/tokenize.py.

cd scripts
python tokenize.py --dataset_name=tattabio/OMG

Format

Each row of the dataset represents a genomic scaffold, as an ordered list of amino acid coding sequences (CDS) and nucleotide intergenic sequences (IGS).

Each row has the following features:

  • CDS_seqs: A list of strings representing the amino acid CDS sequences.
    Example: ['MALTKVEKRNR...', 'MLGIDNIERVK...', 'MATIKVKQVR...', 'MNLSNIKPAS...']

  • IGS_seqs: A list of strings representing the nucleotide IGS sequences.
    Example: ['AATTTAAGGAA', 'TTTTAAAAGTATCGAAAT', 'TTTTTAAAGAAAA']

  • CDS_position_ids: A list of integers representing the position of each CDS element in the scaffold.
    Example: [1, 3, 5, 6]

  • IGS_position_ids: A list of integers representing the position of each IGS element in the scaffold.
    Example: [0, 2, 4]

  • CDS_ids: A list of string identifiers for each CDS element.
    Example: ['7000000126|C1821366|CDS|gene_115413|+|84:437', '7000000126|C1821366|CDS|gene_115414|+|456:977', '7000000126|C1821366|CDS|gene_115415|+|991:1167', '7000000126|C1821366|CDS|gene_115416|+|1168:1689']

  • IGS_ids: A list of string identifiers for each IGS element.
    Example: ['7000000126|C1821366|IG|IG_000001|+|73:83', '7000000126|C1821366|IG|IG_000002|+|438:455', '7000000126|C1821366|IG|IG_000003|+|978:990']

  • CDS_orientations: A list of booleans indicating the orientation of each CDS. True represents the forward strand, and False represents the reverse strand.
    Example: [True, True, True, False]
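
Because the CDS and IGS elements are stored in separate lists, recovering the scaffold order means merging them by position id. The sketch below shows one way to do that using the example values above. `interleave_scaffold` is a hypothetical helper, not a function from this repository, and it assumes the two position id lists together index every element of the scaffold.

```python
def interleave_scaffold(row):
    """Return (kind, sequence) pairs in scaffold order.

    Hypothetical helper for illustration. Assumes CDS_position_ids and
    IGS_position_ids are parallel to CDS_seqs and IGS_seqs respectively.
    """
    elements = [
        (pos, "CDS", seq)
        for pos, seq in zip(row["CDS_position_ids"], row["CDS_seqs"])
    ]
    elements += [
        (pos, "IGS", seq)
        for pos, seq in zip(row["IGS_position_ids"], row["IGS_seqs"])
    ]
    # Sort by position id so CDS and IGS elements appear in scaffold order.
    elements.sort(key=lambda element: element[0])
    return [(kind, seq) for _, kind, seq in elements]


# Example row built from the (truncated) example values in this section.
row = {
    "CDS_seqs": ["MALTKVEKRNR...", "MLGIDNIERVK...", "MATIKVKQVR...", "MNLSNIKPAS..."],
    "IGS_seqs": ["AATTTAAGGAA", "TTTTAAAAGTATCGAAAT", "TTTTTAAAGAAAA"],
    "CDS_position_ids": [1, 3, 5, 6],
    "IGS_position_ids": [0, 2, 4],
}

for kind, seq in interleave_scaffold(row):
    print(kind, seq)
```

For the example row, the scaffold starts with an IGS element at position 0 and ends with two adjacent CDS elements at positions 5 and 6.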

The format for the CDS and IGS id fields is: sample_accession|contig_id|feature_type|gene_id|strand|start:end
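
An id string in this format can be split into its named parts. The following is a minimal sketch: `FeatureId` and `parse_feature_id` are hypothetical names introduced for illustration, and the parser assumes that no individual field contains a `|` character and that `start` and `end` are integers.

```python
from typing import NamedTuple


class FeatureId(NamedTuple):
    """Fields of a CDS or IGS id, named after the format string above."""
    sample_accession: str
    contig_id: str
    feature_type: str
    gene_id: str
    strand: str
    start: int
    end: int


def parse_feature_id(feature_id: str) -> FeatureId:
    """Split an id on '|' and parse the trailing 'start:end' coordinates.

    Hypothetical helper; raises ValueError if the id does not have
    exactly six '|'-separated fields.
    """
    sample, contig, feature_type, gene, strand, coords = feature_id.split("|")
    start, end = coords.split(":")
    return FeatureId(sample, contig, feature_type, gene, strand, int(start), int(end))


parsed = parse_feature_id("7000000126|C1821366|CDS|gene_115413|+|84:437")
print(parsed.contig_id, parsed.feature_type, parsed.strand, parsed.start, parsed.end)
```

In the example ids, the IGS element `73:83` corresponds to the 11-nucleotide sequence 'AATTTAAGGAA', which suggests the coordinates are inclusive at both ends; treat that as an observation from the example rather than a documented guarantee.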

Citing

OMG was introduced in "The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling". Feel free to cite:

@article{Cornman2024,
  title = {The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling},
  url = {https://www.biorxiv.org/content/early/2024/08/17/2024.08.14.607850},
  DOI = {10.1101/2024.08.14.607850},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Cornman, Andre and West-Roberts, Jacob and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha},
  year = {2024},
}