A collection of genomics LLMs

Genomics Foundation Models and LLMs

A GitHub repository containing a collection of Foundation Models and LLMs for Genomics and Bioinformatics.

Foundation Models for Genomics

Foundation models are a broad class of large-scale models that serve as the base, or foundation, for a variety of downstream tasks. Genomic foundation models are trained on large-scale genomic datasets, such as DNA, RNA, and protein sequencing data, among others.

  1. BigRNA: BigRNA is an advanced AI foundation model with 1.8 billion tunable parameters, created from 1 trillion genomic signals. It can accurately predict thousands of different molecular biology outcomes, which enables the discovery of targets, disease mechanisms, and RNA therapeutics.

  2. RNA-FM: RNA-FM is capable of incorporating coding sequences (CDS) and representing them with contextual embeddings, providing benefits for mRNA and protein-related tasks.

  3. DNABERT-S: DNABERT-S is a genome foundation model that specializes in creating species-aware DNA embeddings.

  4. Evo: Evo is a DNA foundation model that can be used for modeling at scales ranging from the molecular to the whole genome.

  5. DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA language in genome

  6. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

  7. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Genomics

  8. Genomic sequence classification: Benchmarking DNA Foundation Models for Genomic Sequence Classification

Large Language Models (LLMs) for Genomics

LLMs are a subset of foundation models built around language modeling. The entries in this section focus specifically on genomics-related tasks.

  1. CodonBERT: CodonBERT is an LLM for RNA that mainly helps with optimizing the codons in an mRNA sequence. It takes codons, rather than individual nucleotides, as inputs, which enables it to learn better representations. CodonBERT was trained on more than 10 million mRNA sequences from diverse organisms, and the resulting model captures important biological concepts. It can also be extended to perform prediction tasks for various mRNA properties.

  2. BioLLMBench: A Comprehensive Benchmarking of LLMs in Bioinformatics.

  3. Genomic Language Models: Opportunities and Challenges
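The CodonBERT entry above notes that the model takes codons as inputs. As a rough illustration of what codon-level input means, the sketch below splits an mRNA coding sequence into non-overlapping three-nucleotide tokens and maps them to integer IDs. The function names (`codon_tokens`, `build_codon_vocab`, `encode`) and the vocabulary ordering are hypothetical and chosen for this example only. The actual CodonBERT tokenizer, its special tokens, and its vocabulary are not described here and may differ.

```python
from itertools import product


def codon_tokens(mrna: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide codons.

    Hypothetical helper for illustration; not the CodonBERT tokenizer.
    """
    # Normalize case and accept DNA-style input by mapping T to U.
    seq = mrna.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    if set(seq) - set("ACGU"):
        raise ValueError("sequence contains characters other than A, C, G, U/T")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def build_codon_vocab() -> dict[str, int]:
    """Map each of the 64 possible codons to an integer ID.

    The ordering is an assumption made for this sketch.
    """
    codons = ["".join(p) for p in product("ACGU", repeat=3)]
    return {codon: idx for idx, codon in enumerate(codons)}


def encode(mrna: str, vocab: dict[str, int]) -> list[int]:
    """Convert a coding sequence into a list of codon IDs."""
    return [vocab[codon] for codon in codon_tokens(mrna)]


if __name__ == "__main__":
    vocab = build_codon_vocab()
    print(codon_tokens("AUGGCCUAA"))
    print(encode("AUGGCCUAA", vocab))
```

Tokenizing by codon gives a vocabulary of 64 symbols instead of 4 and makes each token correspond to one amino acid or stop signal, which is one plausible reason codon-level inputs could yield more informative representations than single nucleotides.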