Awesome-Biomolecule-Language-Cross-Modeling: a curated list of resources for the paper "Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey"

MIT License

🧬📝 Awesome Biomolecule-Language Cross Modeling

This repository accompanies the survey "Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey". It collects related models, datasets and benchmarks, and links to other resources.

🔥 We will keep this repository updated.

🌟 If you have a paper or resource you'd like to add, feel free to submit a pull request, open an issue, or email the author at qizhipei@ruc.edu.cn.

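Table of Contents

  • Models
    • Biotext
    • Text + Molecule
    • Text + Protein
    • More Modalities
  • Datasets & Benchmarks
  • Related Resources
    • Related Surveys & Evaluations
    • Related Workshop
    • Related Repositories
  • Acknowledgements
  • Citations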


Models

Biotext

  • BioBERT: a pre-trained biomedical language representation model for biomedical text mining

  • SciBERT: A Pretrained Language Model for Scientific Text

  • (BlueBERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets

  • Bio-Megatron: Larger Biomedical Domain Language Model

  • ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

  • BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA

  • (PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

  • SciFive: a text-to-text transformer model for biomedical literature

  • (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining

  • LinkBERT: Pretraining Language Models with Document Links

  • BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

  • BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

  • GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records

  • Large language models encode clinical knowledge

  • (ScholarBERT) The Diminishing Returns of Masked Language Models to Science

  • PMC-LLaMA: Further Finetuning LLaMA on Medical Papers

  • BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

  • (GatorTronGPT) A study of generative large language model for medical research and healthcare

  • Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding

  • MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

  • BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials

  • ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

  • MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data

  • SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

  • BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

  • BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Text + Molecule

  • Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries

  • (MolT5) Translation between Molecules and Natural Language

  • (KV-PLM) A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals

  • (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

  • (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling

  • (CLAMP) Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

  • GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning

  • (HI-Mol) Data-Efficient Molecular Generation with Hierarchical Textual Inversion

  • MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction

  • (ChemLLMBench) What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks

  • MolXPT: Wrapping Molecules with Text for Generative Pre-training

  • (TextReact) Predictive Chemistry Augmented with Text Retrieval

  • MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

  • ReLM: Leveraging Language Models for Enhanced Chemical Reaction Prediction

  • (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

  • (AMAN) Adversarial Modality Alignment Network for Cross-Modal Molecule Retrieval

  • MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations

  • (MolReGPT) Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

  • (CaR) Can Large Language Models Empower Molecular Property Prediction?

  • MolFM: A Multimodal Molecular Foundation Model

  • (ChatMol) Interactive Molecular Discovery with Natural Language

  • InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery

  • ChemCrow: Augmenting large-language models with chemistry tools

  • GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction

  • nach0: Multimodal Natural and Chemical Languages Foundation Model

  • DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs

  • (Ada/Aug-T5) From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

  • MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts

  • (TGM-DLM) Text-Guided Molecule Generation with Diffusion Language Model

  • GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

  • PolyNC: a natural and chemical language model for the prediction of unified polymer properties

  • MolTC: Towards Molecular Relational Modeling In Language Models

  • T-Rex: Text-assisted Retrosynthesis Prediction

  • LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

  • (Drug-to-indication) Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications

  • ChemDFM: Dialogue Foundation Model for Chemistry

  • DrugAssist: A Large Language Model for Molecule Optimization

  • ChemLLM: A Chemical Large Language Model

  • (TEDMol) Text-guided Diffusion Model for 3D Molecule Generation

  • (3DToMolo) Sculpting Molecules in 3D: A Flexible Substructure Aware Framework for Text-Oriented Molecular Optimization

  • (ICMA) Large Language Models are In-Context Molecule Learners

  • Benchmarking Large Language Models for Molecule Prediction Tasks

  • DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs

  • 3M-Diffusion: Latent Multi-Modal Diffusion for Text-Guided Generation of Molecular Graphs

  • (TSMMG) Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

  • A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions

  • Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

  • ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining

  • LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space

  • (MV-Mol) Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge

  • HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment

  • PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes

  • 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

  • MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction

  • DrugLLM: Open Large Language Model for Few-shot Molecule Generation

  • (AMOLE) Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models

  • Chemical Language Models Have Problems with Chemistry: A Case Study on Molecule Captioning Task

  • MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

  • UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation

Text + Protein

  • OntoProtein: Protein Pretraining With Gene Ontology Embedding

  • ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description

  • ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

  • InstructProtein: Aligning Human and Protein Language via Knowledge Instruction

  • (ProteinDT) A Text-guided Protein Design Framework

  • ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein 3D Structures

  • Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers

  • ProtChatGPT: Towards Understanding Proteins with Large Language Models

  • ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning

  • ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

  • ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

  • ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

  • ProteinCLIP: enhancing protein language models with natural language

  • ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction

  • (PAAG) Functional Protein Design with Local Domain Alignment

  • (Pinal) Toward De Novo Protein Design from Natural Language

  • TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

More Modalities

  • Galactica: A Large Language Model for Science

  • BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

  • DARWIN Series: Domain Specific Large Language Models for Natural Science

  • BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

  • (StructChem) Structured Chemistry Reasoning with Large Language Models

  • (BioTranslator) Multilingual translation for zero-shot biomedical classification using BioTranslator

  • Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

  • (ChatDrug) ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback

  • BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs

  • (KEDD) Towards Unified AI Drug Discovery with Multiple Knowledge Modalities

  • (Otter Knowledge) Knowledge Enhanced Representation Learning for Drug Discovery

  • ChatCell: Facilitating Single-Cell Analysis with Natural Language

  • LangCell: Language-Cell Pre-training for Cell Identity Understanding

  • BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

  • MolBind: Multimodal Alignment of Language, Molecules, and Proteins

  • Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

  • Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains

  • An Evaluation of Large Language Models in Bioinformatics Research

  • SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences

Datasets & Benchmarks

| Dataset | Usage | Modality | Link |
|---|---|---|---|
| PubMed | Pre-training | Text | https://pubmed.ncbi.nlm.nih.gov/download |
| bioRxiv | Pre-training | Text | https://huggingface.co/datasets/mteb/raw_biorxiv, https://www.biorxiv.org/tdm |
| MedRxiv | Pre-training | Text | https://www.medrxiv.org/tdm |
| S2ORC | Pre-training | Text | https://github.com/allenai/s2orc |
| MIMIC | Pre-training | Text | https://physionet.org/content/mimiciii/1.4 |
| UF Health | Pre-training | Text | https://idr.ufhealth.org |
| Elsevier Corpus | Pre-training | Text | https://elsevier.digitalcommonsdata.com/datasets/zm33cdndxs/3 |
| Europe PMC | Pre-training | Text | https://europepmc.org/downloads |
| LibreText | Pre-training | Text | https://chem.libretexts.org |
| NLM literature archive | Pre-training | Text | https://ftp.ncbi.nlm.nih.gov/pub/litarch/ |
| GAP-Replay | Pre-training | Text | - |
| ZINC | Pre-training | Molecule | https://zinc15.docking.org, https://zinc20.docking.org |
| UniProt | Pre-training | Protein | https://www.uniprot.org |
| ChEMBL | Pre-training | Molecule, Bioassay | https://www.ebi.ac.uk/chembl |
| GIMLET | Pre-training | Molecule, Bioassay | https://github.com/zhao-ht/GIMLET, https://huggingface.co/datasets/haitengzhao/molecule_property_instruction |
| PubChem | Pre-training | Text, Molecule, IUPAC, etc. | https://ftp.ncbi.nlm.nih.gov/pubchem |
| InterPT | Pre-training | Text, Protein | https://huggingface.co/datasets/ProtLLM/ProtLLM |
| STRING | Pre-training | Text, Protein, etc. | https://string-db.org |
| BLURB | Fine-tuning | Text | https://microsoft.github.io/BLURB |
| PubMedQA | Fine-tuning | Text | https://github.com/pubmedqa/pubmedqa |
| SciQ | Fine-tuning | Text | https://huggingface.co/datasets/sciq |
| BioASQ | Fine-tuning | Text | http://participants-area.bioasq.org/datasets |
| MoleculeNet | Fine-tuning | Molecule | https://moleculenet.org/datasets-1 |
| MoleculeACE | Fine-tuning | Molecule | https://github.com/molML/MoleculeACE |
| TDC | Fine-tuning | Molecule | https://tdcommons.ai/ |
| USPTO | Fine-tuning | Molecule | https://yzhang.hpc.nyu.edu/T5Chem |
| Graph2graph | Fine-tuning | Molecule | https://github.com/wengong-jin/iclr19-graph2graph/tree/master/data |
| PEER | Fine-tuning | Protein | https://github.com/DeepGraphLearning/PEER_Benchmark |
| FLIP | Fine-tuning | Protein | https://benchmark.protein.properties |
| TAPE | Fine-tuning | Protein | https://github.com/songlab-cal/tape |
| PubChemSTM | Fine-tuning | Text, Molecule | https://huggingface.co/datasets/chao1224/MoleculeSTM/tree/main |
| PseudoMD-1M | Fine-tuning | Text, Molecule | https://huggingface.co/datasets/SCIR-HI/PseudoMD-1M |
| ChEBI-20 | Fine-tuning | Text, Molecule | https://github.com/blender-nlp/MolT5 |
| ChEBI-20-MM | Fine-tuning | Text, Molecule | https://github.com/AI-HPC-Research-Team/SLM4Mol |
| ChEBI-dia | Fine-tuning | Text, Molecule | https://github.com/Ellenzzn/ChatMol/tree/main/data/ChEBI-dia |
| L+M-24 | Fine-tuning | Text, Molecule | https://github.com/language-plus-molecules/LPM-24-Dataset |
| PCdes | Fine-tuning | Text, Molecule | https://github.com/thunlp/KV-PLM |
| MoMu | Fine-tuning | Text, Molecule | https://github.com/yangzhao1230/GraphTextRetrieval |
| PubChemQA | Fine-tuning | Text, Molecule | https://github.com/PharMolix/OpenBioMed |
| 3D-MoIT | Fine-tuning | Text, Molecule | https://huggingface.co/datasets/Sihangli/3D-MoIT |
| MoleculeQA | Fine-tuning | Text, Molecule | https://github.com/IDEA-XL/MoleculeQA |
| DrugBank | Fine-tuning | Text, Molecule, etc. | https://github.com/SCIR-HI/ArtificiallyR2R |
| SwissProt | Fine-tuning | Text, Protein | https://www.expasy.org/resources/uniprotkb-swiss-prot |
| UniProtQA | Fine-tuning | Text, Protein | https://github.com/PharMolix/OpenBioMed |
| SciEval | Instruction | Text | https://github.com/OpenDFM/SciEval |
| BioInfo-Bench | Instruction | Text | https://github.com/cinnnna/bioinfo-bench |
| MedC-I | Instruction | Text | https://huggingface.co/datasets/axiong/pmc_llama_instructions |
| BioMedEval | Instruction | Text | https://github.com/tahmedge/llm-eval-biomed |
| MolOpt-Instructions | Instruction | Text, Molecule | https://github.com/blazerye/DrugAssist |
| SMolInstruct | Instruction | Text, Molecule | https://github.com/OSU-NLP-Group/LLM4Chem |
| ChemLLMBench | Instruction | Text, Molecule | https://github.com/ChemFoundationModels/ChemLLMBench |
| AI4Chem | Instruction | Text, Molecule | https://github.com/andresilvapimentel/AI4Chem |
| GPTChem | Instruction | Text, Molecule | https://github.com/kjappelbaum/gptchem |
| SLM4CRP_with_RTs | Instruction | Text, Molecule | https://huggingface.co/datasets/liupf/SLM4CRP_with_RTs |
| DARWIN | Instruction | Text, Molecule, etc. | https://github.com/MasterAI-EAM/Darwin/tree/main/dataset |
| StructChem | Instruction | Text, Molecule, etc. | https://github.com/ozyyshr/StructChem |
| SciAssess | Instruction | Text, Molecule, etc. | https://sci-assess.github.io, https://github.com/sci-assess/SciAssess |
| InstructProtein | Instruction | Text, Protein | - |
| Open Protein Instructions | Instruction | Text, Protein | https://github.com/baaihealth/opi |
| Mol-Instructions | Instruction | Text, Molecule, Protein | https://huggingface.co/datasets/zjunlp/Mol-Instructions |
| CheF | - | Text, Molecule | https://github.com/kosonocky/CheF |
| IUPAC Gold Book | - | Text, Molecule | https://goldbook.iupac.org |
| ChemNLP | - | Text, Molecule, etc. | https://github.com/OpenBioML/chemnlp |
| ChemFOnt | - | Text, Molecule, Protein, etc. | https://www.chemfont.ca |
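
To pick datasets from the table by usage or modality, one option is to hold the rows as small records and filter them. The sketch below is only an illustration: the `DatasetEntry` type and the `select` helper are hypothetical names that are not part of this repository, and only a handful of rows from the table are copied in.

```python
from typing import NamedTuple


class DatasetEntry(NamedTuple):
    """One row of the Datasets & Benchmarks table (hypothetical record type)."""
    name: str
    usage: str
    modalities: tuple
    links: tuple


# A few rows copied from the table above; the remaining rows follow the same shape.
ENTRIES = [
    DatasetEntry("PubMed", "Pre-training", ("Text",),
                 ("https://pubmed.ncbi.nlm.nih.gov/download",)),
    DatasetEntry("ZINC", "Pre-training", ("Molecule",),
                 ("https://zinc15.docking.org", "https://zinc20.docking.org")),
    DatasetEntry("UniProt", "Pre-training", ("Protein",),
                 ("https://www.uniprot.org",)),
    DatasetEntry("ChEBI-20", "Fine-tuning", ("Text", "Molecule"),
                 ("https://github.com/blender-nlp/MolT5",)),
    DatasetEntry("SwissProt", "Fine-tuning", ("Text", "Protein"),
                 ("https://www.expasy.org/resources/uniprotkb-swiss-prot",)),
    DatasetEntry("Mol-Instructions", "Instruction", ("Text", "Molecule", "Protein"),
                 ("https://huggingface.co/datasets/zjunlp/Mol-Instructions",)),
]


def select(entries, usage=None, modalities=()):
    """Return names of entries that match `usage` (if given) and cover every requested modality."""
    wanted = set(modalities)
    return [
        e.name
        for e in entries
        if (usage is None or e.usage == usage) and wanted <= set(e.modalities)
    ]


# Datasets that pair text with proteins, in table order.
print(select(ENTRIES, modalities=("Text", "Protein")))
# Datasets listed for pre-training.
print(select(ENTRIES, usage="Pre-training"))
```

With the six rows above, the first call returns SwissProt and Mol-Instructions, and the second returns PubMed, ZINC, and UniProt.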

Related Resources

Related Surveys & Evaluations

  • A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery (arXiv 2406)
  • Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule (arXiv 2403)
  • Bioinformatics and Biomedical Informatics with ChatGPT: Year One Review (arXiv 2403)
  • From Words to Molecules: A Survey of Large Language Models in Chemistry (arXiv 2402)
  • Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science (arXiv 2402)
  • Progress and Opportunities of Foundation Models in Bioinformatics (arXiv 2402)
  • Scientific Large Language Models: A Survey on Biological & Chemical Domains (arXiv 2401)
  • The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4 (arXiv 2311)
  • Transformers and Large Language Models for Chemistry and Drug Discovery (arXiv 2310)
  • Language models in molecular discovery (arXiv 2309)
  • What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (NeurIPS 2309)
  • Do Large Language Models Understand Chemistry? A Conversation with ChatGPT (JCIM 2303)
  • A Systematic Survey of Chemical Pre-trained Models (IJCAI 2023)

Related Workshop

Related Repositories

Acknowledgements

This repository is maintained by Qizhi Pei and Lijun Wu. If you have questions, feel free to open an issue or email Qizhi Pei at qizhipei@ruc.edu.cn or Lijun Wu at lijun_wu@outlook.com. We are happy to hear from you!

Citations

@article{pei2024leveraging,
  title={Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey},
  author={Pei, Qizhi and Wu, Lijun and Gao, Kaiyuan and Zhu, Jinhua and Wang, Yue and Wang, Zun and Qin, Tao and Yan, Rui},
  journal={arXiv preprint arXiv:2403.01528},
  year={2024}
}
