LLM4Mol

LLM4Mol (LLM = Large Language Model) is a repository that collects studies applying large language models to molecular design, protein research, and materials science. It is meant as a reference for researchers and practitioners who want to use language models in these domains. The papers listed here cover biomedical text, RNA/DNA, small molecules, peptides, proteins, antibodies, clinical applications, chemistry, and materials. Contributions are welcome!

🔔Updating ...

Recommendations and references

Generative AI and Deep Learning for molecular/drug design
https://github.com/AspirinCode/papers-for-molecular-design-using-DL

List of papers about Proteins Design using Deep Learning
https://github.com/Peldom/papers_for_protein_design_using_DL

Large Language Models in Chemistry
https://github.com/alxfgh/Large-Language-Models-in-Chemistry

LLM4Biomedical Text

Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health [2023]
Tian, Shubo, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen, Yifan Yang et al.
arXiv:2306.10070 (2023)
Large language models are universal biomedical simulators [2023]
Schaefer, Moritz, Stephan Reichl, Rob ter Horst, Adele M. Nicolas, Thomas Krausgruber, Francesco Piras, Peter Stepper, Christoph Bock, and Matthias Samwald.
bioRxiv (2023) | code
Fine-tuning large neural language models for biomedical natural language processing [2023]
Tinn, Robert, Hao Cheng, Yu Gu, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon.
Patterns 4.4 (2023) | code
A Platform for the Biomedical Application of Large Language Models [2023]
Lobentanzer, Sebastian, and Julio Saez-Rodriguez.
arXiv:2305.06488v2 | code
Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations [2023]
Chen, Qingyu, Jingcheng Du, Yan Hu, Vipina Kuttichi Keloth, Xueqing Peng, Kalpana Raja, Rui Zhang, Zhiyong Lu, and Hua Xu.
arXiv:2305.16326v1 | code
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks [2023]
Zhang, K., Yu, J., Yan, Z., Liu, Y., Adhikarla, E., Fu, S., ... & Sun, L.
arXiv:2305.17100v1 | code
BioMedLM: a Domain-Specific Large Language Model for Biomedical Text [2022]
Paper | code

LLM4Small Molecule

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [2023]
Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao-Yong Wei, Hui Liu, Jiliang Tang, Qing Li
arXiv:2306.06615 (2023) | code
Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language [2023]
Philipp Seidl, Andreu Vall, Sepp Hochreiter, Günter Klambauer
arXiv:2303.03363 (2023) | code
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models [2023]
Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, Huajun Chen
arXiv:2306.08018v1 | code

LLM4RNA/DNA

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [2023]
Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré.
arXiv:2306.15794v1
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome [2021]
Ji, Yanrong, Zhihan Zhou, Han Liu, and Ramana V. Davuluri.
Bioinformatics 37.15 (2021) | code

LLM4Peptide

LMPred: predicting antimicrobial peptides using pre-trained language models and deep learning [2022]
Dee, William.
Bioinformatics Advances 2.1 (2022) | code

LLM4Protein

Protein-Protein Interaction Prediction is Achievable with Large Language Models [2023]
Hallee, Logan, and Jason P. Gleghorn.
bioRxiv (2023)
Prediction of virus-host association using protein language models and multiple instance learning [2023]
Liu, Dan, Francesca Young, David L. Robertson, and Ke Yuan.
bioRxiv (2023) | code
Large language models generate functional protein sequences across diverse families [2023]
Madani, Ali, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr et al.
Nat Biotechnol (2023) | code

LLM4Antibody

On Pre-training Language Model for Antibody [2023]
Wang, Danqing, Fei Ye, and Hao Zhou.
ICLR (2023) | code
Efficient evolution of human antibodies from general protein language models [2023]
Hie, Brian L., Varun R. Shanker, Duo Xu, Theodora UJ Bruun, Payton A. Weidenbacher, Shaogeng Tang, Wesley Wu, John E. Pak, and Peter S. Kim.
Nat Biotechnol (2023) | code
AbLang: an antibody language model for completing antibody sequences [2022]
Olsen, Tobias H., Iain H. Moal, and Charlotte M. Deane.
Bioinformatics Advances (2022) | code

LLM4Clinical

Matching Patients to Clinical Trials with Large Language Models [2023]
Jin, Qiao, Zifeng Wang, Charalampos S. Floudas, Jimeng Sun, and Zhiyong Lu.
arXiv:2307.15051 (2023)
ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation [2023]
Wang, Guangyu, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li.
arXiv:2306.09968v1

LLM4Chemistry

ChemCrow: Augmenting large-language models with chemistry tools [2023]
Bran, Andres M., Sam Cox, Andrew D. White, and Philippe Schwaller.
arXiv:2304.05376 (2023) | code

LLM4Material

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [2023]
Xie, Tong, Yuwei Wan, Wei Huang, Yufei Zhou, Yixuan Liu, Qingyuan Linghu, Shaozhou Wang, Chunyu Kit, Clara Grazian, and Bram Hoex.
arXiv:2304.02213v5
MatSciBERT: A materials domain language model for text mining and information extraction [2022]
Gupta, Tanishq, Mohd Zaki, NM Anoop Krishnan, and Mausam.
npj Comput Mater 8, 102 (2022) | code

jinzhuwei/LLM4Mol