Welcome to the official repository for M3-20M, the first large-scale Multi-Modal Molecular dataset, containing over 20 million molecules! 🎉
M3-20M (Multi-Modal Molecular dataset) is designed to support AI-driven drug design and discovery. It provides an unprecedented scale that highly benefits the training or fine-tuning of large models for superior performance in drug design and discovery tasks.
- Scale: Contains over 20 million molecules, 71 times more than the largest existing dataset.
- Comprehensive Modalities:
- One-dimensional SMILES strings
- Two-dimensional molecular graphs
- Three-dimensional molecular structures
- Physicochemical properties
- Text descriptions
- Diverse Applications: Supports various downstream tasks such as molecule generation, molecular property prediction, lead optimization, virtual screening, pharmacokinetics modeling, and drug-target interaction prediction.
M3-20M integrates data from multiple sources to provide a comprehensive view of each molecule. Here’s what you can find in the dataset:
- M^3_Original.csv: Descriptions from PubChem
- M^3_Physicochemical.csv: Physicochemical properties
- M^3_Description_Physicochemical.csv: Descriptions composed of physicochemical properties
- M^3_Multi.csv: Descriptions from PubChem, physicochemical properties, and those generated by GPT-3.5
- MPP folder: Contains multimodal datasets for molecular property prediction (BBBP-MM, BACE-MM, HIV-MM, ClinTox-MM, Tox21-MM)
- MOSES-Multi folder: Contains MOSES multimodal datasets for molecular generation
- QM9-Multi folder: Contains QM9 multimodal datasets
We provide convenient functions that allow you to easily obtain the dataset, as well as the 2D and 3D representations of any molecule outside the dataset. The specific functions can be found in the Function folder.
The dataset is available for download from multiple sources:
- Google Drive: Download Link
- Baidu Cloud: Download Link password:ADMS
- Hugging Face: Download Link
Here’s a simple example of how to load and explore the dataset:
import pandas as pd
# Load the dataset
df = pd.read_csv('path-to-dataset.csv')
# Display the first few rows
print(df.head())
We welcome contributions from the community! Feel free to submit issues or pull requests to help improve the dataset and its applications.
This project is licensed under the MIT License - see the LICENSE file for details.
This dataset is a collaborative effort by researchers from Tongji University and Fudan University. We thank Lexuan Wang, Chang Jin, Jinxian Wang, Han Peng, Huayang Shi, Wengen Li, Jihong Guan, and Shuigeng Zhou for their contributions and support.
For any questions or inquiries, please reach out to gsy9901224@tongji.edu.cn.
Enjoy using M3-20M and happy researching! 🚀🔬