The goal of this project is to train a neural network model on the QM7 dataset and evaluate its performance using mean absolute error (MAE) as the evaluation metric. The project includes the implementation of Machine Learning models, MLP, GCN, data processing functions, and training procedures.
The QM7 dataset contains quantum mechanical calculations of molecular energies for a set of organic molecules. The dataset is stored in the 'data/qm7.mat' file.
The project requires the following dependencies:
- Python 3.x
- PyTorch
- NumPy
- Matplotlib
- tqdm
The project consists of the following files and directories:
model/gnn.py
: Contains the GNN (Graph Neural Network) model implementation.model/ml.py
: Contains the ML (Machine Learning) model implementations, including linear regression, Gaussian process regression, kernel ridge regression, and support vector regression.utils.py
: Contains utility functions for data preprocessing and visualization.data/qm7.mat
: The input dataset in MATLAB format.exps/
: Output directory for experiment results, including weights and visualization plots.
-
Clone the repository:
git clone https://github.com/frankielp/QM7
-
Install the required dependencies:
pip install -r requirements.txt
-
Prepare the data:
- Place the QM7 dataset file ('qm7.mat') in the 'data/' directory.
-
Run the
train.py
script to train and evaluate the models.python train.py --model <model_name> --data <data_type>
Replace
<model_name>
with the desired model:linear_regression
,gaussian_process_regression
,kernel_ridge_regression
,support_vector_regression
, ormlp
.Replace
<data_type>
with the desired data type for the ML model:eigenspectrum
(sorted data) orrandom
(randomly shuffled data).Example:
python train.py model=mlp data=eigenspectrum
This command will train the MLP model on sorted eigenspectrum data.
-
The trained models and evaluation results will be saved in the
exps
directory.
-
The weights and loss plots will be saved in the 'model_exp_version/weight/' and 'model_exp_version/visualize/' directories, respectively.
-
The experiment results, including the MAE for each fold and the training log, will be saved in the
exps/mae_log.txt
file.
-
Some models requires CUDA-compatible GPUs for faster training. If CUDA is not available, the model will be trained on the CPU.
-
The MLP and GNN model training process may take a long time, depending on the size of the dataset and the number of iterations. The current configuration uses 5-fold cross-validation and 50000 iterations.
-
The MAE (Mean Absolute Error) for each fold and the final MAE will be printed after the training process is complete.
This project uses the QM7 dataset. For more information about the dataset, refer to the original source.
This project is licensed under the MIT License. See the LICENSE
file for more information.