aspuru-guzik-group/Tartarus

Dataset for Surrogate Models

Closed this issue · 6 comments

Hi,

Which dataset did you use to train your surrogate models for the "Design of Organic Photovoltaics" tasks? In your SI, section B.1.b, there is a link to "http://github.com/HIPS/neural-fingerprint", but there I can only find the CEP database power conversion efficiency data.

Is it hce.csv? If so, does this hce.csv file contain what you call CEP_SUB in the paper? Or did you use the full CEPDB (with 2.3 million molecules) to train the surrogate models?

Thanks for these benchmarks, they are very useful!

Hi @ftherrien

The dataset is located in gdb13.csv; if you can wait for a few days, we are also about to update the code, making the calculations & molecules more feasible.

This should be done over the next few days. Additionally, thank you for pointing out the error -- we will have a look & update the manuscript.

Regards
Akshat

gdb13.csv seems to be for the organic emitters task. Are you sure it is not hce.csv?

Oops I misread your question @ftherrien :)

You are correct. The dataset is in hce.csv (gdb13 is for a different task). The link http://github.com/HIPS/neural-fingerprint was used to get all the SMILES strings. hce.csv contains only CEP_SUB (the model was trained only on this, not on the 2.3 million molecules).

Regards
Akshat

That makes sense, thanks!

For the exploratory task, also under the Design of Organic Photovoltaics (Table II), you used 1000 randomly generated molecules. 1. Is that dataset available? 2. Did you train the surrogate models on them?

  1. Yes: datasets/hce_unbiased.csv
  2. No -- we use the same trained model (as the biased task) :)
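
For anyone following along, here is a minimal sketch for checking what the two files contain before using them. It assumes you run it from the root of a local clone and that hce.csv sits next to hce_unbiased.csv in datasets/ (only the second path is confirmed above). The helper name summarize_csv is hypothetical and not part of the Tartarus code. It only reports the header row and the number of data rows, so it makes no assumption about the column names.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row is taken as the header
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


if __name__ == "__main__":
    # Assumed paths: datasets/hce.csv is a guess based on the location of
    # datasets/hce_unbiased.csv; adjust both to match your clone.
    for name in ("datasets/hce.csv", "datasets/hce_unbiased.csv"):
        path = Path(name)
        if path.exists():
            columns, rows = summarize_csv(path)
            print(f"{name}: {rows} rows, columns = {columns}")
        else:
            print(f"{name}: not found (adjust the path to your clone)")
```

If the row count of hce_unbiased.csv matches the 1000 molecules mentioned for the exploratory task, that is a quick sanity check that you have the right file.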

Great, thanks for your quick replies!