What are the units and normalization factors in the QM9 dataset?
Nokimann opened this issue · 3 comments
I used the following code:
from jarvis.db.figshare import data
d = data('qm9_std_jctc')
The first entry in the QM9 dataset obtained from JARVIS:
{'mu': -1.77790756800166,
'alpha': -7.59467417670514,
'HOMO': -6.71425764235072,
'LUMO': 2.24686567442436,
'gap': 5.35591684810335,
'R2': -4.11464477806684,
'ZPVE': -3.14893653207103,
'U0': 5.70989371834825,
'U': 5.69336539320842,
'H': 5.68508295617329,
'G': 5.75764468354196,
'Cv': -6.18353212813309,
'omega1': -1.3203823354756,
'SMILES': 'C',
'SMILES_relaxed': 'C',
'id': '000001',
'atoms': {'lattice_mat': [[60, 0, 0], [0, 60, 0], [0, 0, 60]],
'coords': [[0.4999998496686667, 0.5000001250963333, 0.4999999923633333],
[0.5002473255336667, 0.481802867173, 0.4998995777733333],
[0.5170736659886667, 0.5062992418296667, 0.4998712520133333],
[0.49119790078366665, 0.5060288326963334, 0.48525591384666666],
[0.4914812580253333, 0.5058689332046666, 0.5149732640033333]],
'elements': ['C', 'H', 'H', 'H', 'H'],
'abc': [60.0, 60.0, 60.0],
'angles': [90.0, 90.0, 90.0],
'cartesian': False,
'props': ['', '', '', '', '']}}
And the first entry in the original QM9 dataset, with its format description:
5
gdb 1 157.7118 157.70997 157.70699 0. 13.21 -0.3877 0.1171 0.5048 35.3641 0.044749 -40.47893 -40.476062 -40.475117 -40.498597 6.469
C -0.0126981359 1.0858041578 0.0080009958 -0.535689
H 0.002150416 -0.0060313176 0.0019761204 0.133921
H 1.0117308433 1.4637511618 0.0002765748 0.133922
H -0.540815069 1.4475266138 -0.8766437152 0.133923
H -0.5238136345 1.4379326443 0.9063972942 0.133923
1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034 3151.6788 3151.7078
C C
InChI=1S/CH4/h1H4 InChI=1S/CH4/h1H4
Line        Content
----        -------
1           Number of atoms na
2           Properties 1-17 (see below)
3,...,na+2  Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3        Frequencies (3na-5 or 3na-6)
na+4        SMILES from GDB9 and for relaxed geometry
na+5        InChI for GDB9 and for relaxed geometry
The properties stored in the second line of each file:
I.  Property  Unit         Description
--  --------  -----------  --------------
1   tag       -            "gdb9"; string constant to ease extraction via grep
2   index     -            Consecutive, 1-based integer identifier of molecule
3   A         GHz          Rotational constant A
4   B         GHz          Rotational constant B
5   C         GHz          Rotational constant C
6   mu        Debye        Dipole moment
7   alpha     Bohr^3       Isotropic polarizability
8   homo      Hartree      Energy of Highest occupied molecular orbital (HOMO)
9   lumo      Hartree      Energy of Lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K
I. = Property index (properties are given in this order)
For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.
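To make the mapping between the property line and the table above concrete, here is a minimal sketch that splits line 2 of an original QM9 file into named fields. The function name `parse_property_line` and the constant `PROPERTY_NAMES` are hypothetical helpers written for this illustration; they are not part of JARVIS or of the QM9 distribution.

```python
# Field names in the order given by the property table (columns 1-17).
PROPERTY_NAMES = [
    "tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]


def parse_property_line(line):
    """Map the whitespace-separated fields of line 2 to named properties.

    Values stay in the units of the original file (Hartree, Bohr^3, ...).
    """
    fields = line.split()
    if len(fields) != len(PROPERTY_NAMES):
        raise ValueError(
            f"expected {len(PROPERTY_NAMES)} fields, got {len(fields)}"
        )
    # The first two fields are the string tag and the integer index.
    props = {"tag": fields[0], "index": int(fields[1])}
    # The remaining 15 fields are floating-point properties.
    for name, raw in zip(PROPERTY_NAMES[2:], fields[2:]):
        props[name] = float(raw)
    return props


# Property line of the first molecule (methane), copied from above.
METHANE_LINE = (
    "gdb 1 157.7118 157.70997 157.70699 0. 13.21 -0.3877 0.1171 0.5048 "
    "35.3641 0.044749 -40.47893 -40.476062 -40.475117 -40.498597 6.469"
)

props = parse_property_line(METHANE_LINE)
print(props["homo"], props["lumo"], props["gap"])  # values in Hartree
```

Reading the raw file this way gives values in the original units, which is useful as a reference when comparing against the standardized numbers.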
I found that the units have been converted and the values standardized.
For example, for homo, lumo, ...
Hartree -> eV, and then standardized using the mean and std over the entire dataset.
How can I get the unit and the mean/std factors for each property?
Hi,
The QM9 dataset is adapted from the GDrive link from Faber et al. They provide the mean/std in the qm9-prop-stats-v1 file and the normalized dataset in the qm9-mol-info-standardized-v1 file.
The units can be found in Faber et al. (Table 3 and 4), or Choudhary et al. (Table 5).
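As a rough sketch of how the mean/std could be used to get back to real units, assuming the standardization is a plain z-score, (value - mean) / std, applied after the Hartree -> eV conversion described in the question. The mean and std below are made-up placeholders, not the numbers from qm9-prop-stats-v1; substitute the real ones from that file.

```python
HARTREE_TO_EV = 27.211386  # 1 Hartree in eV (rounded)


def standardize(value, mean, std):
    """z-score: shift by the dataset mean and scale by the dataset std."""
    return (value - mean) / std


def destandardize(z, mean, std):
    """Invert the z-score to recover the value in real units."""
    return z * std + mean


# HYPOTHETICAL statistics for illustration only (assumed to be in eV).
homo_mean_ev = -6.5
homo_std_ev = 0.6

# Methane HOMO from the original QM9 file, in Hartree.
homo_hartree = -0.3877
homo_ev = homo_hartree * HARTREE_TO_EV

z = standardize(homo_ev, homo_mean_ev, homo_std_ev)
recovered_ev = destandardize(z, homo_mean_ev, homo_std_ev)

print(f"HOMO = {homo_ev:.4f} eV, round trip = {recovered_ev:.4f} eV")
```

With the real statistics, `destandardize` applied to a value from `data('qm9_std_jctc')` should reproduce the property in the units listed in the cited tables, if the z-score assumption holds.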
I don't think it's a good idea to provide only standardized data, as it invites the same evaluation error as in ALIGNN. I've observed this confusion between scaled and original data (and internal energy vs. atomization energy) on QM9 in multiple previous papers as well.
It would be great if you would instead provide the data in real units, as done e.g. by PyG: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9
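To illustrate why the scaled/original confusion matters for evaluation, here is a small sketch, again assuming z-score standardization: an MAE computed on standardized targets differs from the MAE in real units by a factor of the std, so the two cannot be compared directly. All numbers below are invented for the example.

```python
def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# HYPOTHETICAL statistics and values, for illustration only.
mean, std = -6.5, 0.6                # assumed real units: eV
targets_std = [0.10, -0.50, 1.20]    # standardized targets
preds_std = [0.15, -0.40, 1.00]      # standardized predictions

# Error as it would be reported if the scaling were overlooked.
mae_std = mae(preds_std, targets_std)

# Error in real units: map both sides back before comparing.
targets_real = [t * std + mean for t in targets_std]
preds_real = [p * std + mean for p in preds_std]
mae_real = mae(preds_real, targets_real)

# The mean cancels in the difference, so only the std rescales the MAE.
print(f"MAE (standardized) = {mae_std:.4f}")
print(f"MAE (real units)   = {mae_real:.4f} = {mae_std:.4f} * {std}")
```

Because the mean cancels, multiplying a standardized MAE by the property's std is enough to convert it, but only if it is clear which of the two was reported.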