usnistgov/jarvis

What are units and normalization factors in QM9 dataset?

Nokimann opened this issue · 3 comments

I used the following code:

from jarvis.db.figshare import data
d = data('qm9_std_jctc')

The first entry of the QM9 dataset as loaded from JARVIS:

{'mu': -1.77790756800166,
 'alpha': -7.59467417670514,
 'HOMO': -6.71425764235072,
 'LUMO': 2.24686567442436,
 'gap': 5.35591684810335,
 'R2': -4.11464477806684,
 'ZPVE': -3.14893653207103,
 'U0': 5.70989371834825,
 'U': 5.69336539320842,
 'H': 5.68508295617329,
 'G': 5.75764468354196,
 'Cv': -6.18353212813309,
 'omega1': -1.3203823354756,
 'SMILES': 'C',
 'SMILES_relaxed': 'C',
 'id': '000001',
 'atoms': {'lattice_mat': [[60, 0, 0], [0, 60, 0], [0, 0, 60]],
  'coords': [[0.4999998496686667, 0.5000001250963333, 0.4999999923633333],
   [0.5002473255336667, 0.481802867173, 0.4998995777733333],
   [0.5170736659886667, 0.5062992418296667, 0.4998712520133333],
   [0.49119790078366665, 0.5060288326963334, 0.48525591384666666],
   [0.4914812580253333, 0.5058689332046666, 0.5149732640033333]],
  'elements': ['C', 'H', 'H', 'H', 'H'],
  'abc': [60.0, 60.0, 60.0],
  'angles': [90.0, 90.0, 90.0],
  'cartesian': False,
  'props': ['', '', '', '', '']}}

And here is the original first entry of the QM9 dataset, together with its format description:

5
gdb 1	157.7118	157.70997	157.70699	0.	13.21	-0.3877	0.1171	0.5048	35.3641	0.044749	-40.47893	-40.476062	-40.475117	-40.498597	6.469	
C	-0.0126981359	 1.0858041578	 0.0080009958	-0.535689
H	 0.002150416	-0.0060313176	 0.0019761204	 0.133921
H	 1.0117308433	 1.4637511618	 0.0002765748	 0.133922
H	-0.540815069	 1.4475266138	-0.8766437152	 0.133923
H	-0.5238136345	 1.4379326443	 0.9063972942	 0.133923
1341.307	1341.3284	1341.365	1562.6731	1562.7453	3038.3205	3151.6034	3151.6788	3151.7078
C	C	
InChI=1S/CH4/h1H4	InChI=1S/CH4/h1H4
Line       Content
----       -------
1          Number of atoms na
2          Properties 1-17 (see below)
3,...,na+2 Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3       Frequencies (3na-5 or 3na-6)
na+4       SMILES from GDB9 and for relaxed geometry
na+5       InChI for GDB9 and for relaxed geometry

The properties stored in the second line of each file:

I.  Property  Unit         Description
--  --------  -----------  --------------
 1  tag       -            "gdb9"; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of Highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of Lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K

I. = Property index (properties are given in this order)
For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.
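
For reference, the layout of the property line described above can be read into named fields with a few lines of Python. This is only a sketch: the field names are taken from the table, the helper `parse_property_line` is a hypothetical name and not part of JARVIS or the QM9 distribution, and the input is the methane line quoted above.

```python
# Field names follow the property table above (indices 1-17).
QM9_FIELDS = [
    "tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]


def parse_property_line(line):
    """Split line 2 of a QM9 xyz file into a dict keyed by property name.

    Hypothetical helper for illustration; not a JARVIS function.
    """
    tokens = line.split()  # fields are whitespace/tab separated
    if len(tokens) != len(QM9_FIELDS):
        raise ValueError(f"expected {len(QM9_FIELDS)} fields, got {len(tokens)}")
    props = {"tag": tokens[0], "index": int(tokens[1])}
    for name, token in zip(QM9_FIELDS[2:], tokens[2:]):
        props[name] = float(token)  # values stay in the units of the table
    return props


# Property line of the first molecule (methane), copied from the entry above.
line = (
    "gdb 1\t157.7118\t157.70997\t157.70699\t0.\t13.21\t-0.3877\t0.1171\t"
    "0.5048\t35.3641\t0.044749\t-40.47893\t-40.476062\t-40.475117\t"
    "-40.498597\t6.469"
)
methane = parse_property_line(line)
print(methane["homo"], methane["lumo"], methane["gap"])  # values in Hartree
```

The values parsed this way are still in the original units (Hartree, Debye, Bohr^3, ...), so they cannot be compared directly with the JARVIS entry above.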

I found that the units are converted and the values are normalized.
For example, for homo, lumo, ...:
Hartree -> eV, and then standardized with the mean and std computed over the entire dataset.

How can I get the unit and the mean/std factors for each property?
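
To make the suspected transformation concrete, here is a minimal sketch of it, assuming the pipeline really is "Hartree -> eV, then z-score over the dataset". The conversion factor is the CODATA 2018 value. Only the first toy value (the methane HOMO, -0.3877 Hartree) comes from the entry above; the other toy values are made up for illustration, and using the population standard deviation is an assumption, since the thread does not say which convention was used.

```python
import statistics

HARTREE_TO_EV = 27.211386245988  # CODATA 2018 value of 1 Hartree in eV


def hartree_to_ev(energy_hartree):
    """Convert an energy from Hartree to eV."""
    return energy_hartree * HARTREE_TO_EV


def standardize(values):
    """Return z-scores plus the mean and std used to compute them.

    Assumption: population std (pstdev); the real dataset may differ.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std


homo_ev = hartree_to_ev(-0.3877)  # methane HOMO, about -10.55 eV

# Toy "dataset": first value is methane's HOMO, the rest are MADE UP.
toy_homo_hartree = [-0.3877, -0.25, -0.24, -0.23]
toy_homo_ev = [hartree_to_ev(v) for v in toy_homo_hartree]
z_scores, mean_ev, std_ev = standardize(toy_homo_ev)
print(homo_ev, z_scores[0], mean_ev, std_ev)
```

Without the actual mean and std of the full dataset, the standardized JARVIS values cannot be mapped back to eV, which is why those factors are needed.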

knc6 commented

Hi,

The QM9 dataset is adapted from the GDrive link provided by Faber et al. They provide the mean/std in the qm9-prop-stats-v1 file and the normalized dataset in the qm9-mol-info-standardized-v1 file.
The units can be found in Faber et al. (Table 3 and 4), or Choudhary et al. (Table 5).
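
Once the mean and std for a property are known, the standardized value can be mapped back to physical units by inverting the z-score. The following is a sketch only: `destandardize` is a hypothetical helper, the layout of the stats dictionary is an assumption (the format of the qm9-prop-stats-v1 file is not shown in this thread), and the mean/std numbers are arbitrary placeholders, NOT the real QM9 statistics.

```python
def destandardize(z, mean, std):
    """Invert z = (x - mean) / std to recover x in physical units."""
    return z * std + mean


# ARBITRARY PLACEHOLDERS: replace with the values from the stats file.
placeholder_stats = {"HOMO": {"mean": -1.0, "std": 2.0}}

# Standardized HOMO of the first entry, as returned by JARVIS above.
entry = {"HOMO": -6.71425764235072}

stats = placeholder_stats["HOMO"]
homo_physical = destandardize(entry["HOMO"], stats["mean"], stats["std"])

# Round trip: standardizing again should reproduce the stored value.
z_again = (homo_physical - stats["mean"]) / stats["std"]
print(homo_physical, z_again)
```

With the real statistics, the recovered value should be in the unit listed in the tables cited above.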

Thank you, @knc6.
So the mean/std can't currently be loaded directly from JARVIS?

I don't think it's a good idea to provide only standardized data, as it invites the same evaluation error as in ALIGNN. I've observed this confusion between scaled and original data (and internal energy vs. atomization energy) on QM9 in multiple previous papers as well.

It would be great if you would instead provide the data in real units, as done e.g. by PyG: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9