(1) Dataset: Harvard Clean Engery Project Dataset (HCEP) Directory: HCEP-50K/ Files: - 50K.basics: Basic atomic type (C, H, Se, etc.) - 50K.train.graph: Training graphs with adjacency matrices - 50K.train.atoms: Training graphs with atomic types (for each vertex) - 50K.train.pce: Training learing taget (PCE value) for each molecule - 50K.test.graph: Testing graphs with adjacency matrices - 50K.test.atoms: Testing graphs with atomic types (for each vertex) - 50K.test.pce: Testing learing taget (PCE value) for each molecule - 50K.val.graph: Validation graphs with adjacency matrices - 50K.val.atoms: Validation graphs with atomic types (for each vertex) - 50K.val.pce: Validation learing taget (PCE value) for each molecule (2) dK-Series features is in directory HCEP-50K-dKSeries/: - File 50K.train.dk-features.[value of d]: dK feature for each molecule in the training set (given the value of d) - File 50K.test.dk-features.[value of d]: dK feature for each molecule in the testing set (given the value of d) - File 50K.val.dk-features.[value of d]: dK feature for each molecule in the validation set (given the value of d) To extract dK-Series feature for the HCEP dataset: $ g++ dk_series_feature_extraction.cpp $ ./a.out [value of d] It will write outputs to files 50K.train.dk-features.[value of d], 50K.test.dk-features.[value of d], 50K.val.dk-features.[value of d] in the directory HCEP-50K-dKSeries/. However, you do not have to extract the dK-Series features again. The features are there in the HCEP-50K-dKSeries/ directory. Warning: For d >= 4, it will take a long time running! (3) Linear Regression in MATLAB: Run linear_regression.m given 1 input parameter that is d. (4) Visualize the histogram between predicted values and ground-truth values in MATLAB: Run multi_histograms.m given 1 input parameter that is d. (5) Visualize the dK-Series features in low-dimensional with Principle Component Analysis (PCA) in MATLAB: Run PCA_visual.m given 1 input parameter that is d. (6) Generate all non-isomorphic graphs (by Back-Tracking algorithm) of size <= d: $ g++ find_non_isomorphic_graphs.cpp $ ./a.out [value of d] Warning: d >= 7 would take very long time! (7) Class Graph.h: Data structure to store graphs (8) Class GraphIsomorphism.h: Algorithm to check whether two graphs are isomorphic (by Back-Tracking algorithm) (9) Presentation slides (in Latex) in directory Presentation/
HyTruongSon/dk-series
Application of dK-series into molecular energy regression. Reference: "Systematic Topology Analysis and Generation Using Degree Correlations". ACM SIGCOMM, September 2006.
Roff