This project is a re-implementation of the SGC described in Simplifying Graph Convolutional Networks (ICML 2019). The main goal of this project is to verify the claims of the paper and to extend the experiments on SGC and GCN across several page-page datasets.
See Evaluation of Simple and Deep Graph Convolutional Network Variants.
- Download the datasets in the Datasets section.
- Upload `simple_gcn_project.ipynb` to a Google Colab instance, preferably from Google Drive.
- Mount the datasets to the Google Colab instance; this can be done easily through Google Drive and Section 2 in the notebook.
- Change the runtime type to `T4`.
- Connect to a hosted `T4` runtime.
- Run the following sections in the notebook to set up the environment, configure the models, and load the necessary functions:
  - Section 1: Global Dependencies.
  - Section 2: Mount Datasets if not already done.
  - Section 3: Random Seed.
  - Section 4: Models.
  - Section 5: Normalization.
  - Section 6: Analysis.
  - Section 7: Loading Datasets if running Citation Network tests.
  - Section 8: Train & Test Functions.
  - Section 9: Results Handling.
  - Section 11: Extended Dataset Loading if running Extended Dataset tests.
- Run the following sections in the notebook to run the tests:
  - Section 10: Citation Network Testing for SGC and GCN if running Citation Network tests.
  - Section 12: Extended Dataset Testing on SGC if running Extended Dataset tests.
  - Section 13: Extended Dataset Testing on GCN if running Extended Dataset tests.
- Wait for the tests to finish.
`simple_gcn_project.ipynb` is structured as 13 distinct sections:
- Section 1: Global Dependencies
  - Dependencies used across most sections.
- Section 2: Mount Datasets
  - Allows mounting a personal Google Drive to `/content/drive` of the Google Colab environment.
- Section 3: Random Seed
  - A function to set the `numpy` and `torch` random seeds.
  - Reference: PyTorch Docs: Reproducibility.
- Section 4: Models
  - Models for SGC, GCL, and GCN.
  - Reference: SGC/models.py.
- Section 5: Normalization
  - Functions to convert SciPy sparse matrices to PyTorch tensors, normalize adjacency matrices, and normalize matrix rows.
  - Reference: SGC/normalization.py.
- Section 6: Analysis
  - A simple accuracy function.
  - Reference: SGC/metrics.py.
- Section 7: Loading Datasets
  - Functions to parse lists from index files to build graph structures, load the citation networks, and pre-process the adjacency matrices.
  - Reference: SGC/utils.py. Removed the `citeseer`-specific node isolation modifications and moved the adjacency matrix pre-processing out of the citation network loading function for the training/testing of GCN.
- Section 8: Train & Test Functions
  - Functions to train and test SGC and GCN with configurable settings. SGC training allows a configurable optimizer; GCN is set in the training function to use negative log-likelihood loss (`torch.nn.NLLLoss`), following the configuration in the original GCN paper.
- Section 9: Results Handling
  - A function to print the train and test results in an easy-to-read table.
- Section 10: Citation Network Testing for SGC and GCN
  - This section requires an `sgcn_data` folder with the Cora and Pubmed datasets; see Datasets for a link to this dataset.
  - The first code section configures the SGC model, imports the dataset, pre-processes the adjacency matrix, trains the model with the cross-entropy loss function (`torch.nn.CrossEntropyLoss`) and the Adam optimizer (`torch.optim.Adam`), and produces average training time and test accuracy results over 20 runs with 100 epochs each.
  - The second code section configures the GCN model, imports the datasets, trains the model with the Adam optimizer (`torch.optim.Adam`), and produces the average training time and test accuracy results over 20 runs with 100 epochs each.
- Section 11: Extended Dataset Loading
  - A function to import the extended page-page datasets and random training/validation/testing masks from the specified split file.
  - Reference: Geom-GCN/utils_data.py. Limited to the four page-page datasets intended for the extension tests (Chameleon, Cornell, Texas, and Wisconsin) and removed the Geom-GCN-specific embedding mode specifications.
- Section 12: Extended Dataset Testing on SGC
  - This section requires a `new_data` folder with the Chameleon, Cornell, Texas, and Wisconsin datasets in individual subfolders with those lowercase names; see Datasets for links to these datasets.
  - This section follows the same configuration as Section 10 for SGC, but runs 10 random splits of the training/validation/testing nodes and produces the average training time and test accuracy over 20 runs with 100 epochs each.
  - Post-run results are shown for each random split, along with the overall training time and test accuracy averages.
- Section 13: Extended Dataset Testing on GCN
  - This section has the same requirement as Section 12 for the `new_data` folder, with the page-page datasets stored in correctly named subfolders.
  - This section follows the same configuration as Section 10 for GCN, but runs 10 random splits of the training/validation/testing nodes and produces the average training time and test accuracy over 20 runs with 100 epochs each.
  - Post-run results are printed for each random split, along with the overall training time and test accuracy averages, as in Section 12.
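The seed-setting described in Section 3 can be sketched as follows; the function name is illustrative and the notebook's exact implementation may differ:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed every random number generator the experiments touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # On CUDA runtimes such as the T4, this also seeds all GPU generators;
    # it is a no-op on CPU-only machines.
    torch.cuda.manual_seed_all(seed)
```

PyTorch's reproducibility notes recommend seeding every library that draws random numbers, which is why Python's `random`, NumPy, and PyTorch are all covered.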
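Section 5's normalization can be sketched with SciPy and PyTorch roughly as follows; function names are illustrative, and SGC/normalization.py is the reference implementation:

```python
import numpy as np
import scipy.sparse as sp
import torch


def aug_normalized_adjacency(adj: sp.spmatrix) -> sp.coo_matrix:
    """Symmetrically normalize A + I as D^{-1/2} (A + I) D^{-1/2},
    the "augmented" normalization used by SGC."""
    adj = sp.coo_matrix(adj) + sp.eye(adj.shape[0])
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0  # guard isolated nodes
    d_mat = sp.diags(d_inv_sqrt)
    return (d_mat @ adj @ d_mat).tocoo()


def sparse_to_torch(matrix: sp.coo_matrix) -> torch.Tensor:
    """Convert a SciPy COO matrix to a torch sparse COO tensor."""
    matrix = matrix.tocoo().astype(np.float32)
    indices = torch.from_numpy(
        np.vstack((matrix.row, matrix.col)).astype(np.int64))
    values = torch.from_numpy(matrix.data)
    return torch.sparse_coo_tensor(indices, values, matrix.shape)
```

For a single edge between two nodes, A + I is the all-ones 2x2 matrix with degree 2 everywhere, so every entry of the normalized adjacency comes out to 0.5.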
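To illustrate Sections 4, 8, and 10, here is a minimal SGC sketch: the K-step propagation S^K X is precomputed once, after which the model reduces to a single linear layer trained with cross-entropy loss and Adam. Names and hyperparameters here are illustrative, not the notebook's exact configuration:

```python
import torch
import torch.nn as nn


class SGC(nn.Module):
    """SGC: a single linear layer applied to pre-propagated features."""

    def __init__(self, nfeat: int, nclass: int):
        super().__init__()
        self.W = nn.Linear(nfeat, nclass)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W(x)


def sgc_precompute(features: torch.Tensor, adj: torch.Tensor,
                   degree: int = 2) -> torch.Tensor:
    """Apply the normalized adjacency K times before training (S^K X)."""
    for _ in range(degree):
        features = (torch.sparse.mm(adj, features)
                    if adj.is_sparse else adj @ features)
    return features


def train_sgc(model: nn.Module, feats: torch.Tensor, labels: torch.Tensor,
              epochs: int = 100, lr: float = 0.2,
              weight_decay: float = 1e-5) -> nn.Module:
    """Train with cross-entropy loss and the Adam optimizer."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(feats), labels)
        loss.backward()
        optimizer.step()
    return model
```

Because the propagation is done once up front, each training epoch costs only a single dense matrix multiply, which is where SGC's speed advantage over GCN comes from.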
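The accuracy metric in Section 6 amounts to a few lines; this is a sketch in the spirit of SGC/metrics.py:

```python
import torch


def accuracy(output: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of rows whose argmax matches the label."""
    preds = output.argmax(dim=1)
    return (preds == labels).float().mean().item()
```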
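Sections 11 through 13 read training/validation/testing masks from Geom-GCN's pre-generated split files; generating an equivalent random split from scratch would look roughly like this (the function name and the split fractions are assumptions for illustration, not the split files' actual proportions):

```python
import numpy as np


def random_split_masks(num_nodes: int, train_frac: float = 0.6,
                       val_frac: float = 0.2, seed: int = 0):
    """Disjoint boolean train/validation/test masks over the nodes."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_nodes)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train_mask = np.zeros(num_nodes, dtype=bool)
    val_mask = np.zeros(num_nodes, dtype=bool)
    test_mask = np.zeros(num_nodes, dtype=bool)
    train_mask[order[:n_train]] = True
    val_mask[order[n_train:n_train + n_val]] = True
    test_mask[order[n_train + n_val:]] = True
    return train_mask, val_mask, test_mask
```

Each of the 10 random splits in Sections 12 and 13 corresponds to one such mask triple, and results are averaged across them.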
- Cora and Pubmed (Citation Networks): The original dataset source can be found at tkipf/gcn:gcn/data. The dataset used for this project can be found at Tiiiger/SGC:data.
- Cornell, Texas, and Wisconsin can be found at bingzhewei/geom-gcn:new_data.
- Chameleon can be found at chennnM/GCNII:new_data/chameleon.

Combine the `new_data` folders into a single `new_data` folder containing all four datasets.