The repository was developed by: Bjarke Christensen, Bertram Højer, Karl Kieler & Oliver Jarvis
This project is a variation of the 'ogbn-products' node-level classification task [https://ogb.stanford.edu/docs/nodeprop/#ogbn-products]. The original dataset is a graph of Amazon products that are purchased together, where the node-level features are PCA-derived representations of bag-of-words embeddings of the product descriptions. We update the graph so that the node-level features are instead BERT embeddings, and we compare a simple MLP model to a ClusterGCN model.
To train the embedding models, the dataset is needed. It can be downloaded from here.
Once downloaded, add the .csv file to the data folder.
Instead of downloading the data, you can also pre-process it yourself. This requires downloading the following files, unpacking them, and adding them to the data folder:
- Amazon-3M: https://drive.google.com/file/d/1gsabsx8KR2N9jJz16jTcA0QASXsNuKnN/view?usp=sharing
- nodeidx2asin.csv: https://drive.google.com/file/d/1hxWJ3e_Jfk9HCdOqKSEk3HMFqpylkgak/view?usp=share_link
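If you prefer to script the download, the Google Drive files can be fetched with the gdown package, as in the sketch below. gdown is not part of the repository's requirements, and the output file names are placeholders.

```python
# Hypothetical download helper using gdown (not listed in requirements.txt);
# the output paths are placeholders, not the repository's actual file names.
import gdown

gdown.download(
    "https://drive.google.com/file/d/1gsabsx8KR2N9jJz16jTcA0QASXsNuKnN/view?usp=sharing",
    "data/Amazon-3M.zip",    # placeholder archive name
    fuzzy=True,              # let gdown extract the file id from the full URL
)
gdown.download(
    "https://drive.google.com/file/d/1hxWJ3e_Jfk9HCdOqKSEk3HMFqpylkgak/view?usp=share_link",
    "data/nodeidx2asin.csv",
    fuzzy=True,
)
```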
Sentence embeddings are created for all original product descriptions using the 'SentenceTransformers' library with the MiniLM model [https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2]. All matching nodes in the graph are then updated with the new embedding features; in this step the graph shrinks from 2.45M to 2.35M nodes.
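As an illustration, the embedding step looks roughly like the sketch below. The CSV path and column name are placeholders; the actual pipeline is implemented in src/create_embeddings.py.

```python
# Rough sketch of the embedding step; file and column names are placeholders,
# the real pipeline lives in src/create_embeddings.py.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical CSV holding one product description per row.
descriptions = pd.read_csv("data/descriptions.csv")["description"].fillna("").tolist()

# Encode every description into a 384-dimensional sentence embedding.
embeddings = model.encode(descriptions, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # (num_products, 384)
```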
The MLP models were trained in Google Colab notebooks:

- Bag-of-words implementation: https://colab.research.google.com/drive/1CafDsj3n39SBXOpTZZtThuFbu5J8ZU4N?usp=sharing
- SBERT implementation: https://colab.research.google.com/drive/1vu1sAFRL2sGWkHGQ0EWl1td0KNR6gdDd?usp=sharing
The MLP we train has a deliberately simple architecture: the goal was only to assess whether exploiting the graph structure offers a benefit over classifying on the node embeddings alone. The difference in model size is due to the different dimensionalities of the BOW and SBERT embedding spaces; a minimal sketch of the architecture is given after the results table below.
| Embedding | Parameters | Acc. | F1 |
|---|---|---|---|
| BOW | 221,999 | 73.6 % | 72.2 % |
| BERT | 367,457 | 86.6 % | 86.2 % |
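A minimal sketch of such an MLP follows. The hidden size and dropout are assumptions; the actual architectures are defined in the Colab notebooks linked above.

```python
# Minimal MLP sketch; hidden size and dropout are assumptions, the actual
# architectures are defined in the linked Colab notebooks.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 256, num_classes: int = 47):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# The original ogbn-products features (PCA of bag-of-words) are 100-dimensional,
# while the MiniLM sentence embeddings are 384-dimensional, which is why the two
# MLPs differ in parameter count.
bow_mlp = MLP(in_dim=100)
bert_mlp = MLP(in_dim=384)
```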
General GCNs have a bottleneck in terms of efficiency and scalability. ClusterGCN combats this by partitioning the graph into subgraphs and performing the convolutions on these, reducing time and memory complexity; the outputs of the clusters are aggregated to obtain the final prediction. The training scripts are src/ogbn.py and src/ogbn_pca.py. As with the MLPs, the difference in model size is due to the different dimensionalities of the BOW and SBERT embedding spaces; a minimal training sketch is given after the results table below.
| Embedding | Parameters | Acc. | F1 |
|---|---|---|---|
| BOW | 75,576 | 87.5 % | 86.5 % |
| BERT | 220,097 | 90.1 % | 90.6 % |
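A minimal Cluster-GCN training loop with PyTorch Geometric might look like the sketch below. The number of partitions, batch size, hidden size, and learning rate are placeholders, and a boolean train_mask is assumed to be attached to the data object; the actual training code is in src/ogbn.py and src/ogbn_pca.py.

```python
# Cluster-GCN training sketch (placeholder hyperparameters; assumes a boolean
# train_mask on the data object). See src/ogbn.py and src/ogbn_pca.py for the
# actual training code.
import torch
import torch.nn.functional as F
from torch_geometric.loader import ClusterData, ClusterLoader
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

def train(data, epochs: int = 10):
    # Partition the graph into subgraphs and iterate over mini-batches of clusters,
    # so each convolution only touches a small part of the full graph.
    cluster_data = ClusterData(data, num_parts=1500)
    loader = ClusterLoader(cluster_data, batch_size=32, shuffle=True)

    model = GCN(
        in_dim=data.num_features,
        hidden_dim=128,
        num_classes=int(data.y.max()) + 1,
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index)
            loss = F.cross_entropy(
                out[batch.train_mask], batch.y[batch.train_mask].view(-1)
            )
            loss.backward()
            optimizer.step()
    return model
```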
.
├── README.md
├── models
│ ├── bow-model-checkpoint-100.pt
│ ├── embedding-model-checkpoint-150.pt
│ ├── gcn_model_embedding_2layer_epoch500.pt
│ └── gcn_model_pca_2layer_epoch250.pt
├── requirements.txt
└── src
├── create_dataset.py
├── create_embeddings.py
├── create_graph_dataset.py
├── evaluation.ipynb
├── ogbn.py
└── ogbn_pca.py