
Crash when loading ogbn_proteins

JonasDeSchouwer opened this issue · 1 comments

I am trying to execute the following line:

ogb_dataset = NodePropPredDataset(name="ogbn-proteins", root="datasets/data/ogb")

This starts off doing what it is supposed to:

  • it downloads from the correct url
  • it extracts this zip into the directory datasets/data/ogb/ogbn_proteins with subdirectories mapping, raw, processed, split
  • it loads the graph and labels, and preprocesses them.

However, as soon as it gets to the line

{'graph': self.graph, 'labels': self.labels}, pre_processed_file_path, pickle_protocol=4)

in ogb/nodeproppred/ (= line 135 in the version I am running), the program crashes without any error messages, and only an empty file is saved to datasets/data/ogb/ogbn_proteins/processed/data_processed.
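Since the crash leaves no error message, one way to get more information is to run the failing step in a separate process and inspect how that process ended. The following is a minimal sketch, not part of ogb: `run_isolated` is a hypothetical helper, and it assumes a POSIX system, where a negative return code means the child was killed by a signal (for example SIGKILL from the out-of-memory killer, or SIGSEGV).

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child process and describe how it ended."""
    # -X faulthandler makes the child dump a traceback on fatal signals
    # such as SIGSEGV, which shows up in the captured stderr.
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the killing signal.
        status = f"killed by {signal.Signals(-proc.returncode).name}"
    else:
        status = f"exited with code {proc.returncode}"
    return status, proc.stderr


# Placeholder snippet; replace it with the code that performs the failing save.
status, stderr = run_isolated("print('ok')")
print(status)
print(stderr)
```

If the status reports SIGKILL, running out of memory is a plausible cause; if it reports SIGSEGV, the captured stderr should contain the faulthandler traceback.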

I have been able to reproduce this by just loading self.graph and self.labels in a notebook by executing the following code:

graph = read_csv_graph_raw(raw_dir, add_inverse_edge=True, additional_node_files=['node_species'], additional_edge_files=[])[0]
labels = pd.read_csv(osp.join(raw_dir, 'node-label.csv.gz'), compression='gzip', header=None).values

Then I can save labels and graph["node_species"] to a file without problems, but as soon as I try to save anything containing graph["edge_index"] or graph["edge_feat"], the kernel crashes. Note that these arrays are large: graph["edge_index"] has shape (2, 79122504) and graph["edge_feat"] has shape (79122504, 8). All matrices look normal to me, so my guess is that the problem lies in handling large files, even though the matrices are smaller than the maximum size reported in this issue. Still, I thought it would be useful to let you know, and perhaps we can find a workaround.
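For reference, the in-memory size of the two arrays can be estimated from their shapes. This is only a rough sketch: the shapes come from the report above, but the element sizes (8 bytes for int64 indices, 4 bytes for float32 features) are assumptions and have not been checked against the loaded arrays.

```python
import math


def estimate_nbytes(shape, itemsize):
    """Return the size in bytes of a dense array with this shape and item size."""
    return math.prod(shape) * itemsize


# Shapes are taken from the report; the item sizes are assumptions
# (8 bytes for int64 indices, 4 bytes for float32 features).
edge_index_bytes = estimate_nbytes((2, 79122504), 8)
edge_feat_bytes = estimate_nbytes((79122504, 8), 4)

for name, nbytes in [("edge_index", edge_index_bytes), ("edge_feat", edge_feat_bytes)]:
    print(f"{name}: {nbytes / 1e9:.2f} GB")
```

On the real data, the `.nbytes` and `.dtype` attributes of the loaded arrays give the exact figures.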


  • Ubuntu 20.04.6 LTS
  • Python 3.12.3
  • torch 2.2.2+cu121

Output from conda:

To reproduce this issue:

In the terminal:

conda create -n test_save_env
conda activate test_save_env
conda install python=3.12
pip install ogb==1.3.6

Note that ogb has torch as a dependency, so in my case it installs torch 2.3.1. But I observed the same behaviour with torch 2.2.2+cu121.

Then run the following Python code:

from ogb.io.read_graph_raw import read_csv_graph_raw
import pandas as pd
import os.path as osp
import torch

raw_dir = "datasets/data/ogb/ogbn_proteins/raw"

graph = read_csv_graph_raw(raw_dir, add_inverse_edge=True, additional_node_files=['node_species'], additional_edge_files=[])[0]
labels = pd.read_csv(osp.join(raw_dir, 'node-label.csv.gz'), compression='gzip', header=None).values

In my case, this gives the following error (in a notebook):

The Kernel crashed while executing code in the current cell or a previous cell. 
Please review the code in the cell(s) to identify a possible cause of the failure. 
Click here for more info.
View Jupyter [log](command:jupyter.viewOutput) for further details.
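A possible workaround to try, sketched under assumptions: write each array of the graph dictionary to its own .npy file with NumPy instead of saving everything in one call. `save_arrays` and `load_arrays` are hypothetical helpers, not part of ogb, and it is untested whether this avoids the crash on the full dataset. The example below uses a tiny stand-in dictionary, not the real graph.

```python
import os
import tempfile

import numpy as np


def save_arrays(arrays, out_dir):
    """Write each ndarray value to its own .npy file and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for key, value in arrays.items():
        # Non-array entries (for example an integer node count) are skipped.
        if isinstance(value, np.ndarray):
            path = os.path.join(out_dir, f"{key}.npy")
            np.save(path, value)
            paths[key] = path
    return paths


def load_arrays(paths):
    """Load the arrays written by save_arrays back into a dictionary."""
    return {key: np.load(path) for key, path in paths.items()}


# Tiny stand-in for the real graph dictionary.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"edge_index": np.zeros((2, 5), dtype=np.int64), "num_nodes": 3}
    restored = load_arrays(save_arrays(demo, tmp))

print({key: value.shape for key, value in restored.items()})
```

Saving one array per file also shows which array triggers the crash, since the files written before the failure remain on disk.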