TextSSL

[AAAI 2022] Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification

Figure 1. The architecture of TextSSL.

About data

We use the same benchmark datasets as Yao, Mao, and Luo (2019). For the MR, Ohsumed, and 20NG datasets, we follow the same train/test splits and data preprocessing as Kim (2014) and Yao, Mao, and Luo (2019). We thank the authors for their work.

The R8 and R52 datasets are only available in a preprocessed version that lacks punctuation and has no explicit sample names. Because we construct graphs from documents with sentence segmentation information, we re-extract the data from the original Reuters-21578 dataset.

You can download the dataset here:

  1. Re-extract the R8 and R52 datasets.
    python re-extract_data/mk_R8_R52.py --name R8
    
  2. Remove words.
    python remove_words.py --name R8
    

About path

To run the code, change Your_path=/data/project/yinhuapark/ssl/ to your own path.
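
If you prefer not to edit the path by hand on every machine, one option is to resolve it from an environment variable. The following is only a sketch: the variable name TEXTSSL_PATH and the helper resolve_base_path are hypothetical, and the scripts in this repo do not read them unless you wire them in yourself.

```python
import os
from pathlib import Path


def resolve_base_path(default="/data/project/yinhuapark/ssl/"):
    """Return the project base path.

    TEXTSSL_PATH is a hypothetical environment variable (not defined by this
    repo). If it is unset, fall back to the default shown in the README.
    """
    return Path(os.environ.get("TEXTSSL_PATH", default))
```

You could then assign the result to Your_path in the scripts, for example Your_path = str(resolve_base_path()).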


Make graph dataset

  1. Create co-occurrence pairs for each document.
    python ssl_make_graphs/create_cooc_document.py --name R8 
    
  2. Construct a graph for each document as an InMemoryDataset.
    python ssl_make_graphs/PygDocsGraphDataset.py --name R8 
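
To give a rough idea of what "co-occurrence pairs" means in step 1, here is a minimal sketch that counts unordered word pairs appearing together inside a sliding window over a token list. It is not the code in create_cooc_document.py: the function name, the window size of 3, and the counting scheme are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(tokens, window_size=3):
    """Count unordered word pairs that share a sliding window.

    Illustrative only: window_size=3 and per-window counting are assumptions,
    not necessarily what the repo's script does.
    """
    if len(tokens) <= window_size:
        windows = [tokens]
    else:
        windows = [
            tokens[i:i + window_size]
            for i in range(len(tokens) - window_size + 1)
        ]

    counts = Counter()
    for window in windows:
        # Deduplicate and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts
```

Pairs with non-zero counts would then become candidate edges between word nodes when the document graph is built in step 2.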
    

Train

python ssl_graphmodels/pyg_models/train_docs.py --name R8
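
The commands above can be chained into one script. The sketch below only reuses the script paths and the --name flag shown in this README; the helper names pipeline_commands and run_pipeline are hypothetical, and running the re-extraction step only for R8 and R52 is an assumption based on the data section.

```python
import subprocess

# Assumption: only the Reuters subsets need the re-extraction step.
REUTERS_SUBSETS = {"R8", "R52"}


def pipeline_commands(name):
    """Return the README commands for one dataset, in execution order."""
    commands = []
    if name in REUTERS_SUBSETS:
        commands.append(
            ["python", "re-extract_data/mk_R8_R52.py", "--name", name]
        )
    commands += [
        ["python", "remove_words.py", "--name", name],
        ["python", "ssl_make_graphs/create_cooc_document.py", "--name", name],
        ["python", "ssl_make_graphs/PygDocsGraphDataset.py", "--name", name],
        ["python", "ssl_graphmodels/pyg_models/train_docs.py", "--name", name],
    ]
    return commands


def run_pipeline(name):
    """Run each step from the repo root, stopping at the first failure."""
    for cmd in pipeline_commands(name):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("R8") from the repository root would run data preparation, graph construction, and training in sequence.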

Reference

If you find our paper and repo useful, please cite our paper:

@inproceedings{piao2022sparse,
  title={Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification},
  author={Piao, Yinhua and Lee, Sangseon and Lee, Dohoon and Kim, Sun},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={36},
  number={10},
  pages={11165--11173},
  year={2022}
}

The readme is inspired by GSAT.