


Semantic Relations between Wikipedia Articles


Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles" (PDF on arXiv). The supplemental material is available for download from GitHub Releases or Zenodo.


Getting started

Requirements:

  • Python >= 3.7 (Conda)
  • Jupyter notebook (for evaluation)
  • GPU with CUDA support (for training Transformer models)

First, we recommend creating a new virtual environment for Python 3.7 with Conda:

conda create -n docrel python=3.7
conda activate docrel

Install all Python dependencies:

pip install -r requirements.txt
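
Before moving on, it can help to confirm that the environment matches the requirements listed above. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and it assumes that PyTorch is among the installed dependencies (if it is not, the CUDA check is simply skipped).

```python
import sys


def check_environment(min_python=(3, 7)):
    """Report whether the interpreter and (if PyTorch is present) CUDA look usable."""
    report = {
        "python_version": "%d.%d" % sys.version_info[:2],
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": False,
        "cuda_available": None,  # stays None when PyTorch cannot be imported
    }
    try:
        import torch  # assumption: installed via requirements.txt
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

If `cuda_available` is reported as `False` or `None`, evaluation in a notebook may still be possible, but training Transformer models will likely be impractical.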

Download the dataset (and pretrained models):

# Navigate to data directory
cd data

# Wikipedia corpus
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2

# - decompress 
bzip2 -d enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2

# Train and test data
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/train_testdata__4folds.tar.gz

# - decompress
tar -xzf train_testdata__4folds.tar.gz

# Models
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/model_wiki.bert_base__joint__seq512.tar.gz

# - decompress
tar -xzf model_wiki.bert_base__joint__seq512.tar.gz
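
On systems without `wget`, `bzip2` or `tar`, the same steps can be approximated with the Python standard library. The sketch below reuses the release URLs and file names from the commands above; the helper names `unpack` and `fetch_all` are hypothetical and not part of the repository. Unlike `bzip2 -d`, it keeps the compressed file next to the decompressed one.

```python
import bz2
import shutil
import tarfile
import urllib.request
from pathlib import Path

RELEASE_URL = "https://github.com/malteos/semantic-document-relations/releases/download/1.0/"
FILES = [
    "enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2",
    "train_testdata__4folds.tar.gz",
    "model_wiki.bert_base__joint__seq512.tar.gz",
]


def unpack(archive, target_dir):
    """Decompress a .bz2 file or extract a .tar.gz archive into target_dir."""
    archive, target_dir = Path(archive), Path(target_dir)
    if archive.name.endswith(".tar.gz"):
        with tarfile.open(str(archive), "r:gz") as tar:
            tar.extractall(str(target_dir))
    elif archive.suffix == ".bz2":
        out_path = target_dir / archive.stem  # stem drops the trailing ".bz2"
        with bz2.open(str(archive), "rb") as src, open(str(out_path), "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        raise ValueError("Unsupported archive type: %s" % archive.name)


def fetch_all(data_dir="data"):
    """Download every release file into data_dir (if missing) and unpack it there."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        archive = data_dir / name
        if not archive.exists():
            urllib.request.urlretrieve(RELEASE_URL + name, str(archive))
        unpack(archive, data_dir)
```

Calling `fetch_all("data")` from the repository root would download and unpack all three files; nothing is downloaded until the function is called.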

Experiments

Run a predefined experiment (settings can be found in experiments/predefined/wiki):

# Config: wiki.bert_base__joint__seq512
# GPU ID: 1 (set via CUDA_VISIBLE_DEVICES=1)
# Output dir: ./output
python cli.py run ./output 1 wiki.bert_base__joint__seq512
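
To run several predefined experiments one after another, the command above can be wrapped in a small script. This is a sketch under the assumption that `cli.py run` takes exactly the three positional arguments shown in the example (output directory, GPU ID, config name); `build_command` and `run_experiments` are hypothetical helpers and not part of the repository.

```python
import subprocess
import sys


def build_command(output_dir, gpu_id, config):
    """Assemble the argument list for `python cli.py run <output_dir> <gpu_id> <config>`."""
    return [sys.executable, "cli.py", "run", str(output_dir), str(gpu_id), str(config)]


def run_experiments(configs, output_dir="./output", gpu_id=1):
    """Run each config sequentially and stop at the first failing run."""
    for config in configs:
        # check=True raises CalledProcessError if cli.py exits with a non-zero status
        subprocess.run(build_command(output_dir, gpu_id, config), check=True)
```

For example, `run_experiments(["wiki.bert_base__joint__seq512"])`, called from the repository root, would launch the same run as the command above.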

Demo

You can run the demo Jupyter notebook on Google Colab (see the "Open In Colab" badge in the repository).

How to cite

If you are using our code, please cite our paper:

@InProceedings{Ostendorff2020,
  title = {Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles},
  booktitle = {Proceedings of the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},
  author = {Ostendorff, Malte and Ruas, Terry and Schubotz, Moritz and Gipp, Bela},
  year = {2020},
  month = {Aug.},
}

License

MIT