Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Concept Extraction Using Pointer-Generator Networks and Distant Supervision for Data Augmentation

Overview

This repository contains code for running concept extraction with the pretrained models described in the paper "Concept Extraction Using Pointer-Generator Networks and Distant Supervision for Data Augmentation". A preview of the dataset used to train and evaluate the concept extraction models is included. The full dataset can be downloaded from Google Drive (~600 MB): concept_extraction_dataset.zip. Details about the models and the dataset can be found in the paper.

Citation:
Shvets, A. and Wanner, L. 2020. Concept Extraction Using Pointer-Generator Networks and Distant Supervision for Data Augmentation. In International Conference on Knowledge Engineering and Knowledge Management. Springer, Cham.

Figure: The architecture of the proposed model.

Installation

Install into Anaconda Python Environment (recommended)
Step 1. Download and install Anaconda (Windows, Mac OS X, Linux)
Step 2. Create conda environment:

# Create a new conda environment
conda create -n ce-env python=3.6

# Activate the conda environment
# (on conda 4.4 or later, the recommended form is "conda activate ce-env")
source activate ce-env

Step 3. Ensure pip is up-to-date:

conda update pip

Step 4. Run setup.py to install the necessary Python dependencies and download the pretrained models:

cd ConceptExtraction/
python setup.py

or run manually:

pip install -r requirements.txt
python download_models.py

Step 5. Install PyTorch (torch) by following the instructions at https://pytorch.org/
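
A quick check can confirm that torch is visible from the active environment before continuing. The snippet below is only a minimal sketch; the helper name `is_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("torch"):
        import torch  # imported only when it is known to be available
        print("torch version:", torch.__version__)
    else:
        print("torch was not found in this environment; see https://pytorch.org/")
```

Run it with the ce-env environment activated, so that the check applies to the same interpreter that will run the extractor.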

Step 6. Clone and install submodule dependencies

git submodule update --init --recursive
cd OpenNMT-py/
python setup.py install
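# Copy the modified translate.py into the OpenNMT-py root folder
cp ../translate.py ../OpenNMT-py/translate.py
# Return to the repository root (assuming run_concept_extraction.py is located there)
cd ..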

Extract concepts

Run the extractor that uses pretrained models:

python run_concept_extraction.py -i static/example_text.txt -odir output

If this fails with an error involving "torch.div", replace "torch.div" with "torch.floor_divide" in the file "OpenNMT-py/onmt/translate/beam_search.py", line 164.
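
The torch.div workaround for beam_search.py can also be applied with a short script instead of a manual edit. The following is a sketch under assumptions: the function name `patch_torch_div` is hypothetical, and it matches "torch.div(" (with the opening parenthesis) rather than the bare "torch.div" so that it does not touch other names starting with the same prefix. It changes only the given line and does nothing if that line does not contain the call.

```python
from pathlib import Path


def patch_torch_div(path, line_no=164):
    """Swap torch.div( for torch.floor_divide( on one line; return True if changed."""
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines(keepends=True)
    index = line_no - 1  # line numbers are 1-based
    if index < 0 or index >= len(lines) or "torch.div(" not in lines[index]:
        return False  # nothing to patch on that line
    lines[index] = lines[index].replace("torch.div(", "torch.floor_divide(")
    file_path.write_text("".join(lines), encoding="utf-8")
    return True
```

For example, `patch_torch_div("OpenNMT-py/onmt/translate/beam_search.py")` run from the repository root would target line 164. If it returns False, the line number in the installed OpenNMT-py version probably differs and the file should be inspected by hand.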

To check all available options, simply run:

python run_concept_extraction.py --help
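
The extractor command shown earlier can also be launched from a Python script, for instance as one step of a larger pipeline. The sketch below uses only the -i and -odir options that appear in this README; the function names `build_command` and `run_extraction` are hypothetical, and the script is assumed to be started from the folder that contains run_concept_extraction.py.

```python
import subprocess
import sys


def build_command(input_path, output_dir):
    """Build the argument list for one run of run_concept_extraction.py."""
    return [
        sys.executable,  # the interpreter of the active environment
        "run_concept_extraction.py",
        "-i", input_path,
        "-odir", output_dir,
    ]


def run_extraction(input_path, output_dir):
    """Run the extractor; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(input_path, output_dir), check=True)
```

Calling `run_extraction("static/example_text.txt", "output")` corresponds to the command line given above.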

The input file should contain one text per line; each line may be a whole paragraph or a single sentence.
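
Texts that contain line breaks need to be flattened to fit this one-text-per-line format. The following is a minimal sketch of one way to do that; the function names and the output file name `my_texts.txt` are hypothetical, and collapsing all whitespace runs to single spaces is an assumption about acceptable preprocessing, not something the paper prescribes.

```python
from pathlib import Path


def to_one_text_per_line(texts):
    """Collapse internal whitespace so that each text occupies exactly one line."""
    lines = []
    for text in texts:
        flattened = " ".join(text.split())  # drops newlines, tabs, repeated spaces
        if flattened:  # skip texts that are empty after flattening
            lines.append(flattened)
    return ("\n".join(lines) + "\n") if lines else ""


def write_input_file(texts, path):
    """Write the flattened texts to a UTF-8 file, one text per line."""
    Path(path).write_text(to_one_text_per_line(texts), encoding="utf-8")
```

For example, `write_input_file(paragraphs, "my_texts.txt")` produces a file that can then be passed to the extractor with the -i option.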