Unsupervised Keyphrase Extraction

This code supports documents longer than 512 tokens.
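How the long-document case is handled is not described here. One common way to get past a 512-token encoder limit is to split the token sequence into overlapping windows and encode each window separately. The sketch below illustrates that idea only; the function name `split_into_windows` and the `max_len` and `stride` defaults are assumptions for illustration, not taken from this repository.

```python
def split_into_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Hypothetical helper: the window size and stride are illustrative defaults.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if stride > max_len:
        # A stride larger than the window would skip tokens between windows.
        raise ValueError("stride must not exceed max_len")
    if len(tokens) <= max_len:
        return [list(tokens)]

    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows
```

With a stride smaller than the window, neighbouring windows share tokens, so a phrase cut at one window boundary still appears whole in the next window.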

Requirements

We use StanfordCoreNLP 4.5.1 to preprocess the data. You can download it here: https://stanfordnlp.github.io/CoreNLP/index.html.

Step 0: tokenize and tag the plain text (one example per line).

python src/data_preprocess.py [data_path] [file_name]
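The input is expected to hold one example per line, so documents that contain line breaks need to be flattened first. A minimal sketch of that preparation step follows. The helper names `to_single_lines` and `write_examples` are hypothetical, and dropping empty documents is an assumption about what is convenient, not a documented requirement of `data_preprocess.py`.

```python
def to_single_lines(documents):
    """Collapse all whitespace (including newlines) so each document is one line.

    Empty documents are dropped; this is an assumption, so keep them if line
    numbers must stay aligned with another file.
    """
    lines = []
    for doc in documents:
        line = " ".join(doc.split())
        if line:
            lines.append(line)
    return lines


def write_examples(documents, path):
    """Write the flattened documents to path, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_single_lines(documents):
            handle.write(line + "\n")
```

The resulting file can then be passed as [file_name] in the command above.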

Step 1: obtain embeddings of candidate phrases and the whole document.

python src/get_embedding.py --file_path [data_path] --file_name [file_name] --model_name [pretrained model name/path]
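The pooling strategy used by `get_embedding.py` is not stated here. As a rough illustration of what "embeddings of candidate phrases and the whole document" can mean, the sketch below mean-pools per-token vectors over each candidate span and over the full document. Mean pooling, the function names, and the `(start, end)` span format are all assumptions made for this example.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, component by component."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def embed_document_and_candidates(token_vectors, candidate_spans):
    """Return (document_vector, {span: phrase_vector}).

    token_vectors: one vector per token, e.g. taken from a pretrained encoder.
    candidate_spans: (start, end) token offsets, end exclusive (assumed format).
    """
    doc_vec = mean_pool(token_vectors)
    cand_vecs = {
        (start, end): mean_pool(token_vectors[start:end])
        for start, end in candidate_spans
    }
    return doc_vec, cand_vecs
```

In practice the token vectors would come from the model given by --model_name; here they are plain lists of floats so the example stays self-contained.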

Step 2: extract keyphrases.

python src/ranker.py [data_path] [model_name]
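The scoring function used by `ranker.py` is not described here. A common baseline for unsupervised keyphrase extraction is to rank each candidate phrase by the cosine similarity between its embedding and the document embedding. The sketch below shows that baseline only; `cosine`, `rank_candidates`, and `top_k` are hypothetical names, and the actual ranker may score candidates differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc_vec, cand_vecs, top_k=5):
    """Return the top_k (phrase, score) pairs, most similar to the document first.

    cand_vecs maps each candidate phrase to its embedding (assumed layout).
    """
    scored = [(phrase, cosine(doc_vec, vec)) for phrase, vec in cand_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Candidates whose embeddings point in the same direction as the document embedding score close to 1 and are returned first.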