PDFKeyphraseExtract

##Requirements You will require the following conditions to run the program:

JVM installed on the machine
CLI tool to execute the jar

##For End Users The built standalone jar file is able to extract bibliography items from a specified PDF document or a specified list. You can perform this task by typing the following command into a CLI program:

java -jar [name].jar [input files] [output folder]

where you replace

name with the name of the standalone jar file
input files with the paths to the list of files you wish to extract keywords from, each separated with a space. Such as "C:/users/file1.pdf C:/users/file2.pdf"
output folder with the path to the directory you wish to store the output

It will store the extracted keywords in a file with the name of the corresponding document.

##For Developers This code functions by using the KEA Keyphrase Extraction Algorithm tool. It has modified the KEA tool to remove the model building component, and to change the type of input it works with. We have also changed the SKOS file to use ACMs digital computing classification SKOS file, and this can be interchanged with any other correctly formatted SKOS file from different domains depending on the type of text content is being processed. This tool works by reading each word from the text content from the input files, compares it against a provided SKOS file, and extracts the most relevant keyphrases.

The source code is explained below:

###Package - main Responsible for starting and running the keyphrase extractor classes.

Main class:

entry point to the program
parses the parameters provided
provides help messages if input parameters are incorrect
creates a local version of the resources required (stopwords, model, SKOS file)
starts and handles all of the tasks running in background

KeyphraseExtractorWorker class:

processes each pdf input
extracts keyphrases from input
prints extracted keyphrases into an output file, and also to console (which is read by GUI application)

KeyphraseExtractor class:

handles keyphrase extraction logic
extracts keyphrases based on provided model and SKOS file, ignoring any words in the stopword file
takes user specified fields as input (model name, file name, etc.) for extraction process
prints extracted keyphrases into an output file, and also to console (which is read by GUI application)

###Package - wrapper Contains modification of KEA's Stopwords class to instead use the provided stopwords file in the resources folder.

###Package - resources Contains all the external resources required for keyphrase extraction

compvoc.rdf

SKOS file containing hierarchy of keywords from the ACM Digital Computing Classification system
keyphrases are extracted by comparing words from the text with keywords in the SKOS file
This file can be replaced with other SKOS files as long as they have the correct format

model

this tool extracts keyphrases based on this model which contains trained data from a large number of publications

stopwords.txt

file containing all the words to be ignored during extraction (e.g. and, the, they, etc.)

blee660/PDFKeyphraseExtract

PDFKeyphraseExtract