PDF Text Extractor

A PDF extractor that generates text statistics for PDF files. Based on Apache PDFBox.

User Guide

Download or build the latest release of pdf-extractor-{version}.jar (the JAR)
Move the JAR to a folder that is convenient for you
Prepare the (relative or absolute) paths for the following:
1. The {keywords_file}, e.g. keywords/keywords.txt
  - Must be a plain-text file; any file extension is accepted
2. A {pdf_folder} that contains the PDF files to be extracted, e.g. pdf/
  - Only PDF files with a .pdf extension will be processed
3. An {output_file} path, e.g. output/output.xlsx
  - File name must end with .xlsx
Open Terminal or Command Prompt and navigate to the folder that contains the JAR
Run the JAR with the following command:
```
java -jar pdf-extractor-{version}.jar --keyword-file-path {keywords_file} --pdf-folder-path {pdf_folder} --output-file-path {output_file} --parallel --case-sensitive
```
- Mandatory flags
  - --keyword-file-path: path of {keywords_file}
  - --pdf-folder-path: path of {pdf_folder}
  - --output-file-path: path of {output_file}
- Optional (but important) flags
  - --parallel: enables parallel processing
    - if this flag is not set, the program uses sequential processing
  - --case-sensitive: enables case-sensitive matching
    - if this flag is not set, the program converts both the keywords and the extracted text to lower case before comparing
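
The effect of --case-sensitive can be illustrated with a minimal sketch. This is not the extractor's actual code: the class name KeywordMatchSketch and the method countKeywords are hypothetical, and the sketch assumes the statistic of interest is a count of non-overlapping keyword occurrences, which this guide does not state. It only shows the documented behaviour of lower-casing both the keywords and the extracted text when the flag is absent.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class KeywordMatchSketch {

    // Hypothetical helper: counts non-overlapping occurrences of each keyword in the text.
    // When caseSensitive is false, both sides are lower-cased before comparing,
    // mirroring the behaviour described for a missing --case-sensitive flag.
    static Map<String, Integer> countKeywords(String text, List<String> keywords, boolean caseSensitive) {
        String haystack = caseSensitive ? text : text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            String needle = caseSensitive ? keyword : keyword.toLowerCase(Locale.ROOT);
            int count = 0;
            if (!needle.isEmpty()) {
                int from = 0;
                while ((from = haystack.indexOf(needle, from)) >= 0) {
                    count++;
                    from += needle.length(); // skip past this match (non-overlapping)
                }
            }
            counts.put(keyword, count); // results are keyed by the keyword as written
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "PDF files are common. A pdf can embed fonts.";
        List<String> keywords = List.of("PDF", "fonts");
        System.out.println(countKeywords(text, keywords, false)); // {PDF=2, fonts=1}
        System.out.println(countKeywords(text, keywords, true));  // {PDF=1, fonts=1}
    }
}
```

In this sketch, the keyword "PDF" matches both "PDF" and "pdf" without the flag, but only "PDF" with it.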

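For example, with version 2.0.0 and neither optional flag set (sequential processing, case-insensitive matching):
java -jar pdf-extractor-2.0.0.jar --keyword-file-path "keywords/keywords.txt" --pdf-folder-path "pdf/" --output-file-path "output/output.xlsx"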
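
The input rules from the User Guide (only files with a .pdf extension are processed, and the output file name must end with .xlsx) can be checked before running the JAR. The following is a rough sketch under stated assumptions, not the extractor's own validation: the class InputCheckSketch and its methods are hypothetical, the extension check is assumed to be an exact lower-case match, and only files directly inside the folder are considered.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InputCheckSketch {

    // Hypothetical helper: regular files directly inside pdfFolder whose names end with ".pdf".
    // Assumption: no recursion into subfolders and no case-insensitive extension matching.
    static List<Path> listPdfFiles(Path pdfFolder) throws IOException {
        try (Stream<Path> entries = Files.list(pdfFolder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Hypothetical helper: true when the output file name ends with ".xlsx".
    static boolean isValidOutputPath(Path outputFile) {
        Path name = outputFile.getFileName();
        return name != null && name.toString().endsWith(".xlsx");
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway folder with one PDF-named file and one other file.
        Path folder = Files.createTempDirectory("pdf-sketch");
        Files.createFile(folder.resolve("a.pdf"));
        Files.createFile(folder.resolve("notes.txt"));

        System.out.println(listPdfFiles(folder).size());                       // 1
        System.out.println(isValidOutputPath(Path.of("output/output.xlsx")));  // true
        System.out.println(isValidOutputPath(Path.of("output/output.csv")));   // false
    }
}
```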