PDF extractor used to generate text statistics of PDF files. Based on Apache PDFBox.
- Download or build the latest release of pdf-extractor-{version}.jar (the JAR)
- Move the JAR to a folder that is convenient to you
- Prepare the (relative or absolute) paths for the following:
  - The {keywords_file}, e.g. keywords/keywords.txt
    - Must be a plain-text file; any file extension is accepted
  - A {pdf_folder} that contains the PDF files to be extracted, e.g. pdf/
    - Only PDF files with a .pdf extension will be processed
  - An {output_file} path, e.g. output/output.xlsx
    - File name must end with .xlsx
- Open Terminal or Command Prompt and navigate to the folder that contains the JAR
- Run the JAR with the following command:
java -jar pdf-extractor-{version}.jar --keyword-file-path {keywords_file} --pdf-folder-path {pdf_folder} --output-file-path {output_file} --parallel --case-sensitive
  - Mandatory flags
    - --keyword-file-path: path of {keywords_file}
    - --pdf-folder-path: path of {pdf_folder}
    - --output-file-path: path of {output_file}
  - Optional (but important) flags
    - --parallel: enables parallel processing; if this flag is not set, the program uses sequential processing
    - --case-sensitive: enables case-sensitive matching; if this flag is not set, the program converts both the keywords and the extracted text to lower case before comparing
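The effect of the two optional flags can be illustrated with a small Java sketch. This is not the tool's actual source: the class name FlagSketch, its method names, and the choice to count non-overlapping occurrences per keyword are assumptions made for illustration, and the sketch works on text that has already been extracted rather than calling PDFBox. It only mirrors the documented behavior: without --case-sensitive both keywords and text are lower-cased before comparing, and without --parallel the work runs sequentially.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical illustration of the optional flags; not the pdf-extractor implementation.
class FlagSketch {

    // Counts non-overlapping occurrences of keyword in text (an assumed statistic).
    static long countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        long count = 0;
        int from = 0;
        while ((from = text.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length();
        }
        return count;
    }

    // texts holds already-extracted text, one string per PDF.
    // parallel and caseSensitive stand in for --parallel and --case-sensitive.
    static Map<String, Long> countKeywords(List<String> texts, List<String> keywords,
                                           boolean parallel, boolean caseSensitive) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // Without case sensitivity, both sides are lower-cased before comparing.
            String needle = caseSensitive ? keyword : keyword.toLowerCase(Locale.ROOT);
            Stream<String> stream = parallel ? texts.parallelStream() : texts.stream();
            long total = stream
                    .map(t -> caseSensitive ? t : t.toLowerCase(Locale.ROOT))
                    .mapToLong(t -> countOccurrences(t, needle))
                    .sum();
            totals.put(keyword, total);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("PDF files and pdf tools", "A Pdf report");
        List<String> keywords = List.of("pdf");
        System.out.println(countKeywords(texts, keywords, false, false)); // {pdf=3}
        System.out.println(countKeywords(texts, keywords, true, true));   // {pdf=1}
    }
}
```

In this sketch the parallel and sequential paths return the same totals; only the scheduling differs, which is why --parallel mainly matters for large PDF folders.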
Example:

java -jar pdf-extractor-2.0.0.jar --keyword-file-path "keywords/keywords.txt" --pdf-folder-path "pdf/" --output-file-path "output/output.xlsx"
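Before running a command like the one above, the three path requirements can be checked with a short pre-flight sketch. The class PathCheckSketch and its methods are hypothetical and not part of the JAR. The sketch assumes that only files directly inside the PDF folder are considered and that the extension match is exact (lower-case .pdf and .xlsx); the tool itself may behave differently on both points.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical pre-flight checks; the real tool may validate its inputs differently.
class PathCheckSketch {

    // True if the file name ends with the given extension, e.g. ".pdf".
    static boolean hasExtension(Path path, String extension) {
        Path name = path.getFileName();
        return name != null && name.toString().endsWith(extension);
    }

    // Lists regular files directly inside pdfFolder whose names end with ".pdf".
    static List<Path> listPdfFiles(Path pdfFolder) throws IOException {
        try (Stream<Path> entries = Files.list(pdfFolder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> hasExtension(p, ".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Example paths taken from the command above.
        Path keywordsFile = Path.of("keywords/keywords.txt");
        Path pdfFolder = Path.of("pdf/");
        Path outputFile = Path.of("output/output.xlsx");

        System.out.println("keywords file readable: " + Files.isReadable(keywordsFile));
        System.out.println("output name ends with .xlsx: " + hasExtension(outputFile, ".xlsx"));
        if (Files.isDirectory(pdfFolder)) {
            System.out.println("PDF files found: " + listPdfFiles(pdfFolder).size());
        } else {
            System.out.println("PDF folder not found: " + pdfFolder);
        }
    }
}
```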
Dependencies:
- org.apache.commons.commons-lang3
- org.apache.pdfbox.pdfbox
- org.apache.poi.poi
- org.apache.poi.poi-ooxml
- org.javatuples.javatuples