Enhanced mechanisms to TestGrammar for processing bulk PDFs
petervwyatt opened this issue · 2 comments
Request to provide enhanced mechanisms to the C++ PoC app for processing folders of PDFs:
- list of PDF files to exclude when using `--pdf` recursive folder processing: maybe as `--exclude [ @filelist.txt | string ]`. A simple start-of-string match is done of the current full path of the PDF vs. `string` or each line in the specified text file. If a match is found, then the file is skipped. This allows `string` and `@filelist.txt` to be directory names or filenames.
- explicit list of PDF files to test (replacing `--pdf` recursive folder processing): maybe have the string argument following `--pdf` start with `@`, so `--pdf @filelist.txt`. `filelist.txt` is then one PDF path/filename per line in platform-appropriate syntax (as though output from the Linux `find` command or Windows CMD `dir /S/B`)
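
To make the proposed exclusion matching concrete, here is a minimal sketch of the start-of-string test. The names `load_exclusions` and `is_excluded` are hypothetical and not part of TestGrammar; the sketch also assumes blank lines in the list file should be ignored, which the request does not specify.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: turn the --exclude argument into a list of prefixes.
// "@file" means "read one prefix per line from file"; anything else is
// treated as a single literal prefix.
std::vector<std::string> load_exclusions(const std::string& arg) {
    std::vector<std::string> prefixes;
    if (arg.empty()) {
        return prefixes;
    }
    if (arg[0] != '@') {
        prefixes.push_back(arg);
        return prefixes;
    }
    std::ifstream in(arg.substr(1));
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings from Windows-generated lists.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        // Assumption: blank lines are skipped so they cannot match everything.
        if (!line.empty()) {
            prefixes.push_back(line);
        }
    }
    return prefixes;
}

// Hypothetical helper: true when the full path starts with any prefix.
bool is_excluded(const std::string& path,
                 const std::vector<std::string>& prefixes) {
    for (const std::string& prefix : prefixes) {
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0) {
            return true;
        }
    }
    return false;
}
```

Note that a plain prefix match means `/corpus/bad` would also skip `/corpus/badger/x.pdf`; adding a trailing path separator to directory entries would avoid that.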
The explicit list of files to test should also be able to specify a password per file after the filename. Because filenames can contain spaces, and to avoid the need to quote everything, use a special character as a separator (`@` again?)
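
One possible way to split a list line into path and password is sketched below. `ListEntry` and `parse_list_line` are hypothetical names, and the choice to split at the last occurrence of the separator is an assumption: it tolerates `@` inside filenames but not inside passwords.

```cpp
#include <string>

// Hypothetical record for one line of the explicit file list.
struct ListEntry {
    std::string path;
    std::string password;  // empty when the line carries no password
};

// Hypothetical parser: split at the LAST separator so that spaces and
// earlier separator characters in the path are left untouched.
ListEntry parse_list_line(const std::string& line, char sep = '@') {
    ListEntry entry;
    const std::string::size_type pos = line.rfind(sep);
    if (pos == std::string::npos) {
        entry.path = line;  // no separator: whole line is the path
    } else {
        entry.path = line.substr(0, pos);
        entry.password = line.substr(pos + 1);
    }
    return entry;
}
```

Under this sketch, a line such as `/docs/my file.pdf@secret` yields the path `/docs/my file.pdf` and the password `secret`, with no quoting needed.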
- Added `--dryrun` also for testing - does everything except process a PDF (it does create `.ansi` or `.txt` files of zero length)
- Also need an `--allfiles` option to ignore file extensions and process every file regardless (e.g. SafeDocs/JPL CommonCrawl repo)
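
The `--allfiles` behaviour could be sketched as a small filter applied to each candidate file. The function names are hypothetical, and the assumption that the default filter is a case-insensitive `.pdf` extension check is mine, not something stated in the request.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: case-insensitive check for a ".pdf" extension on
// the final path component (handles both '/' and '\\' separators).
bool has_pdf_extension(const std::string& path) {
    const std::string::size_type pos = path.find_last_of("./\\");
    if (pos == std::string::npos || path[pos] != '.') {
        return false;  // no dot after the last path separator
    }
    std::string ext = path.substr(pos);
    for (char& c : ext) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return ext == ".pdf";
}

// Hypothetical filter: with all_files set (i.e. --allfiles given), every
// file is processed regardless of its extension.
bool should_process(const std::string& path, bool all_files) {
    return all_files || has_pdf_extension(path);
}
```

This would let extension-less corpus files be picked up only when the option is given, while leaving the default folder walk unchanged.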