pdf-association/arlington-pdf-model

Enhanced mechanisms to TestGrammar for processing bulk PDFs

petervwyatt opened this issue · 2 comments

Request to provide enhanced mechanisms to the C++ PoC app for processing folders of PDFs:

  • A list of PDF files to exclude when using --pdf recursive folder processing: maybe as --exclude [ @filelist.txt | string ]. A simple start-of-string match is done between the current full path of the PDF and string, or each line of the specified text file. If a match is found, the file is skipped. This allows string and the lines of @filelist.txt to be directory names or filenames.

  • An explicit list of PDF files to test (replacing --pdf recursive folder processing): maybe have the string argument following --pdf start with @, as in --pdf @filelist.txt. filelist.txt then contains one PDF path/filename per line in platform-appropriate syntax (as though output by the Linux find command or Windows CMD dir /S/B).

The explicit list of files to test should also be able to specify a password per file after the filename. Because filenames can contain spaces, and to avoid the need to quote everything, use a special character as a separator (@ again?).
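To make the two proposals above concrete, here is a rough C++17 sketch of the start-of-string exclusion match and of splitting one file-list line into path and password. None of this is existing TestGrammar code: `PdfListEntry`, `parse_list_line` and `is_excluded` are hypothetical names, and splitting at the first `@` is only an assumption, since the separator character is still an open question.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical: one entry read from a "--pdf @filelist.txt" file.
struct PdfListEntry {
    std::string path;                     // PDF path/filename, may contain spaces
    std::optional<std::string> password;  // empty when the line has no separator
};

// Split a line at the FIRST occurrence of `sep` (assumed to be '@').
// Everything before it is the path, everything after it is the password.
PdfListEntry parse_list_line(const std::string& line, char sep = '@') {
    PdfListEntry entry;
    const std::size_t pos = line.find(sep);
    if (pos == std::string::npos) {
        entry.path = line;
    } else {
        entry.path = line.substr(0, pos);
        entry.password = line.substr(pos + 1);
    }
    return entry;
}

// Start-of-string match: true if `full_path` begins with any exclusion prefix.
// The prefixes are either the single --exclude string or the lines of the file.
bool is_excluded(const std::string& full_path,
                 const std::vector<std::string>& prefixes) {
    for (const std::string& prefix : prefixes) {
        if (!prefix.empty() &&
            full_path.compare(0, prefix.size(), prefix) == 0) {
            return true;
        }
    }
    return false;
}
```

With this sketch, a line such as `docs/my file.pdf@secret` yields the path `docs/my file.pdf` and the password `secret`. A filename that itself contains `@` would be split incorrectly, which is one reason the choice of separator needs care.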

  • Added --dryrun, also for testing: it does everything except process a PDF (it still creates zero-length .ansi or .txt files)
  • Also need an --allfiles option to ignore file extensions and process every file regardless (e.g. SafeDocs/JPL CommonCrawl repo)
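As a sketch of how --allfiles could interact with folder processing, the check below assumes that files are currently selected by a case-insensitive `.pdf` extension, which the issue does not confirm. `should_process` and its `all_files` parameter are hypothetical names, not part of the existing app.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>

// Decide whether a file found during folder traversal should be tested.
// all_files == true models the proposed --allfiles switch.
bool should_process(const std::filesystem::path& file, bool all_files) {
    if (all_files) {
        return true;  // ignore the extension entirely
    }
    // Assumed default: accept ".pdf" in any letter case.
    std::string ext = file.extension().string();
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return ext == ".pdf";
}
```

Extension-less files, as found in crawled corpora, are only accepted when `all_files` is set.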