pdf-parser

Docker setup for running a PDF parsing server

Local Setup

git clone git@github.com:WriterDuetTeam/pdf-parser.git
cd pdf-parser
npm install

Run and test locally OUTSIDE of Docker

npm run dev
Visit http://localhost:8080/test
Choose a .pdf file.
Wait and see the response from POST /convert_script

Having issues? See "Initial Docs from Guy" below.

Run and test locally INSIDE of Docker

npm run docker:dev
Visit http://localhost:8080/test
Choose a .pdf file.
Wait and see the response from POST /convert_script

Test against a test Google Cloud Run (GCR) instance

Run npm run dev.
Visit http://localhost:8080/test
Update the Api Url to https://pdf-parser-ki7n3lc5eq-wn.a.run.app. TODO remove this!
Choose a .pdf file.
Wait and see the response from POST /convert_script

PDF Conversion Flow

User goes to File > Import and selects a .pdf file to import.
WriterDuet (WD) sends a POST request to GCR with the following form-data fields:

ext: pdf
agree_tou: true
format: fdx
download: true
script: (binary)

Full curl example below.

# Upload file.pdf from the current directory as the "script" form field.
# -F makes curl send a multipart/form-data POST and set the boundary itself.
curl 'http://localhost:8080/convert_script' \
  -F 'ext=pdf' \
  -F 'agree_tou=true' \
  -F 'format=fdx' \
  -F 'download=true' \
  -F 'script=@file.pdf;type=application/octet-stream'
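The form fields listed above can also be assembled in code. The following is a minimal sketch of the multipart/form-data wire format, assuming only the field names from this README; the helper names (ascii, concat, buildConvertBody) are hypothetical and not part of this repo.

```typescript
// Text fields from the list above. Helper names below are hypothetical.
const TEXT_FIELDS: Record<string, string> = {
  ext: "pdf",
  agree_tou: "true",
  format: "fdx",
  download: "true",
};

// Encode an ASCII string as bytes (part headers and field values are ASCII here).
function ascii(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i) & 0xff;
  return out;
}

// Join byte chunks into a single buffer.
function concat(chunks: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const c of chunks) total += c.length;
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}

// Build the request body: one part per text field, then the PDF as "script".
function buildConvertBody(pdf: Uint8Array, filename: string, boundary: string): Uint8Array {
  const chunks: Uint8Array[] = [];
  for (const name of Object.keys(TEXT_FIELDS)) {
    chunks.push(ascii(
      `--${boundary}\r\nContent-Disposition: form-data; name="${name}"\r\n\r\n${TEXT_FIELDS[name]}\r\n`
    ));
  }
  chunks.push(ascii(
    `--${boundary}\r\nContent-Disposition: form-data; name="script"; filename="${filename}"\r\n` +
    `Content-Type: application/octet-stream\r\n\r\n`
  ));
  chunks.push(pdf);
  chunks.push(ascii(`\r\n--${boundary}--\r\n`));
  return concat(chunks);
}
```

The body must be sent with the header content-type: multipart/form-data; boundary=<the same boundary>. In a browser or Node 18+, the built-in FormData passed to fetch produces an equivalent body without manual encoding.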

GCR spins up one or more containers using our Dockerfile.
The containerized Express app receives the request.
The controller calls the PHP parser to convert the .pdf file to the desired format.
The controller responds with the converted document.
WD receives the document and processes it.
The user sees their original PDF as an editable WD document.
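The controller step in the flow above could be sketched roughly as follows. This is an assumption-heavy illustration, not the app's actual code: the ConvertRequest and ConvertResponse shapes, the status codes, and the error messages are hypothetical, and the PHP parser call is replaced by an injected function so the logic can be exercised without PHP installed.

```typescript
// Hypothetical shapes; only the field names (ext, format, script) come from the request above.
interface ConvertRequest {
  ext: string;
  format: string;
  script: Uint8Array; // uploaded PDF bytes
}

interface ConvertResponse {
  status: number;
  body: string;
}

// Stand-in for the call into the PHP parser; the real invocation is not shown here.
type Parser = (pdf: Uint8Array, format: string) => string;

function handleConvert(req: ConvertRequest, parse: Parser): ConvertResponse {
  // Reject anything that is not declared as a PDF upload.
  if (req.ext !== "pdf") {
    return { status: 400, body: "unsupported ext: " + req.ext };
  }
  // An empty upload cannot be parsed, so fail before invoking the parser.
  if (req.script.length === 0) {
    return { status: 400, body: "empty upload" };
  }
  try {
    // Respond with the document in the requested format.
    return { status: 200, body: parse(req.script, req.format) };
  } catch {
    // Parser failures are reported as a server-side error.
    return { status: 500, body: "conversion failed" };
  }
}
```

Injecting the parser as a function keeps the request checks testable on their own; in the real app that function would wrap the PHP call.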

Open items and TODO

Prioritized TODO list

Random ideas

Test the parser as a black box, with expected output files to compare against
Test the Node code
Set up a GCR simulator
Deploy a test container
Performance-test a single container
Handle multiple output formats
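As a starting point for the black-box idea above, here is a minimal sketch of comparing parser output against an expected file. The function names and the choice to ignore line-ending and trailing-whitespace differences are assumptions, not existing project behavior.

```typescript
// Normalize line endings, trailing whitespace, and trailing blank lines
// so cosmetic differences do not fail the comparison.
function normalize(text: string): string[] {
  const lines = text
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.replace(/\s+$/, ""));
  while (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines;
}

// Returns the 1-based number of the first differing line, or 0 when the outputs match.
function firstDifference(actual: string, expected: string): number {
  const a = normalize(actual);
  const e = normalize(expected);
  const n = Math.max(a.length, e.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== e[i]) {
      return i + 1;
    }
  }
  return 0;
}
```

A test runner would read the generated output (for example output.fountain) and a checked-in expected file, then fail when firstDifference returns a non-zero line number.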

Limitations / issues

Open Questions

Initial Docs from Guy

I finally got together a simplified version of the PHP code which converts PDFs into objects (and optionally outputs as Fountain). It had code to write some other formats too, but none of that is really used anymore, and the .fdx generation is especially uninteresting since it just used some old 3rd party application to convert from .fountain to .fdx (so it’s just playing telephone).

To run it, you’ll need to install pdftohtml first, which I did via:

brew install poppler

That puts it in /usr/local/bin/pdftohtml, though the path could be different for you (Homebrew on Apple Silicon Macs uses /opt/homebrew/bin, for example).

Then, to execute the PHP code from the CLI, the following command should work (replace the pdftohtml path, or just use pdftohtml=pdftohtml if it is on your PATH):

php analyzer/TestParser.php --fountain pdftohtml=/usr/local/bin/pdftohtml PATH_TO_PDF.pdf

That command generates a Fountain file, output.fountain. The more interesting part is to just dump the screenplay blocks it’s generating (it has multiple dump points to show how it’s thinking along the way, but you can modify it to print only at the end with the print_r($parser->get_objects()); code):

php analyzer/TestParser.php -X1707 pdftohtml=pdftohtml PATH_TO_PDF.pdf
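For calling the parser from the Node side, the CLI invocations above can be expressed as an argument list. This is a sketch only: buildParserCommand and ParserCommand are hypothetical names, and the only flags used are the ones shown in the commands above.

```typescript
interface ParserCommand {
  command: string;
  args: string[];
}

// Mirrors the CLI calls above:
//   php analyzer/TestParser.php <flag> pdftohtml=<path> <pdf>
// The defaults assume pdftohtml is on the PATH and Fountain output is wanted.
function buildParserCommand(
  pdfPath: string,
  pdftohtml: string = "pdftohtml",
  flag: string = "--fountain"
): ParserCommand {
  return {
    command: "php",
    args: ["analyzer/TestParser.php", flag, `pdftohtml=${pdftohtml}`, pdfPath],
  };
}
```

Keeping the command and its arguments separate means they can be handed to Node's child_process.execFile as-is, which avoids shell quoting problems when the PDF path contains spaces.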

Take a look at WD’s File > Export option for JSON. I think we’re going to want to use that syntax as the intermediate format the backend app generates directly from the get_objects() result, since it’s the most precise for what we’re doing.

But for testing, starting with Fountain is definitely fine.

And BTW, you’re welcome to deploy to GCR for testing, but that’s a relatively minor detail that we can do at any point. The main task is getting the Docker image ready so that it could be deployed. It’s totally fine if you’re just testing it locally, or want to serve the Docker image from another service if that’s easier for you, since Docker should behave the same anywhere it’s deployed.

gotoenchanter725/PDF-parser-TS