
convert PDF file to text file


Docker for creating a PDF parsing server

Local Setup

  1. git clone git@github.com:WriterDuetTeam/pdf-parser.git
  2. cd pdf-parser
  3. npm install

Run and test locally OUTSIDE of Docker

  1. npm run dev
  2. Visit http://localhost:8080/test
  3. Choose a .pdf file.
  4. Wait and see the response from POST /convert_script

Having issues? See initial notes from Guy below.

Run and test locally INSIDE of Docker

  1. npm run docker:dev
  2. Visit http://localhost:8080/test
  3. Choose a .pdf file.
  4. Wait and see the response from POST /convert_script

Test against a test GCR instance

  1. npm run dev
  2. Visit http://localhost:8080/test
  3. Update the API URL to https://pdf-parser-ki7n3lc5eq-wn.a.run.app. (TODO: remove this!)
  4. Choose a .pdf file.
  5. Wait and see the response from POST /convert_script

PDF Conversion Flow

  1. User goes to File / Import and selects to import a .pdf file.
  2. WD sends a POST request to GCR with the following form-data fields:
     ext: pdf
     agree_tou: true
     format: fdx
     download: true
     script: (binary)

Full curl example below.

curl 'http://localhost:8080/convert_script' \
  -H 'authority: v3.writerduet.com' \
  -H 'sec-ch-ua: "Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"' \
  -H 'accept: */*' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundaryUNOCBGPRhGsVNzti' \
  -H 'origin: https://www.writerduet.com' \
  -H 'sec-fetch-site: same-site' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://www.writerduet.com/' \
  -H 'accept-language: en-US,en;q=0.9' \
  --data-raw $'------WebKitFormBoundaryUNOCBGPRhGsVNzti\r\nContent-Disposition: form-data; name="ext"\r\n\r\npdf\r\n------WebKitFormBoundaryUNOCBGPRhGsVNzti\r\nContent-Disposition: form-data; name="agree_tou"\r\n\r\ntrue\r\n------WebKitFormBoundaryUNOCBGPRhGsVNzti\r\nContent-Disposition: form-data; name="format"\r\n\r\nfdx\r\n------WebKitFormBoundaryUNOCBGPRhGsVNzti\r\nContent-Disposition: form-data; name="download"\r\n\r\ntrue\r\n------WebKitFormBoundaryUNOCBGPRhGsVNzti\r\nContent-Disposition: form-data; name="script"; filename="file.pdf"\r\nContent-Type: application/octet-stream\r\n\r\n\r\n------WebKitFormBoundaryUNOCBGPRhGsVNzti--\r\n'

  3. GCR spins up container(s) using our Dockerfile.
  4. The containerized express app receives the request.
  5. The controller calls the PHP parser to convert the .pdf file to the desired format.
  6. The controller/app responds with the properly formatted version of the document.
  7. WD receives the document and processes it.
  8. The user in WD sees the document that started as a PDF as an editable WD document.
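
The form-data fields from the request above can also be built in code, which is handy for scripted tests against a local server. This is a minimal sketch, assuming Node 18+ (for the global FormData, Blob and fetch); buildConvertForm and postPdf are hypothetical helper names, and the field values are copied from the request shown above.

```javascript
// Sketch only: builds the same multipart fields WD sends to /convert_script.
// Assumes Node 18+ for the global FormData, Blob and fetch.
function buildConvertForm(pdfBytes, filename = 'file.pdf') {
  const form = new FormData();
  form.append('ext', 'pdf');
  form.append('agree_tou', 'true');
  form.append('format', 'fdx');
  form.append('download', 'true');
  // The PDF itself goes in the "script" field as a binary part.
  form.append('script', new Blob([pdfBytes], { type: 'application/octet-stream' }), filename);
  return form;
}

// Hypothetical helper: POSTs the form to a running server and returns the response body as text.
async function postPdf(apiUrl, pdfBytes, filename) {
  const res = await fetch(`${apiUrl}/convert_script`, {
    method: 'POST',
    body: buildConvertForm(pdfBytes, filename),
  });
  if (!res.ok) throw new Error(`convert_script failed with HTTP ${res.status}`);
  return res.text();
}
```

With npm run dev running, something like postPdf('http://localhost:8080', require('node:fs').readFileSync('file.pdf')) should exercise the same path as the /test page. fetch sets the multipart boundary itself, so no content-type header is needed.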


Open items and TODO

Prioritized TODO list

  • Make an example pdf with every standard type of format
  • Allow for more than one output file at a time and remove after sending
  • Establish ability to convert pdf to WD JSON format
    • Make title page have perfectly accurate lines
    • Add all lines from first page to titlePage JSON
    • Dual Dialogue Parser should handle, out of scope
    • Random real scripts from WD
  • Set up GCR simulator (if not more than a quick local thing)
  • Deploy a test GCR container
  • Minimal logging
  • Controller validations on POST body (WD client, file size, pdf format)
  • Tests
  • Load test a single instance (could have up to 80? concurrent requests)
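
For the "Controller validations on POST body" item, a possible starting point is sketched below. It only covers the file size and pdf format checks (not the WD client check). validateUpload, MAX_PDF_BYTES, the 20 MB cap and the error strings are all assumptions for illustration, not existing project code.

```javascript
// Sketch only: the limit and the error strings below are placeholders, not project settings.
const MAX_PDF_BYTES = 20 * 1024 * 1024; // assumed cap; tune against real scripts

// Returns a list of problems; an empty list means the upload looks acceptable.
function validateUpload(fields, fileBytes) {
  const errors = [];
  if (fields.ext !== 'pdf') errors.push('ext must be "pdf"');
  if (!fileBytes || fileBytes.length === 0) {
    errors.push('script file is missing or empty');
    return errors;
  }
  if (fileBytes.length > MAX_PDF_BYTES) errors.push('file is too large');
  // A PDF file normally begins with the ASCII signature "%PDF-".
  const header = Buffer.from(fileBytes.subarray(0, 5)).toString('latin1');
  if (header !== '%PDF-') errors.push('file does not look like a PDF');
  return errors;
}
```

Returning a list rather than throwing lets the controller report every problem in one 400 response. The signature check is cheap and catches renamed non-PDF files before the PHP parser is started.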

Random ideas

  • test parser like a black box, with expected output files to compare against
  • test node stuff
  • Set up GCR simulator
  • Deploy a test container
  • Performance testing a single container
  • Handle multiple output formats
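
The black-box idea (run the parser, compare against expected output files) could look roughly like the sketch below. The helper names and the normalization rules are assumptions. The comparison ignores line-ending and trailing-whitespace differences so that only real content changes fail.

```javascript
const fs = require('node:fs');

// Normalizes line endings and trailing whitespace so cosmetic differences do not fail the test.
function normalize(text) {
  return text
    .replace(/\r\n/g, '\n')
    .split('\n')
    .map((line) => line.trimEnd())
    .join('\n')
    .trim();
}

// Returns the 1-based line number of the first mismatch, or 0 when the texts match.
function firstDifference(actual, expected) {
  const a = normalize(actual).split('\n');
  const e = normalize(expected).split('\n');
  const n = Math.max(a.length, e.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== e[i]) return i + 1;
  }
  return 0;
}

// Hypothetical usage: compare a generated file with a checked-in expected file.
function checkGolden(actualPath, expectedPath) {
  return firstDifference(
    fs.readFileSync(actualPath, 'utf8'),
    fs.readFileSync(expectedPath, 'utf8')
  );
}
```

Reporting the first differing line keeps failures easy to read for long scripts. The expected files would be generated once from PDFs whose output has been checked by hand.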

Limitations / issues

Open Questions

Initial Docs from Guy

I finally got together a simplified version of the PHP code which converts PDFs into objects (and optionally outputs as Fountain). It had code to write some other formats too, but none of that is really used anymore, and the .fdx generation is especially uninteresting since it just used some old 3rd party application to convert from .fountain to .fdx (so it’s just playing telephone).

To run it, you’ll need to install pdftohtml first, which I did via

brew install poppler

That puts it in /usr/local/bin/pdftohtml though that could be different for you.
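
Since the install location varies, the Node side could look the binary up instead of hard-coding a path. The sketch below is one way to do that using only Node's standard library; findExecutable is a hypothetical helper, not something the repo is known to contain.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Searches a PATH-style string for an executable file and returns its full path, or null.
function findExecutable(name, pathVar = process.env.PATH || '') {
  for (const dir of pathVar.split(path.delimiter)) {
    if (!dir) continue;
    const candidate = path.join(dir, name);
    try {
      fs.accessSync(candidate, fs.constants.X_OK);
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch {
      // Not present or not executable in this directory; keep looking.
    }
  }
  return null;
}
```

findExecutable('pdftohtml') returns the first match on the current PATH, or null if poppler is not installed, which gives the server a clear error to log at startup.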

Then to execute the PHP code from the CLI, the following command should work (replace the path to pdftohtml, or use just pdftohtml=pdftohtml if that finds it for you):

php analyzer/TestParser.php --fountain pdftohtml=/usr/local/bin/pdftohtml PATH_TO_PDF.pdf

That command generates a Fountain file, output.fountain. The more interesting part is to just dump the screenplay blocks it’s generating. It has multiple dump points, to show how it’s thinking along the way, but you can modify it to just print at the end with the print_r($parser->get_objects()); code:

php analyzer/TestParser.php -X1707 pdftohtml=pdftohtml PATH_TO_PDF.pdf
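
From the express controller, the CLI invocation could be wrapped along these lines. This is a sketch under assumptions: buildParserArgs, runParser and the opts fields are illustrative names, and only the --fountain flag and the pdftohtml=... argument are taken from the commands above. The real controller may call the parser differently.

```javascript
const { execFile } = require('node:child_process');

// Builds the argument list for the CLI invocation shown above.
// Only --fountain and pdftohtml=... come from the notes; the option names on `opts` are illustrative.
function buildParserArgs(pdfPath, opts = {}) {
  const args = ['analyzer/TestParser.php'];
  if (opts.fountain) args.push('--fountain');
  args.push(`pdftohtml=${opts.pdftohtml || 'pdftohtml'}`);
  args.push(pdfPath);
  return args;
}

// Runs the parser and resolves with its stdout. execFile passes arguments without a shell,
// so file names containing spaces or quotes are not reinterpreted.
function runParser(pdfPath, opts = {}) {
  return new Promise((resolve, reject) => {
    execFile('php', buildParserArgs(pdfPath, opts), { cwd: opts.cwd }, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}
```

Because uploaded file names are user input, using execFile rather than a shell string avoids command injection through the PDF path.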

Take a look at WD’s File > Export option for JSON - I think we’re going to want to use that syntax as the intermediate format the backend app generates directly from the get_objects() result, since it’s the most precise for what we’re doing.

But for testing, starting with Fountain is definitely fine.

And BTW, you’re welcome to deploy to GCR for testing, but that’s a relatively minor detail that we can do at any point. Just getting the Docker image ready such that it could be deployed is the main task. It’s totally fine if you’re just testing it locally, or want to serve the Docker image from another service if something else is easier for you to use, since Docker should behave the same anywhere it’s deployed.

