CodeForPhilly/balancer-backup

PDF Extraction Added to text_extraction Endpoint

Closed this issue · 1 comments

PDF extraction was added to the text_extraction endpoint on the dev server. The front end needs a few changes to work with the new version of text_extraction.

  1. The call is now multipart/form-data rather than application/json. When querying a webpage, the URL should be sent as the form field url (read on the server as request.form.url). Otherwise, functionality for URLs should remain largely the same.
  2. PDF files should be directly attached to request.files.pdf_file.
  3. Currently, only the first 3000 tokens of the webpage or PDF are processed. This is because the ChatGPT call can handle only 4000 tokens in total, shared between the question and the response. I am working on a way around this limitation.
  4. If the webpage or PDF is not related to bipolar disorder or bipolar medications, the endpoint should now respond with a message indicating that the webpage or PDF must be related to one of those topics.
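
For the front-end side of items 1 and 2, here is a minimal sketch of how the request body might be built. It is not the actual Balancer front-end code: the names ExtractionInput, buildExtractionForm and requestExtraction are made up for illustration, and the endpoint URL is passed in as a parameter because the real path is not stated in this issue. Only the field names url and pdf_file come from the description above.

```typescript
// Input is either a webpage URL or a PDF file (items 1 and 2 above).
// This type is hypothetical and exists only for this sketch.
type ExtractionInput =
  | { kind: "url"; url: string }
  | { kind: "pdf"; file: Blob; filename: string };

// Build the multipart/form-data body. The field names "url" and "pdf_file"
// match request.form.url and request.files.pdf_file from the issue text.
function buildExtractionForm(input: ExtractionInput): FormData {
  const form = new FormData();
  if (input.kind === "url") {
    form.append("url", input.url);
  } else {
    form.append("pdf_file", input.file, input.filename);
  }
  return form;
}

// Hypothetical sender. The endpoint is an argument because the real
// path on the dev server is an assumption left to the caller.
async function requestExtraction(
  endpoint: string,
  input: ExtractionInput
): Promise<Response> {
  // No Content-Type header is set on purpose: when the body is a FormData,
  // fetch adds multipart/form-data with the correct boundary itself.
  return fetch(endpoint, { method: "POST", body: buildExtractionForm(input) });
}
```

A file picked from an `<input type="file">` element can be passed as the `file` value directly, since `File` is a subtype of `Blob`. The shape of the response (including the off-topic message from item 4) is not specified here, so the sketch returns the raw `Response` and leaves parsing to the caller.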

Everything with this should be 👍🏻