
API to get text & text-without-stopwords from a given PDF

Primary Language: Python

pdf-scraper

pdf-scraper is an open-source API for:

  • extracting text from PDFs
  • extracting filtered-text from LinkedIn resumes

Demo

You can try the demo; it is live until Tuesday, 24 November 2020.

Installation for API (Ubuntu)

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
python3 app/main.py

Tasks

Code for each task is present in a separate folder; Task 4 is the repo itself.

Task 1
  • Downloaded 50 LinkedIn resumes and added to task_1 folder.
Task 2
  • Used the 'textract' library to get text from all the PDFs
  • The output from 'textract' is decoded as 'utf-8' to obtain a string
  • The resulting string is added row-wise to the CSV file
  • pdf_to_text.csv is generated by running the main.py script present in the task_2 folder
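The pipeline in the bullets above can be sketched as follows. This is a sketch, not the repo's exact task_2/main.py: the function and column names are illustrative, and the `extract` hook defaults to `textract.process`, which is textract's real entry point and returns raw bytes.

```python
import csv
import os

def pdfs_to_csv(pdf_dir, csv_path, extract=None):
    """Extract text from every PDF in pdf_dir and write one CSV row per file."""
    if extract is None:
        import textract  # third-party: pip install textract
        extract = textract.process  # returns raw bytes
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        for name in sorted(os.listdir(pdf_dir)):
            if not name.lower().endswith(".pdf"):
                continue  # skip non-PDF files
            raw = extract(os.path.join(pdf_dir, name))  # bytes
            writer.writerow([name, raw.decode("utf-8")])  # bytes -> str
```

The `extract` parameter simply makes the textract dependency swappable; calling `pdfs_to_csv("task_2/profile_pdfs", "pdf_to_text.csv")` would use textract itself.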

Instructions to run

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
Make a folder named 'profile_pdfs' in the task_2 folder and add the PDFs to it
python3 task_2/main.py
Task 3
  • Stop words are removed from the extracted text using NLTK's stop-word list (a stop word is a commonly used word, such as "the", "a", "an", or "in", that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the results of a search query)
  • Some words that are not in NLTK's stop-word list but occur frequently in resumes have to be added to it
  • So, by calculating the count of unique words across the 50 PDFs, we add the words that are present in at least 85% of the resumes to our new stop-word list
  • Then, for each word in the extracted text, we check whether it is present in our updated stop-word list
  • If it is present, it is removed
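The steps above can be sketched as follows. This is a sketch under stated assumptions: tokenisation here is plain whitespace splitting, and the base list would come from `nltk.corpus.stopwords.words('english')` after a one-time `nltk.download('stopwords')`.

```python
from collections import Counter

def extended_stopwords(texts, base_stopwords, threshold=0.85):
    """Extend base_stopwords with words present in >= threshold of the texts."""
    doc_freq = Counter()
    for text in texts:
        # set() so each word counts once per document, not per occurrence
        doc_freq.update(set(text.lower().split()))
    frequent = {w for w, n in doc_freq.items() if n / len(texts) >= threshold}
    return set(base_stopwords) | frequent

def remove_stopwords(text, stopwords):
    """Drop every word that appears in the (updated) stop-word list."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)
```

Computing *document* frequency (how many resumes contain the word) rather than raw counts is what makes the 85% threshold meaningful.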

Instructions to run

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
Make a folder named 'profile_pdfs' in the task_3 folder and add the PDFs to it
python3 task_3/main.py
Task 4
  • Using our functions from Task 2 & Task 3, we can build the required API
  • Running the Flask app presents two forms: one to get text from PDFs, and one to get filtered text from LinkedIn resumes (PDF)
  • The routes, namely 'text' & 'text_without_stopwords', perform these two tasks
  • The demo is hosted on PythonAnywhere (pdf-scraper) until Tuesday, 24 November 2020
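A minimal sketch of the Flask app described above. The two placeholder helpers stand in for the Task 2 extraction and Task 3 filtering (their bodies here are illustrative, so the sketch is self-contained); the route names match the bullets.

```python
from flask import Flask, request

app = Flask(__name__)

def extract_text(file_storage):
    # Placeholder for the Task 2 textract-based extraction; here we just
    # decode the uploaded bytes.
    return file_storage.read().decode("utf-8")

def remove_stopwords(text):
    # Placeholder for the Task 3 filtering with the extended stop-word list.
    stopwords = {"the", "a", "an", "in"}
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

@app.route("/text", methods=["POST"])
def text():
    pdf = next(iter(request.files.values()))  # the uploaded PDF
    return extract_text(pdf)

@app.route("/text_without_stopwords", methods=["POST"])
def text_without_stopwords():
    pdf = next(iter(request.files.values()))
    return remove_stopwords(extract_text(pdf))

if __name__ == "__main__":
    app.run()
```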

Installation for API (Ubuntu)

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
python3 app/main.py

Making Requests to API running on http://127.0.0.1:5000/

TYPE-1 extract-text-from-pdf
curl --location --request POST 'http://127.0.0.1:5000/text' \
--form 'file=@path-to-pdf-file-on-your-device'
TYPE-2 extract-text-without-stopwords-from-pdf
curl --location --request POST 'http://127.0.0.1:5000/text_without_stopwords' \
--form 'file=@path-to-pdf-file-on-your-device'
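For reference, the same requests can be made from Python with the third-party requests library. The multipart field name "file" is an assumption, not confirmed by the repo; substitute whatever field name the running app reads.

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

def post_pdf(endpoint, pdf_path):
    """POST a PDF to /text or /text_without_stopwords; return the response text."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/{endpoint}",
            files={"file": f},  # field name "file" is an assumption
        )
    resp.raise_for_status()
    return resp.text

# e.g. post_pdf("text", "resume.pdf")
# e.g. post_pdf("text_without_stopwords", "resume.pdf")
```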