
API to get text & text-without-stopwords from a given PDF

Primary Language: Python

pdf-scraper

pdf-scraper is an open-source API for:

  • extracting text from PDFs
  • extracting filtered-text from LinkedIn resumes

Demo

You can try the demo; it is live until Tuesday, 24 November 2020.

Installation for API (Ubuntu)

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
python3 app/main.py

Tasks

Code for each task is present in a separate folder; Task 4 is the repo itself.

Task 1
  • Downloaded 50 LinkedIn resumes and added to task_1 folder.
Task 2
  • Used the 'textract' library to get text from all the PDFs
  • The output from 'textract' is decoded as 'utf-8' to obtain a string
  • The resulting string is added row-wise to the CSV file
  • pdf_to_text.csv is generated by running the main.py script present in the task_2 folder
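The pipeline in the bullets above can be sketched as follows. This is a sketch, not the repo's exact task_2/main.py: the function and column names are illustrative, and the `extract` hook defaults to `textract.process`, which is textract's real entry point and returns raw bytes.

```python
import csv
import os

def pdfs_to_csv(pdf_dir, csv_path, extract=None):
    """Extract text from every PDF in pdf_dir and write one CSV row per file."""
    if extract is None:
        import textract  # third-party: pip install textract
        extract = textract.process  # returns raw bytes
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        for name in sorted(os.listdir(pdf_dir)):
            if not name.lower().endswith(".pdf"):
                continue  # skip non-PDF files
            raw = extract(os.path.join(pdf_dir, name))  # bytes
            writer.writerow([name, raw.decode("utf-8")])  # bytes -> str
```

The `extract` parameter simply makes the textract dependency swappable; calling `pdfs_to_csv("task_2/profile_pdfs", "pdf_to_text.csv")` would use textract itself.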

Instructions to run

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
Make a folder named 'profile_pdfs' in the task_2 folder and add the PDFs to it
python3 task_2/main.py
Task 3
  • Stop words are removed from the extracted text using NLTK's stop-word list (a stop word is a commonly used word, such as "the", "a", "an", or "in", that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the results of a search query)
  • Some words that are not in NLTK's stop-word list but occur frequently in resumes have to be added to it
  • So, by calculating the count of unique words across the 50 PDFs, we add the words that are present in at least 85% of the resumes to our new stop-word list
  • Then, for each word in the extracted text, we check whether it is present in our updated stop-word list
  • If it is present, it is removed
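The steps above can be sketched as follows. This is a sketch under stated assumptions: tokenisation here is plain whitespace splitting, and the base list would come from `nltk.corpus.stopwords.words('english')` after a one-time `nltk.download('stopwords')`.

```python
from collections import Counter

def extended_stopwords(texts, base_stopwords, threshold=0.85):
    """Extend base_stopwords with words present in >= threshold of the texts."""
    doc_freq = Counter()
    for text in texts:
        # set() so each word counts once per document, not per occurrence
        doc_freq.update(set(text.lower().split()))
    frequent = {w for w, n in doc_freq.items() if n / len(texts) >= threshold}
    return set(base_stopwords) | frequent

def remove_stopwords(text, stopwords):
    """Drop every word that appears in the (updated) stop-word list."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)
```

Computing *document* frequency (how many resumes contain the word) rather than raw counts is what makes the 85% threshold meaningful.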

Instructions to run

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
Make a folder named 'profile_pdfs' in the task_3 folder and add the PDFs to it
python3 task_3/main.py
Task 4
  • Using our functions from Task 2 & Task 3, we can build the required API
  • Running the Flask app presents two forms: one to get text from PDFs, and one to get filtered text from LinkedIn resumes (PDF)
  • The routes, namely 'text' & 'text_without_stopwords', perform these two tasks
  • The demo is hosted on PythonAnywhere (pdf-scraper) until Tuesday, 24 November 2020
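A minimal sketch of the Flask app described above. The two placeholder helpers stand in for the Task 2 extraction and Task 3 filtering (their bodies here are illustrative, so the sketch is self-contained); the route names match the bullets.

```python
from flask import Flask, request

app = Flask(__name__)

def extract_text(file_storage):
    # Placeholder for the Task 2 textract-based extraction; here we just
    # decode the uploaded bytes.
    return file_storage.read().decode("utf-8")

def remove_stopwords(text):
    # Placeholder for the Task 3 filtering with the extended stop-word list.
    stopwords = {"the", "a", "an", "in"}
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

@app.route("/text", methods=["POST"])
def text():
    pdf = next(iter(request.files.values()))  # the uploaded PDF
    return extract_text(pdf)

@app.route("/text_without_stopwords", methods=["POST"])
def text_without_stopwords():
    pdf = next(iter(request.files.values()))
    return remove_stopwords(extract_text(pdf))

if __name__ == "__main__":
    app.run()
```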

Installation for API (Ubuntu)

git clone https://github.com/aashutoshPanda/pdf_scraper.git
cd pdf_scraper
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
pip install -r requirements.txt
python3 app/main.py

Making Requests to API running on http://127.0.0.1:5000/

TYPE-1 extract-text-from-pdf
curl --location --request POST 'http://127.0.0.1:5000/text' \
--form 'file=@path-to-pdf-file-on-your-device'
TYPE-2 extract-text-without-stopwords-from-pdf
curl --location --request POST 'http://127.0.0.1:5000/text_without_stopwords' \
--form 'file=@path-to-pdf-file-on-your-device'
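For reference, the same requests can be made from Python with the third-party requests library. The multipart field name "file" is an assumption, not confirmed by the repo; substitute whatever field name the running app reads.

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

def post_pdf(endpoint, pdf_path):
    """POST a PDF to /text or /text_without_stopwords; return the response text."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/{endpoint}",
            files={"file": f},  # field name "file" is an assumption
        )
    resp.raise_for_status()
    return resp.text

# e.g. post_pdf("text", "resume.pdf")
# e.g. post_pdf("text_without_stopwords", "resume.pdf")
```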