Categorize PDF documents with one-word tags and show how many projects are devoted to each cause using country-specific, weighted word clouds.
- libnabo (no license)
- ocrmypdf (GNU General Public License)
- coreNLP (GNU General Public License)
- Swedish Python Routines (GNU General Public License)
pip install virtualenv
virtualenv venv
source venv/bin/activate (or venv\Scripts\activate.bat if on Windows)
- d-portal.org API (retrieving data on SIDA's activities)
- Jupyter Lab and Jupyter Notebook (running scripts online)
- ocrmypdf (running OCR on PDFs)
- coreNLP (language processing for English documents)
- Swedish Python Routines (language processing for Swedish documents)
- libnabo (running the k-nearest neighbor algorithm)
- Get IATI-identifiers of activities with known recipient country from d-portal.org
- Make API calls to get a JSON file containing the details of each activity
- Filter for completed activities using the activity status code
- Obtain the URLs of the PDF results documents tied to the activities, filtering by document format and report format code
- Determine if the language of the document is English or Swedish
- Run NLP on the document to do word stemming and remove stop words
- Filter for pre-defined keywords
- Use the k-nearest neighbor algorithm to calculate the weight of keywords
- Create a JSON file tying each country to its weighted keywords
- Fetch the JSON file from the webpage to display the results
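
The keyword-filtering and JSON-output steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the keyword list, stop words, country codes, and sample texts are hypothetical placeholders, stemming is omitted, and plain frequency counts stand in for the k-nearest neighbor weighting, whose details are not described here.

```python
import json
from collections import Counter, defaultdict

# Hypothetical placeholders, not the project's actual vocabulary.
KEYWORDS = {"health", "education", "water"}
STOP_WORDS = {"the", "and", "of", "in", "for", "to", "a"}


def keyword_counts(text):
    """Count pre-defined keywords in a text after dropping stop words."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    kept = [t for t in tokens if t and t not in STOP_WORDS]
    return Counter(t for t in kept if t in KEYWORDS)


def weights_by_country(documents):
    """Aggregate keyword counts per country.

    `documents` is an iterable of (country_code, text) pairs.
    The counts are a stand-in for the real keyword weights.
    """
    totals = defaultdict(Counter)
    for country, text in documents:
        totals[country].update(keyword_counts(text))
    return {country: dict(counts) for country, counts in totals.items()}


if __name__ == "__main__":
    docs = [
        ("KE", "Improved access to water and health services in the region."),
        ("KE", "Water supply for rural schools."),
        ("TZ", "Education programme for teachers."),
    ]
    # Emit the country-to-keywords mapping as JSON for the webpage to fetch.
    print(json.dumps(weights_by_country(docs), indent=2, sort_keys=True))
```

The resulting mapping has one entry per country, each holding a keyword-to-weight dictionary, which is a shape a word cloud script on the webpage could read directly.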
- Python3
- Bash ShellScript
- HTML
- CSS
- JavaScript
- Christoffer Klang
- Eric Kuan
- Mikael Zwahlen
- Sharon Yeo