My work during an internship at the UCD Centre for Digital Policy in 2022.
- Download the policy documents from the shared Google Drive and put them in data/nat-ai/orig
- See list of documents here: https://docs.google.com/spreadsheets/d/1e6nCWAKRSAo3cq4O-3WUKtFp5AR7up_cr8hY2jI12Zg/edit?usp=sharing
- Run ./pdf-to-txt.sh
- Text files should then be populated in data/nat-ai/text
- stopwordsiso - for stopword list
- NLTK - for lemmatizer
- scikit-learn - for classifying documents by topic (using Latent Dirichlet Allocation, LDA)
- pyLDAvis - for visualization
- Apache PDFBox 3 is required for text extraction from PDF.
- Download from https://pdfbox.apache.org/download.html
- Rename the jar file to pdfbox-app-3.jar and put it inside the lib/ directory
- Apache PDFBox 3 is licensed under the Apache License, Version 2.0
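
For readers who want to see what the extraction step might look like, the following is a rough Python sketch of a PDF-to-text loop. The contents of pdf-to-txt.sh are not shown here, so this is an assumption about its behaviour rather than a port of it: it presumes the script calls PDFBox's `export:text` command once per PDF, and the helper names `export_text_command` and `convert_all` are hypothetical.

```python
import subprocess
from pathlib import Path

# Locations taken from the setup notes above.
PDFBOX_JAR = Path("lib/pdfbox-app-3.jar")
ORIG_DIR = Path("data/nat-ai/orig")
TEXT_DIR = Path("data/nat-ai/text")


def export_text_command(pdf_path: Path) -> list[str]:
    """Build the PDFBox command line that extracts one PDF to a .txt file."""
    out_path = TEXT_DIR / (pdf_path.stem + ".txt")
    return [
        "java", "-jar", str(PDFBOX_JAR), "export:text",
        "-i", str(pdf_path),
        "-o", str(out_path),
    ]


def convert_all() -> None:
    """Run the extraction for every PDF found in ORIG_DIR (requires Java)."""
    TEXT_DIR.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(ORIG_DIR.glob("*.pdf")):
        # check=True raises CalledProcessError if PDFBox exits with an error.
        subprocess.run(export_text_command(pdf_path), check=True)
```

Calling `convert_all()` from the repository root would then write one text file per PDF into data/nat-ai/text, assuming Java is installed and the jar is in place.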