This code is an NLP pipeline for topic modeling large collections of documents. It is generalizable to any text data set once the data are formatted properly (see `src-gen/raw_corpus2tsv.py` for an example processing script).
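The exact tabular format the pipeline expects is defined by `raw_corpus2tsv.py`; purely as an illustration (the column names below are assumptions, not the script's actual schema), a per-document TSV could be produced with pandas like this:

```python
import pandas as pd

# Hypothetical schema: one row per document, tab-separated.
# The real column names/fields are whatever raw_corpus2tsv.py emits.
docs = pd.DataFrame({
    "doc_id": ["1803-11-22_001", "1803-11-22_002"],
    "text": [
        "The first speech of the sitting ...",
        "A reply from the opposition bench ...",
    ],
})
docs.to_csv("my_corpus.tsv", sep="\t", index=False)
```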
This code is run on a compute cluster at Brown University using the Slurm scheduler. Minor adjustments should allow it to run on other compute clusters or locally.
### Run locally

- Install the software/packages listed in the Requirements section, change paths, and comment out compute-cluster-specific code (e.g. `os.system('module load *')`); see the sketch after this list for one way to switch such calls off instead
- Toggle switches for the desired pipeline steps in `main.py`
- Run `python main.py`
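One way to make the cluster-specific calls easy to disable locally, rather than deleting them, is a simple guard. This is a sketch only; the flag below is not something `main.py` actually defines:

```python
import os

# Hypothetical flag: set False when running outside the Brown cluster.
ON_CLUSTER = False

if ON_CLUSTER:
    # Cluster-only environment setup; skipped in local runs.
    os.system('module load mallet')  # module name is a placeholder
```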
### Run on a compute cluster

- Set up the compute cluster with the software/packages listed in the Requirements section and change any paths or compute-cluster-specific code (e.g. `os.system('module load *')`)
- Toggle switches for the desired pipeline steps in `main.py` (the sketch after this list illustrates these switches)
- Adjust resources in `sbatch.sh` and run `sbatch sbatch.sh`
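The switches in `main.py` are plain on/off controls over the pipeline stages. The variable names below are illustrative only, not the actual names used in `main.py`:

```python
# Illustrative stage toggles; the real switch names live in src/main.py.
run_raw_to_tsv    = False  # skip if the corpus is already in tabular form
run_preprocess    = True   # Hansard-specific and generic cleaning
run_mallet_import = True   # convert the cleaned TSV to MALLET's binary format
run_train_lda     = True   # train the LDA topic model
```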
### Repository structure

- `data/`: contains tabular data output from all steps in the pipeline and two additional folders
  - `sentiment/`: contains positively and negatively charged adjective lists (not used in this pipeline)
  - `stoplists/`: contains generic and custom stopword lists
- `src/`: contains the code
  - `sbatch.sh`: script to run the pipeline on a compute cluster in batch mode
  - `main.py`: top-level script with switches to control the other scripts/parts of the pipeline
  - `raw_corpus2tsv.py`: custom Hansard script to transform raw XML data into tabular data
  - `preprocess.py`: both custom Hansard cleaning and generic data cleaning functions
  - `mallet_import_from_file.sh`: MALLET command to import data to MALLET format on a compute cluster
  - `mallet_train_lda.sh`: MALLET command to train an LDA model on MALLET-format data on a compute cluster (an illustrative sketch of both MALLET calls follows this list)
- `test/`: contains 10 sample Hansard XML data files
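The two MALLET shell scripts wrap MALLET's standard `import-file` and `train-topics` commands. The flags and paths actually used are set inside `mallet_import_from_file.sh` and `mallet_train_lda.sh`, so the calls below are only a hedged sketch with placeholder values:

```python
import os

# Placeholder paths and parameters; the real values are in the .sh scripts.
os.system(
    "mallet import-file --input data/corpus.tsv --output data/corpus.mallet "
    "--keep-sequence --remove-stopwords"
)
os.system(
    "mallet train-topics --input data/corpus.mallet --num-topics 50 "
    "--output-topic-keys data/topic_keys.txt "
    "--output-doc-topics data/doc_topics.txt"
)
```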
### Requirements

- Python 3.6.1
- Anaconda 3-5.2.0
- MALLET
- numpy
- pandas
- sys
- os
- time
- string
- csv
- pickle
- sklearn
- nltk
- enchant
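nltk and enchant support the generic cleaning step in `preprocess.py`; the function below is a minimal sketch of that kind of cleaning (its name, the stopword list, and the dictionary language are assumptions, not the script's actual logic):

```python
import string
import enchant
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
STOPWORDS = set(stopwords.words('english'))
DICTIONARY = enchant.Dict('en_GB')  # dictionary language is an assumption

def clean_tokens(text):
    """Lowercase, strip punctuation, drop stopwords and non-dictionary tokens."""
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [tok for tok in text.split()
            if tok not in STOPWORDS and DICTIONARY.check(tok)]

print(clean_tokens("The Hon. Member rose to speak on the corn laws."))
```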