/psy-topic-models

Topic modelling of a large psychology corpus with LDA


Psychology topic models

Requires: Docker, Node.js, Python

1. Parsing and storing in the database

in pubmed-to-db:

1.1 Install dependencies: npm install
1.2 Start database: docker-compose up
1.3 Create database table structure: node db-setup.js
1.4 Download the XML results from PubMed: https://www.ncbi.nlm.nih.gov/pubmed/?term=psychology
1.5 Parse file: node parse-from-xml.js <path-to-xml-file>

(<path-to-xml-file> is the path to the XML export downloaded from PubMed in step 1.4)
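
The repository does the parsing in Node (parse-from-xml.js). Purely to illustrate the layout of a PubMed XML export, here is a minimal Python sketch. It assumes the standard PubmedArticleSet/PubmedArticle element names; the sample record and the extract_records helper are made up for this example, and the fields the Node script actually stores may differ.

```python
import xml.etree.ElementTree as ET

# Made-up record in the layout of a PubMed XML export (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_records(xml_text):
    """Return one dict per PubmedArticle with pmid, title and abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        # Structured abstracts are split into several AbstractText elements.
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        records.append({"pmid": pmid, "title": title, "abstract": abstract})
    return records


records = extract_records(SAMPLE_XML)
print(records)
```

For a full export, which can be large, ET.iterparse would avoid loading the whole file into memory at once.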

2. Preprocessing

in topic-modelling:

2.1 Install modules: pip install nltk spacy gensim
2.2 Download the NLTK stopwords: python download_stopwords.py
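2.3 Download the spaCy English model: python -m spacy download en (spaCy 3 and later dropped the en shortcut; there the model is installed under a full name such as en_core_web_sm)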
2.4 Preprocess the texts: python preprocess.py
2.5 Transform corpus to dictionary and bag-of-words structure: python -c "from transform_corpus import *; save_corpus_and_dictionary_to_file()"
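
To make the "dictionary and bag-of-words structure" in step 2.5 concrete, here is a minimal sketch in plain Python. It is not the code in transform_corpus.py; the helper names and the toy documents are made up for illustration. gensim's Dictionary and doc2bow produce the same kind of (token_id, count) pairs, although the ids it assigns may differ.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy preprocessed documents (hypothetical, already tokenised).
docs = [["memory", "recall", "memory"], ["anxiety", "recall"]]
token2id = build_dictionary(docs)
corpus = [doc2bow(doc, token2id) for doc in docs]
print(token2id)  # {'memory': 0, 'recall': 1, 'anxiety': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```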

3. LDA

in topic-modelling:

3.1 Download MALLET and set the MALLET_PATH environment variable
3.2 Generate a topic model: python generate_model.py <'gensim' | 'mallet'> ...topic_number_configurations
3.3 Calculate coherence scores: python get_coherence.py ...model_names
3.4 Extract topics to a CSV file: python extract_topics.py <model-name> <number of topics>
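
As background for step 3.3, the sketch below computes one common coherence measure, UMass, in plain Python. Which measure get_coherence.py uses is not stated here, so treat this as an illustration of the idea only. The function name and toy documents are made up, and the sketch assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass coherence of one topic: sum over word pairs of
    log((D(earlier, later) + 1) / D(earlier)), where D counts documents."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations keeps list order, so `earlier` is the higher-ranked word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Toy preprocessed documents (hypothetical).
docs = [["memory", "recall"], ["memory", "anxiety"], ["recall", "memory"]]
print(umass_coherence(["memory", "recall"], docs))
print(umass_coherence(["memory", "anxiety"], docs))
```

Scores closer to zero mean the top words of a topic tend to appear in the same documents; comparing scores across models with different topic counts is one way to choose among the configurations generated in step 3.2.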