Explorations in word coherence for topic discovery
Example commands for generating MALLET files
bin/mallet import-file --input ~/coherence/docs/sotu/sotu_1000_cleaned.txt --keep-sequence --output sotu-1000.mallet
bin/mallet train-topics --input sotu-1000.mallet --num-topics 10 --random-seed 42 --num-iterations 1000 --word-topic-counts-file word-topic-counts-sotu-1000.txt
Currently, word pairs are duplicated in construction of PMI table i.e. pmi_dict[wi][wj] = pmi_dict[wj][wi] ==> Should we keep it this way?
For any metric that involves averaging over word pairs, there is the possibility that a word pair has no entry in the PMI lookup table. Currently, if that is the case, the score is considered to be zero. ==> Possible solution: truncate all negative PMIs to zero
Brendan:
Figure out why vocabulary of PMI-based clusters and MALLET clusters are slightly off
- compare vocabulary
Average coherence score over topics
Make comparison table of SOTU PMI and Wiki PMI for word pairs
- also make scatter plot of (SOTU PMI, Wiki PMI)
Check how often you get no co-occurence counts when calculating PMI for both training and evaluation of coherence
Statistical analysis of cluster coherences for different LDA runs
- variance
Implement greedy cluster over most frequent words (heuristic method from Percy Liang's thesis)
Try approximations for disjunction
Add notes about data structure of JSON files
Put counts next to words when printing clusters
Save counts with clusters
Infer number of clusters from MALLET file
Add score option to evaluate_coherence.py
Re-write code to process counts at the same time as documents