PubMedLDA

Making it easy to perform LDA on PubMed abstracts.

This code performs topic modeling on PubMed abstracts using latent Dirichlet allocation (LDA).
It depends on the gensim and nltk packages.
The package has two parts:
1. Retrieve PubMed abstracts using the script getpmAbstracts.py:

Usage: python getpmAbstracts.py -q [query] -o [output] -s [flag for stemming]

Options:
  -h, --help            show this help message and exit
  -q QUERY, --query=QUERY
                        Enter the PubMed query (PubMed style queries
                        supported)
  -o OFILE, --output=OFILE
                        Enter the output file name to store result
  -s, --stem            To stem result
2. Once you have the PubMed abstracts in a file, run LDA over them using the script gensim_lda_pubmed.py:

Usage: python gensim_lda_pubmed.py -i [inputfile] -k [number of topics to extract] -v [verbose output TRUE/FALSE (default FALSE)] -t [TFIDF weighting TRUE/FALSE (default TRUE)] -r [return topics per document TRUE/FALSE (default FALSE)]

Options:
  -h, --help            show this help message and exit
  -i IFILE, --inputfile=IFILE
                        Enter the file containing PubMed abstracts
  -k NTOP, --numtopics=NTOP
                        Number of topics
  -t TFIDF, --tfidf=TFIDF
                        TFIDF weighting (default TRUE)
  -v VERBOSE            Verbose Output TRUE/FALSE (default FALSE)
  -r FIT                Return topics per document TRUE/FALSE (default FALSE)

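The option listings above have the layout that Python's optparse module prints. Assuming the scripts use optparse, the interface of getpmAbstracts.py could be reproduced roughly as follows. This is a sketch only: the `build_parser` function and the `dest` names are guesses based on the help text, and the example query is made up.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser that mimics the documented getpmAbstracts.py options.

    Hypothetical reconstruction: the dest names are inferred from the
    metavars (QUERY, OFILE) shown in the help text.
    """
    parser = OptionParser(
        usage="python getpmAbstracts.py -q [query] -o [output] -s"
    )
    parser.add_option(
        "-q", "--query", dest="query",
        help="Enter the PubMed query (PubMed style queries supported)",
    )
    parser.add_option(
        "-o", "--output", dest="ofile",
        help="Enter the output file name to store result",
    )
    # -s takes no value; its presence switches stemming on.
    parser.add_option(
        "-s", "--stem", dest="stem", action="store_true", default=False,
        help="To stem result",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch runs as-is.
options, args = build_parser().parse_args(
    ["-q", "asthma[MeSH Terms]", "-o", "abstracts.txt", "-s"]
)
print(options.query, options.ofile, options.stem)
```

Passing a list to `parse_args` keeps the example independent of the real command line. A real script would call `parse_args()` with no arguments.
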
All code in this project is licensed under a Creative Commons license.