This is code for performing topic modeling on PubMed abstracts. It depends on the gensim and nltk packages.

The package contains two parts:

1. Retrieve PubMed abstracts using the script getpmAbstracts.py

   Usage:
       python getpmAbstracts.py -q [query] -o [output] -s [flag for stemming]

   Options:
       -h, --help            show this help message and exit
       -q QUERY, --query=QUERY
                             Enter the PubMed query (PubMed style queries supported)
       -o OFILE, --output=OFILE
                             Enter the output file name to store result
       -s, --stem            To stem result

2. Once you have the PubMed abstracts in a file, run LDA over them:

   Usage:
       python gensim_lda_pubmed.py -i [inputfile] -k [number of topics to extract] -v [verbose output FALSE by default] -t [TRUE/FALSE for TFIDF weights] -r [return topics per document TRUE/FALSE (default FALSE)]

   Options:
       -h, --help            show this help message and exit
       -i IFILE, --inputfile=IFILE
                             Enter the file containing PubMed abstracts
       -k NTOP, --numtopics=NTOP
                             Number of topics
       -t TFIDF, --tfidf=TFIDF
                             TFIDF weighting (default TRUE)
       -v VERBOSE            Verbose Output TRUE/FALSE (default FALSE)
       -r FIT                Return topics per document TRUE/FALSE (default FALSE)

All the code in this project is under a Creative Commons license.
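
The two steps described above can also be chained from Python. The following is a minimal sketch, not part of the project: the helper names build_commands and run_pipeline are hypothetical. It assumes both scripts sit in the current directory, that "python" on the PATH is the interpreter they expect, and that the TRUE/FALSE options accept exactly those strings as the usage text suggests.

```python
import subprocess


def build_commands(query, abstracts_file, num_topics, stem=False,
                   tfidf=True, verbose=False, per_document=False,
                   python="python"):
    """Return the argument lists for the two documented steps.

    Hypothetical helper: only the script names and option letters come
    from the README; everything else is an illustrative assumption.
    """
    def flag(value):
        # The LDA script documents its boolean options as TRUE/FALSE strings.
        return "TRUE" if value else "FALSE"

    # Step 1: fetch the abstracts for the query into abstracts_file.
    fetch = [python, "getpmAbstracts.py", "-q", query, "-o", abstracts_file]
    if stem:
        fetch.append("-s")  # -s is a plain switch and takes no value

    # Step 2: run LDA over the file written by step 1.
    lda = [python, "gensim_lda_pubmed.py",
           "-i", abstracts_file,
           "-k", str(num_topics),
           "-t", flag(tfidf),
           "-v", flag(verbose),
           "-r", flag(per_document)]
    return fetch, lda


def run_pipeline(*args, **kwargs):
    """Run both steps in order; a failing step raises CalledProcessError."""
    for command in build_commands(*args, **kwargs):
        subprocess.run(command, check=True)
```

For example, run_pipeline("malaria vaccine", "abstracts.txt", 10, stem=True) would fetch stemmed abstracts for an illustrative query and then extract 10 topics from them. Passing the arguments as a list avoids shell quoting problems when the query contains spaces.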