Project Name: A Simple Text Mining Tool for Analyzing Research Paper Abstracts

Description:
This project is a text mining tool that works on search results from the National Center for Biotechnology Information's PubMed database (http://www.ncbi.nlm.nih.gov/pubmed). It uses Perl and Python for text processing and statistical analysis.

Modules and files (not all):
pubmed_result.txt -- results downloaded from PubMed
export.csv -- results exported from IEEE Xplore
raw_data.json -- results selected from pyspider's SQLite database
keywords.txt -- keywords provided by the user
preProcess.pl -- takes pubmed_result.txt as input; converts it into a format that is easier to process later
csvParser.py -- takes export.csv as input; converts the data to JSON and appends it to raw_data.json; only supports the IEEE Xplore format
jsonParser.pl -- takes raw_data.json as input; converts the data into the same format that preProcess.pl produces
splitFunction.pl -- core split function; shared by the raw data parsers
myFormat.txt -- generated by preProcess.pl; jsonParser.pl appends its data after preProcess.pl has run
stem.pl -- takes myFormat.txt as input; stems each word in every sentence
stemDict.txt -- stemmed words and their corresponding original words; generated by stem.pl
stemmedSentence.txt -- stemmed words in sentences; generated by stem.pl
selectSentence.pl -- takes stemmedSentence.txt as input; uses stemKeyword.pl as a sub-module; goes through all stemmed sentences and selects those that contain the given keywords; if no keywords are provided, myFormat.txt is taken as the result instead
stemKeyword.pl -- takes keywords.txt as input; stems the keywords; subsidiary module of selectSentence.pl
stemFunction.pl -- core stem function; Porter stemmer
pmidList.txt -- PMID list file; generated by selectSentence.pl
dict.py -- takes stemDict.txt as input; eliminates stop words and computes simple statistics
stats_words.txt -- stemmed words and their frequencies; generated by dict.py
htmlGenerator.py -- uses pmidList.txt to generate a simple webpage for easy database access
PMIDList.html -- simple webpage containing titles and URLs
nextStep.py -- accesses the original raw data; extracts the original entries of the articles listed in pmidList.txt; keeps the original format: MEDLINE or JSON

HOWTO:
= Generate Raw Data
==== pubmed_result.txt
1. Make a search on http://www.ncbi.nlm.nih.gov/pubmed.
2. Press "Send to" at the top right of the page and select "File" & "MEDLINE". Press "Create File".
3. Put the file "pubmed_result.txt" into the same directory as this code.
==== raw_data.json
1. Go to pyspider's data folder.
2. Type "sqlite3 result.db" in the command line.
3. Type ".output raw_data.json", then "select result from resultdb_YourProjectName;" (You may want to type ".tables" to check the current resultdb tables before selecting.)
4. Type ".quit" and copy "raw_data.json" into the same directory as this code.
==== export.csv
1. Make a search on http://ieeexplore.ieee.org/Xplore/home.jsp
2. Press "Export to CSV" to the right of "Download Citations".
3. Rename the file to "export.csv" and save it into the same directory as this code.

= Use This Tool
1. Type make<RETURN> in the command line; this may take several minutes, depending on the size of the raw data
2. Type make<SPACE>html<RETURN> in the command line to generate PMIDList.html. Use any web browser to open PMIDList.html for easy access
3. Type make<SPACE>next<RETURN> in the command line to back up the current data and generate new raw data for a new round of make
4. Change the keywords in keywords.txt and go to step 1
5. Revert manually according to the timestamps in the backup folder if needed

= Installation (Ubuntu as example):
#install perl, python and make.
#you can install build-essential too.
$ sudo apt-get install perl python make
#install CPAN for perl modules
$ sudo perl -MCPAN -e shell
#press <RETURN> until the installation is finished
$ sudo cpan
cpan[1]> install Lingua::EN::Sentence
cpan[2]> install Unicode::Normalize
#quit cpan shell
cpan[3]> exit
#DONE
#install sqlite command line tool
$ sudo apt-get install sqlite3

For details about pyspider, please see: https://github.com/binux/pyspider

LICENSE:
See LICENSE
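Example (illustrative only):
The htmlGenerator.py step listed above turns pmidList.txt into a page of titles and links. The sketch below shows one way such a page could be built; it is not the project's actual code. The layout of pmidList.txt is not documented here, so the sketch assumes one "PMID<TAB>title" entry per line, and it assumes an article page can be reached by appending the PMID to the PubMed URL given in the description. The function names parse_pmid_lines and build_html are hypothetical.

```python
import html

# Assumption: an article page is the PubMed base URL followed by the PMID.
PUBMED_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"


def parse_pmid_lines(lines):
    """Parse 'PMID<TAB>title' lines (an assumed layout) into (pmid, title) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pmid, _, title = line.partition("\t")
        entries.append((pmid.strip(), title.strip()))
    return entries


def build_html(entries):
    """Return a minimal HTML page with one PubMed link per entry."""
    items = []
    for pmid, title in entries:
        url = PUBMED_URL + html.escape(pmid, quote=True)
        label = html.escape(title or pmid)  # fall back to the PMID when no title is given
        items.append('<li><a href="{}">{}</a></li>'.format(url, label))
    return (
        "<html><head><title>PMID List</title></head><body><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical sample data, not real search results.
    sample = ["12345678\tAn example title", "23456789"]
    print(build_html(parse_pmid_lines(sample)))
```

Titles are escaped before being written into the page, so characters such as "<" or "&" in a title do not break the generated HTML.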
wsheffel/PubMed-Text-Mining-Tool
Perl, GPL-2.0