IPython notebooks to accompany Chapter 5 of the upcoming textbook Text Mining and Visualization: Case Studies Using Open-Source Tools
Slides for PyData Berlin Conference May 2015 here
Video of PyData Berlin Conference talk here
Slides for Python Ireland Meetup November 2015 here
Installation instructions, assuming Anaconda and Git are already installed on your machine.
- git clone https://github.com/iBrianCarter/pillreports_python
- cd pillreports_python
The next commands create an environment with all the relevant packages and start the notebook server. The packages can also be installed manually.
- conda create -n pillreports_python --file package-list
- activate pillreports_python
- jupyter notebook
When you are finished, don't forget to deactivate the environment from the command line. The second command removes the environment entirely.
- deactivate
- conda env remove -n pillreports_python
List of the library versions used in development. Installed library versions must be equal to or greater than these for the notebooks to function as described.
Initial scrape of data from the website. Data is stored in a MongoDB collection, so a MongoDB instance must be running. MongoDB can be downloaded from http://mongodb.org/downloads . The collection name is created in the format pillreports_%d%b%y, e.g. pillreports_31Mar15.
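The example name pillreports_31Mar15 corresponds to the `strftime` directives `%d` (day), `%b` (abbreviated month name) and `%y` (two-digit year). A minimal sketch of building such a name follows; `collection_name` is a hypothetical helper, not a function taken from the notebooks.

```python
from datetime import datetime


def collection_name(when=None):
    """Build a collection name such as 'pillreports_31Mar15'.

    Hypothetical helper for illustration; the notebooks may build the name differently.
    """
    when = when or datetime.now()
    # %d = zero-padded day, %b = abbreviated month name (locale-dependent),
    # %y = two-digit year
    return "pillreports_" + when.strftime("%d%b%y")


print(collection_name(datetime(2015, 3, 31)))
```

Because `%b` depends on the active locale, the month abbreviation may differ on machines not using an English locale.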
A MongoDB instance must be running. Data is read from the MongoDB collection; specify its name as above. Data is saved to a .csv file (prReports.csv), which the remaining notebooks read. An initial scrape of 5001 records is provided in Data/prReports.csv for replicating the research presented.
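The export step can be sketched roughly as follows, assuming the records arrive as an iterable of dictionaries (which is what a pymongo cursor from `collection.find()` yields). `write_reports_csv` is a hypothetical helper and the sample records are made up for illustration; only the field names `Description:`, `Warning:` and `User Report:` come from this README.

```python
import csv
import io


def write_reports_csv(documents, out_file):
    """Write an iterable of dict-like records to CSV and return the row count."""
    documents = list(documents)
    # Collect the union of field names in first-seen order, since records in a
    # document store need not all share the same fields. MongoDB's internal
    # "_id" key is left out of the export.
    fieldnames = []
    for doc in documents:
        for key in doc:
            if key != "_id" and key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(
        out_file, fieldnames=fieldnames, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for doc in documents:
        writer.writerow(doc)
    return len(documents)


# Stand-in records (invented values) in place of a live MongoDB cursor.
sample = [
    {"_id": 1, "Description:": "Blue round pill", "Warning:": "no"},
    {"_id": 2, "Description:": "White capsule", "Warning:": "yes",
     "User Report:": "Mild effects"},
]
buffer = io.StringIO()
count = write_reports_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open("prReports.csv", "w", newline="")` and pass the file object instead of the in-memory buffer.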
Data read from prReports.csv file. Data exploration provided.
Data read from prReports.csv file. Naive Bayes and Stochastic Gradient Descent learning applied to predict Warning: field. Varying methods of vectorization applied to generate input features from Description: field.
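To illustrate the idea of predicting the Warning: field from Description: text, here is a small pure-Python multinomial Naive Bayes over word counts with Laplace smoothing. It is a stand-in written for this README, not the notebook's implementation (which may rely on a machine-learning library), it does not cover Stochastic Gradient Descent, and the training texts and labels are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()  # naive whitespace tokenisation
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed log likelihood of the word
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: description text paired with a yes/no warning label.
texts = ["felt sick and dizzy", "bad headache felt sick",
         "clean and happy night", "happy clean experience"]
labels = ["yes", "yes", "no", "no"]
model = train_naive_bayes(texts, labels)
print(predict(model, "sick and dizzy"), predict(model, "happy clean"))
```

Raw word counts are only one way to turn text into features; binary presence or TF-IDF weights are common alternatives when comparing vectorization methods.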
Clustering and PCA applied to vector representation of the User Report: field. Scatterplot and wordcloud visualisation of results.
Quick implementation showing LDA Topic Models with gensim and visualisation with pyLDAvis.
Appendix file illustrating the creation of sparse matrix representation of text data.
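As a rough illustration of what a sparse representation of text looks like, the sketch below builds a term-count matrix in compressed sparse row (CSR) form using plain Python lists. `to_csr` is a hypothetical helper and the two documents are made up; the appendix notebook may construct its matrix differently.

```python
def to_csr(documents):
    """Build CSR arrays (data, indices, indptr) of term counts.

    Only non-zero counts are stored: `indices` holds column (term) ids,
    `data` holds the matching counts, and `indptr[i]:indptr[i + 1]`
    delimits the entries belonging to document i.
    """
    vocabulary = {}  # term -> column id, assigned in first-seen order
    data, indices, indptr = [], [], [0]
    for doc in documents:
        counts = {}
        for token in doc.lower().split():
            col = vocabulary.setdefault(token, len(vocabulary))
            counts[col] = counts.get(col, 0) + 1
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return vocabulary, data, indices, indptr


# Invented example documents.
docs = ["red pill red logo", "blue pill"]
vocabulary, data, indices, indptr = to_csr(docs)
print(vocabulary)
print(data, indices, indptr)
```

The same three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr))` if SciPy is available.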
Appendix file illustrating various methods of creating subplots with Matplotlib library.