wordpress-blog-text-mining

Topic modeling and word cloud generation from WordPress blog posts

Python's wordpress_xmlrpc package supports extraction of WordPress site data, provided you have the URL, username and password for the WP site.

Here I've extracted all my published WordPress blog posts from https://trustmeyourealive.com - 50 of them, with a total of 32251 words (it isn't much, but I'm hoping to extend this long term, so I will have more data as I publish more :-) ). The GetPosts() call also has an option to extract your drafts.
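A minimal sketch of the extraction step, in case it helps to see the shape of the calls. This is not the exact code in the repository: the function names, the batch size and the regex-based HTML stripping are my own assumptions for illustration, while Client, GetPosts and the number/offset/post_status filter come from wordpress_xmlrpc.

```python
import re
from html import unescape


def fetch_published_posts(url, username, password, batch_size=20):
    """Return all published posts from a WordPress site over XML-RPC.

    Hypothetical helper: `url` is assumed to be the site's XML-RPC
    endpoint, which usually ends in /xmlrpc.php.
    """
    # Imported lazily so the text helpers below work without the package.
    from wordpress_xmlrpc import Client
    from wordpress_xmlrpc.methods.posts import GetPosts

    client = Client(url, username, password)
    posts, offset = [], 0
    while True:
        # GetPosts returns one page at a time, so walk the offset forward.
        batch = client.call(GetPosts({
            "number": batch_size,
            "offset": offset,
            "post_status": "publish",  # use "draft" to fetch drafts instead
        }))
        if not batch:
            break
        posts.extend(batch)
        offset += len(batch)
    return posts


def plain_text(html):
    """Strip tags and HTML entities from a post body (rough regex approach)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def word_count(texts):
    """Total number of whitespace-separated words across all texts."""
    return sum(len(t.split()) for t in texts)
```

With real credentials, something like `texts = [plain_text(p.content) for p in fetch_published_posts(url, user, password)]` followed by `word_count(texts)` would give the kind of total quoted above.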

Modules in the repository and the Python libraries they need -

Data extraction using wordpress_xmlrpc
Word cloud generation using wordcloud - nouns, verbs and adjectives tagged & extracted with TextBlob
Topic modeling using Latent Dirichlet Allocation (gensim, nltk, spaCy) - 12 generated topics visualized using pyLDAvis (see topicmodeling_vis.html)
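
The word cloud step in the list above can be sketched roughly as follows. Treat this as an illustration under assumptions rather than the repository's code: the helper names, the POS_PREFIXES mapping and the image size are hypothetical. TextBlob's `.tags` yields (word, tag) pairs with Penn Treebank tags, so nouns, verbs and adjectives can be picked out by the tag prefixes NN, VB and JJ. TextBlob also needs its corpora downloaded once (`python -m textblob.download_corpora`).

```python
from collections import Counter

# Penn Treebank tag prefixes for each part of speech (illustrative mapping).
POS_PREFIXES = {"nouns": "NN", "verbs": "VB", "adjectives": "JJ"}


def words_with_tag(tagged, prefix):
    """Keep lowercased words whose POS tag starts with `prefix`."""
    return [word.lower() for word, tag in tagged if tag.startswith(prefix)]


def save_word_cloud(text, kind, path):
    """Render a word cloud of one part of speech to an image file.

    Hypothetical helper; `kind` is a key of POS_PREFIXES.
    """
    # Imported lazily so words_with_tag can be used without these packages.
    from textblob import TextBlob
    from wordcloud import WordCloud

    tagged = TextBlob(text).tags  # list of (word, tag) tuples
    freqs = Counter(words_with_tag(tagged, POS_PREFIXES[kind]))
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs).to_file(path)
```

Calling `save_word_cloud(all_text, "nouns", "nouns.png")` once per part of speech would give one image for each of the three groups.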

pyLDAvis visualization - http://htmlpreview.github.io/?https://github.com/parvathysarat/wordpress-blog-text-mining/blob/master/topicmodeling_vis.html
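
For reference, a rough sketch of how a 12-topic LDA model and a visualization like the one linked above could be produced. This is an assumption-laden illustration, not the repository's pipeline: it swaps the nltk/spaCy preprocessing for a plain regex tokenizer with a tiny hand-written stopword set, and the function names, `passes` and `seed` values are hypothetical.

```python
import re

# Tiny illustrative stopword set; a real run would use nltk's or spaCy's list.
STOPWORDS = {"the", "and", "for", "that", "with", "this",
             "but", "are", "was", "have", "not", "you"}


def tokenize(text, min_len=3):
    """Lowercase, keep alphabetic words, drop short words and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in STOPWORDS]


def fit_lda(docs, num_topics=12, passes=10, seed=0):
    """Fit a gensim LDA model on raw document strings."""
    # Imported lazily so tokenize() works without gensim installed.
    from gensim import corpora, models

    tokens = [tokenize(d) for d in docs]
    dictionary = corpora.Dictionary(tokens)            # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in tokens]   # bag-of-words vectors
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                          passes=passes, random_state=seed)
    return lda, corpus, dictionary


def save_visualization(lda, corpus, dictionary, path="topicmodeling_vis.html"):
    """Write the interactive pyLDAvis page to an HTML file."""
    import pyLDAvis
    # This module is named pyLDAvis.gensim in older pyLDAvis releases.
    import pyLDAvis.gensim_models as gensimvis

    pyLDAvis.save_html(gensimvis.prepare(lda, corpus, dictionary), path)
```

Usage would be along the lines of `lda, corpus, dictionary = fit_lda(texts)` followed by `save_visualization(lda, corpus, dictionary)`. With only 50 posts the topics depend heavily on preprocessing and the random seed.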