Topic modeling and word cloud generation from WordPress blogposts
Python's wordpress_xmlrpc package supports extracting WordPress site data, provided you have the URL, username, and password for the WP site.
Here I've extracted all my published WordPress blog posts from https://trustmeyourealive.com: 50 of them, with a total of 32251 words. It isn't much, but I'm hoping to extend this long term, so there will be more data as I publish more :-). The GetPosts() call can also extract your drafts.
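As a rough illustration of the extraction step, here is a minimal sketch that pages through posts with `GetPosts` and counts words. The function names (`fetch_posts`, `strip_html`, `word_count`), the batch size, and the `/xmlrpc.php` endpoint path are my assumptions for the example, not necessarily what the repository's module does.

```python
import re
from html import unescape


def fetch_posts(url, username, password, status="publish", batch_size=20):
    """Page through posts with the given status (e.g. "publish" or "draft")."""
    # Imported lazily so the text helpers below work without the package installed.
    from wordpress_xmlrpc import Client
    from wordpress_xmlrpc.methods.posts import GetPosts

    client = Client(url, username, password)
    posts, offset = [], 0
    while True:
        page = client.call(
            GetPosts({"number": batch_size, "offset": offset, "post_status": status})
        )
        if not page:  # an empty page means there are no more posts
            break
        posts.extend(page)
        offset += batch_size
    return posts


def strip_html(content):
    """Replace tags with spaces, decode HTML entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", content)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def word_count(texts):
    """Total number of whitespace-separated words across all texts."""
    return sum(len(text.split()) for text in texts)


# Hypothetical usage (credentials and endpoint are placeholders):
# posts = fetch_posts("https://example.com/xmlrpc.php", "user", "password")
# texts = [strip_html(post.content) for post in posts]
# print(len(posts), word_count(texts))
```

Passing `status="draft"` should return drafts instead of published posts, assuming the account has permission to read them.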
Modules in the repository and the Python libraries they need:
- Data extraction using wordpress_xmlrpc
- Word cloud generation using wordcloud: nouns, verbs, and adjectives tagged and extracted with TextBlob
- Topic modeling using Latent Dirichlet Allocation (gensim, nltk, spaCy): 12 generated topics visualized using pyLDAvis (see topicmodeling_vis.html)
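The word cloud step in the list above could look roughly like the sketch below: TextBlob tags each word with a Penn Treebank part-of-speech tag, words are filtered by tag prefix (`NN` for nouns, `VB` for verbs, `JJ` for adjectives), and the result is passed to wordcloud. The helper names, image size, and output paths are assumptions for illustration.

```python
def words_by_pos(tagged_words, tag_prefix):
    """Keep lowercased words whose part-of-speech tag starts with tag_prefix."""
    return [word.lower() for word, tag in tagged_words if tag.startswith(tag_prefix)]


def make_pos_cloud(text, tag_prefix, out_path):
    """Render a word cloud image from the words in text matching tag_prefix."""
    # Imported lazily so words_by_pos can be used without these packages.
    # TextBlob also needs its corpora: python -m textblob.download_corpora
    from textblob import TextBlob
    from wordcloud import WordCloud

    words = words_by_pos(TextBlob(text).tags, tag_prefix)
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(" ".join(words))  # raises ValueError if no words matched
    cloud.to_file(out_path)
    return words


# Hypothetical usage, one cloud per part of speech:
# all_text = " ".join(texts)
# for prefix, name in [("NN", "nouns"), ("VB", "verbs"), ("JJ", "adjectives")]:
#     make_pos_cloud(all_text, prefix, f"wordcloud_{name}.png")
```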
pyLDAvis visualization - http://htmlpreview.github.io/?https://github.com/parvathysarat/wordpress-blog-text-mining/blob/master/topicmodeling_vis.html
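For readers who want to reproduce a visualization like this one, here is a minimal sketch of fitting a 12-topic LDA model with gensim and exporting it with pyLDAvis. The regex tokenizer and the tiny stopword set are simplifying stand-ins I chose for the example (the repository lists nltk and spaCy for preprocessing), and the `passes` and `random_state` values are assumptions.

```python
import re

# Tiny placeholder stopword list; a real run would use nltk's or spaCy's list.
BASIC_STOPWORDS = {"the", "and", "are", "was", "for", "that", "this", "with"}


def tokenize(text, stopwords=BASIC_STOPWORDS):
    """Lowercase, keep alphabetic tokens longer than 2 characters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in stopwords]


def fit_lda(docs, num_topics=12, passes=10):
    """Fit an LDA model on tokenized documents (a list of token lists)."""
    from gensim import corpora
    from gensim.models import LdaModel

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=passes, random_state=0)
    return lda, corpus, dictionary


def save_vis(lda, corpus, dictionary, out_path="topicmodeling_vis.html"):
    """Write an interactive pyLDAvis page for the fitted model."""
    import pyLDAvis
    import pyLDAvis.gensim_models  # this module is pyLDAvis.gensim in older releases

    vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis, out_path)


# Hypothetical usage, where texts is a list of plain-text blog posts:
# docs = [tokenize(text) for text in texts]
# lda, corpus, dictionary = fit_lda(docs)
# save_vis(lda, corpus, dictionary)
```

With only about 50 posts, the topics can shift noticeably between runs, so fixing `random_state` helps keep results comparable.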