Topic modelling with Spacy, Gensim and Textacy

The Jupyter notebook 'topic-modelling.ipynb' contains the following sections:

Initialize. Setting up the environment and loading data.
Text extraction. Phrase and token extraction with Gensim and Spacy.
Topic modelling. Using Textacy's LDA model.
Data processing. Calculating data for visualization and export.
Model evaluation. A collection of visualizations of the resulting topics.
Export data. The exported data can be used to create further visualizations or be imported into a graph.

General concept

The emphasis in this notebook is on facilitating an iterative process in which you can easily adjust the stopwords and the number of topics. It also contains features for re-focusing on subtopics, which makes it possible to build a hierarchy of topics.
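The iterative loop described above can be sketched in plain Python. This is only an illustration under assumptions, not the notebook's actual code: the function names, the toy documents and the topic weights are all made up, and the real notebook gets its tokens from the text extraction step and its weights from the LDA model.

```python
from collections import Counter


def filter_tokens(docs, stopwords):
    """Drop stopwords (case-insensitive) from each tokenised document."""
    return [[tok for tok in doc if tok.lower() not in stopwords] for doc in docs]


def most_common_terms(docs, n=5):
    """Return the n most frequent tokens across all documents."""
    counts = Counter(tok.lower() for doc in docs for tok in doc)
    return [term for term, _ in counts.most_common(n)]


def select_subcorpus(docs, doc_topic_weights, topic, min_weight=0.5):
    """Keep documents whose weight for `topic` is at least `min_weight`."""
    return [
        doc
        for doc, weights in zip(docs, doc_topic_weights)
        if weights[topic] >= min_weight
    ]


# Toy, hand-made data (hypothetical); real tokens come from the extraction step.
docs = [
    ["study", "drug", "resistance", "mutation"],
    ["study", "vaccine", "trial", "results"],
    ["study", "drug", "treatment", "results"],
]
stopwords = {"the", "of"}

# Round 1: inspect frequent terms and decide that "study" carries no meaning.
print(most_common_terms(filter_tokens(docs, stopwords), n=3))
stopwords |= {"study"}

# Round 2: re-filter with the extended list before re-running the model.
filtered = filter_tokens(docs, stopwords)

# Re-focus: made-up topic weights per document (one row per document).
doc_topic_weights = [[0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]
sub_docs = select_subcorpus(filtered, doc_topic_weights, topic=0)
print(len(sub_docs), "documents selected for the sub-model")
```

Running the model again on `sub_docs` only is one way to obtain subtopics below topic 0; the 0.5 threshold is an arbitrary choice for this sketch.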

Input

'data-in/tb_data.tsv' contains ~2100 scientific articles with the following properties: doi/title/abstract/keywords.
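A minimal sketch of reading a file like this with Python's standard library. It assumes the TSV has a header row with the column names doi, title, abstract and keywords; the header names in the real file may differ. The function names and the two sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical sample standing in for 'data-in/tb_data.tsv';
# the header names are an assumption, not taken from the real file.
SAMPLE_TSV = (
    "doi\ttitle\tabstract\tkeywords\n"
    "10.0000/example.1\tFirst title\tFirst abstract.\talpha; beta\n"
    "10.0000/example.2\tSecond title\tSecond abstract.\tgamma\n"
)


def load_articles(handle):
    """Read tab-separated rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def combine_text(article):
    """Join the free-text fields of one article into a single string."""
    parts = (article.get(field) or "" for field in ("title", "abstract", "keywords"))
    return " ".join(part for part in parts if part)


articles = load_articles(io.StringIO(SAMPLE_TSV))
texts = [combine_text(article) for article in articles]
print(len(articles), "articles loaded")
```

For the real file, an open file handle can be passed instead of the `io.StringIO` object, for example `open('data-in/tb_data.tsv', newline='', encoding='utf-8')`; the encoding is an assumption.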

Output

Start by looking at the notebook 'topic-modelling.ipynb'. Further down the notebook you will find the 'visualization' section, which gives an overview of the modelling data.

Most of the other files in the output data directory (data-out/) are exported for use as input in other projects. If you want to understand the modelled topics in more detail, look at 'tb_main_doc-top.html' in the output directory, which lists the 15 most relevant articles for each topic.
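A list of the most relevant articles per topic could be derived roughly as follows. This sketch assumes that "relevance" means the document's weight for a topic in the document-topic matrix; the notebook may use a different criterion. The function name, document ids and weights are made up.

```python
import heapq


def top_documents(doc_topic_weights, doc_ids, n=15):
    """Map each topic index to its n highest-weighted (doc_id, weight) pairs."""
    n_topics = len(doc_topic_weights[0])
    ranking = {}
    for topic in range(n_topics):
        scored = (
            (weights[topic], doc_id)
            for weights, doc_id in zip(doc_topic_weights, doc_ids)
        )
        best = heapq.nlargest(n, scored)
        ranking[topic] = [(doc_id, weight) for weight, doc_id in best]
    return ranking


# Made-up weights for four documents and two topics.
doc_ids = ["doc-a", "doc-b", "doc-c", "doc-d"]
doc_topic_weights = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
]
ranking = top_documents(doc_topic_weights, doc_ids, n=2)
for topic, best_docs in ranking.items():
    print(topic, best_docs)
```

With `n=15` and DOIs as document ids, the same function would give the kind of per-topic list described above.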

Caveat

Topic modelling with LDA is a stochastic process: it will produce (slightly) different results even when run on the same data. The exact same results can therefore not be reproduced.
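One way to judge how much two runs differ is to compare the top terms of their topics. The sketch below is an illustration under assumptions and not part of the notebook: it uses Jaccard similarity on made-up term lists, and it matches topics by similarity because the numbering of topics can differ from run to run.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b."""
    matches = []
    for i, terms_a in enumerate(run_a):
        scores = [jaccard(terms_a, terms_b) for terms_b in run_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


# Made-up top terms from two hypothetical runs on the same data.
run_1 = [["drug", "resistance", "treatment"], ["vaccine", "trial", "immune"]]
run_2 = [["vaccine", "immune", "response"], ["drug", "treatment", "therapy"]]

matches = best_matches(run_1, run_2)
for topic_a, topic_b, score in matches:
    print(f"run 1 topic {topic_a} ~ run 2 topic {topic_b}: {score:.2f}")
```

Scores close to 1.0 for every matched pair suggest that the two runs found essentially the same topics.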
