Digital libraries are innovative technologies for knowledge sharing, giving access to vast quantities of documents and information. However, the efficiency of document search systems and their ability to identify the desired and related information are not keeping pace with the ever-increasing volume of stored data. Here we present Network TD-SOM, a systematic process that offers a practical method for searching, visualising, and organising a vast corpus, discovering related documents, and extracting knowledge from it. It combines different topic modelling algorithms, implemented separately, with a two-level hybrid clustering approach and network analysis. We exploit the results of each technique to design various kinds of spatialisation dedicated to different purposes. For example, some uncover how the thematic structure is distributed over time, and others profile the clusters in a corpus. The main visualisation is an interactive corpus network that supports exploration, browsing, navigation, and zooming. It also incorporates the main results of each technique and useful metadata. We evaluated Network TD-SOM on the Master’s theses dataset from NOVA IMS. Both LDA and BERTopic successfully uncovered the thematic structure of the dataset and extracted helpful knowledge from it, but BERTopic proved to be the superior solution. The features/topics extracted with BERTopic also improved the cluster results compared with the features from LDA. The arrangement of the two theses networks was similar to the cluster results. However, the network modelled with features/topics from LDA was better than the one modelled with BERTopic.
Corpus; Visualisation; Topic modelling; Clustering; Network
Coming soon!!
We propose Network TD-SOM, which stands for Network and Self-Organised Maps of the Topic Documents. The approach defines a systematic process (figure below) to cluster the documents in a corpus and represent how they relate to each other. It also makes it possible to extract helpful knowledge and uncover the thematic structure of the corpus without reading every document. Network TD-SOM combines topic modelling algorithms, clustering algorithms, and network analysis. It is composed of 8 main steps: document collection, text description and pre-processing, transformation, topic model training and evaluation, topic model interpretation, document clustering, network analysis, and visualisation.
A thesis in a NOVA IMS Master’s programme can take one of three forms: a dissertation, an internship report, or a work project. The dissertation has been the option most often chosen by students completing their Master’s degree.
The figure below shows the distribution of word counts in the thesis abstracts, together with the annual average word count. Overall, the word counts are approximately normally distributed, and most abstracts fall within the expected range of around 80 to 250 words.
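The statistics behind such a plot can be sketched in a few lines. This is a minimal illustration, not the code used in the project: the `abstracts` records and the helper names `word_count`, `yearly_average_word_count`, and `share_within` are hypothetical, and words are counted by simple whitespace splitting.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (year, abstract text). Real abstracts are far longer.
abstracts = [
    (2019, "land cover classification with machine learning models"),
    (2019, "a data management system for public health"),
    (2020, "topic modelling of master theses abstracts"),
]


def word_count(text):
    """Number of whitespace-separated words in a text."""
    return len(text.split())


def yearly_average_word_count(records):
    """Average abstract length per year, keyed by year."""
    counts_by_year = defaultdict(list)
    for year, text in records:
        counts_by_year[year].append(word_count(text))
    return {year: mean(counts) for year, counts in sorted(counts_by_year.items())}


def share_within(records, low=80, high=250):
    """Fraction of abstracts whose word count lies in [low, high]."""
    counts = [word_count(text) for _, text in records]
    return sum(low <= c <= high for c in counts) / len(counts)


print(yearly_average_word_count(abstracts))
```

The default bounds of `share_within` mirror the 80 to 250 word range mentioned above; the toy abstracts are much shorter, so the function only gives a meaningful share on real data.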
The figure below illustrates the 200 most frequent words in the abstracts. Examples of words frequently mentioned in the abstracts of the Master’s theses from NOVA IMS include "model", "data", "result", "area", "analysis", "system", "based", "project", "information", "study", "land_cover", "algorithm", "management", and "application knowledge".
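A frequency ranking of this kind can be approximated with the standard library alone. The sketch below is illustrative: the stop-word list, the `tokenize` and `top_words` helpers, and the sample texts are assumptions. It does not reproduce the project's pre-processing, for example the phrase detection that yields tokens such as "land_cover".

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "the", "of", "and", "for", "in", "on", "to", "with", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_words(texts, n=200):
    """Return the n most frequent tokens across all texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


# Hypothetical sample abstracts.
sample = [
    "The model uses data",
    "Data analysis of the model results",
    "A data system",
]
print(top_words(sample, n=2))
```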
The figure below reveals the thematic structure of the Master’s theses dataset by showing the average prevalence of the topics for four different courses/specialisations in a given year. Overall, each course/specialisation matches well with its most dominant topic. This indicates that LDA has correctly discovered the latent topics in an unsupervised way.
The links below provide the same information for other courses/specialisations. Each of them also matches well with its most dominant topic:
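The average prevalence plotted in such a figure is the mean of the document-topic weights within each (course, year) group. The following sketch shows that aggregation under stated assumptions: the course names, the two-topic weight vectors, and the helper names `mean_topic_prevalence` and `dominant_topic` are hypothetical, and the weights would in practice come from a fitted topic model.

```python
from collections import defaultdict


def mean_topic_prevalence(docs):
    """Average topic weights per (course, year).

    docs: iterable of (course, year, topic_weights) tuples, where
    topic_weights is one document's topic distribution.
    """
    sums = {}
    counts = defaultdict(int)
    for course, year, weights in docs:
        key = (course, year)
        if key not in sums:
            sums[key] = [0.0] * len(weights)
        sums[key] = [s + w for s, w in zip(sums[key], weights)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}


def dominant_topic(prevalence):
    """Index of the topic with the highest average weight in each group."""
    return {
        key: max(range(len(avg)), key=avg.__getitem__)
        for key, avg in prevalence.items()
    }


# Hypothetical documents: (course, year, [weight of topic 0, weight of topic 1]).
docs = [
    ("Course A", 2020, [0.75, 0.25]),
    ("Course A", 2020, [0.5, 0.5]),
    ("Course B", 2021, [0.25, 0.75]),
]
prevalence = mean_topic_prevalence(docs)
print(dominant_topic(prevalence))
```

Comparing the dominant topic of each group with the course name is the kind of check described above; the same aggregation applies to topic weights from any topic model.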
https://github.com/VMunhangane/NETWORK-TD-SOM-Master-thesis/blob/main/analysis/distribution%20of%20weight%20topics%20by%20courses%20specialisations%20per%20year%20(lda)_graph_2.pdf https://github.com/VMunhangane/NETWORK-TD-SOM-Master-thesis/blob/main/analysis/distribution%20of%20weight%20topics%20by%20courses%20specialisations%20per%20year%20(lda)_graph_3.pdf
The figure below shows the average prevalence of the topics for four different courses/specialisations in a given year. Overall, the courses/specialisations and their dominant topics match. Additionally, the top 25 words of each topic are semantically linked with the names of the courses/specialisations in which that topic is dominant. This indicates that BERTopic has correctly and semantically discovered the latent topics in an unsupervised way.
The links below provide the same information for other courses/specialisations. Each of them also matches well with its most dominant topic:
https://github.com/VMunhangane/NETWORK-TD-SOM-Master-thesis/blob/main/analysis/distribution%20of%20weight%20topics%20by%20courses%20specialisations%20per%20year%20(bertopic)_graph_2.pdf https://github.com/VMunhangane/NETWORK-TD-SOM-Master-thesis/blob/main/analysis/distribution%20of%20weight%20topics%20by%20courses%20specialisations%20per%20year%20(bertopic)_graph_3.pdf
A hybrid two-stage clustering approach was used to find the clusters of theses. In the first stage, a self-organising map (SOM) was trained. In the second stage, Ward's hierarchical clustering was applied to the best matching units (BMUs) found in the first stage. After analysing the dendrograms, solutions with 6 and 5 clusters were selected for the topic vectors/features from LDA and BERTopic, respectively. The figure below shows how the clusters are organised using the features from each topic modelling algorithm.
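The two-stage idea can be sketched in pure Python as follows. This is a simplified illustration rather than the project's implementation: the grid size, number of epochs, learning rate, and all function names are assumptions, the SOM uses a plain Gaussian neighbourhood with linear decay, and the Ward step is a naive agglomeration that merges the pair of clusters with the smallest increase in within-cluster sum of squares.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vec):
    """Index of the SOM unit whose weight vector is closest to vec."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], vec))


def train_som(data, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Stage 1: fit a small rectangular SOM (hyperparameters are assumptions)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [list(rng.choice(data)) for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for vec in rng.sample(data, len(data)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
            bmu = best_matching_unit(weights, vec)
            for i, pos in enumerate(grid):
                h = math.exp(-sq_dist(pos, grid[bmu]) / (2.0 * sigma ** 2))
                weights[i] = [w + lr * h * (x - w) for w, x in zip(weights[i], vec)]
            step += 1
    return weights


def ward_clusters(points, k):
    """Stage 2: naive Ward agglomeration; returns one label per input point."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]["members"]), len(clusters[b]["members"])
                # Ward criterion: increase in within-cluster sum of squares.
                cost = na * nb / (na + nb) * sq_dist(
                    clusters[a]["centroid"], clusters[b]["centroid"]
                )
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [
                (na * x + nb * y) / (na + nb)
                for x, y in zip(ca["centroid"], cb["centroid"])
            ],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    labels = [0] * len(points)
    for label, cluster in enumerate(clusters):
        for i in cluster["members"]:
            labels[i] = label
    return labels


def two_level_clustering(doc_vectors, rows, cols, k, seed=0):
    """Map documents to BMUs, cluster the used BMUs, and label each document."""
    weights = train_som(doc_vectors, rows, cols, seed=seed)
    bmus = [best_matching_unit(weights, v) for v in doc_vectors]
    used = sorted(set(bmus))  # only units that won at least one document
    unit_labels = ward_clusters([weights[u] for u in used], k)
    unit_to_cluster = dict(zip(used, unit_labels))
    return [unit_to_cluster[b] for b in bmus]


# Hypothetical document-topic vectors (three topics).
docs = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.05, 0.9, 0.05],
        [0.1, 0.85, 0.05], [0.05, 0.05, 0.9], [0.05, 0.1, 0.85]]
print(two_level_clustering(docs, rows=3, cols=3, k=3))
```

Clustering the BMU weight vectors instead of the raw documents keeps the hierarchical step small, since the number of map units is usually far lower than the number of documents. Here the number of clusters `k` is passed in directly, whereas the text above chooses it by inspecting the dendrogram.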
##### SOM heatmaps from BERTopic topic vectors

The figure below shows the interlinkages of the Master’s theses from NOVA IMS, modelled using the LDA topic vectors. The arrangement of the network is similar to the cluster results.
Link to access the interactive visualisation: https://vmunhangane.github.io/Thesis_vis/network/
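One simple way to derive such a network from topic vectors is to link two theses whenever their vectors are sufficiently similar. The sketch below assumes cosine similarity with a fixed threshold; this is an illustrative choice, not necessarily how the project's networks were built, and the `cosine` and `build_edges` helpers and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_edges(topic_vectors, threshold=0.8):
    """Return (i, j, similarity) for every document pair at or above the threshold."""
    edges = []
    for (i, u), (j, v) in combinations(enumerate(topic_vectors), 2):
        sim = cosine(u, v)
        if sim >= threshold:
            edges.append((i, j, sim))
    return edges


# Hypothetical topic vectors for three theses.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(build_edges(vectors))
```

With this construction, a lower threshold produces a denser network with more overlapping nodes, while a higher one keeps only the strongest links.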
The figure below shows the interlinkages of the Master’s theses from NOVA IMS, modelled using the BERTopic topic vectors. Overall, the network is dense, and some nodes overlap.
Link to access the interactive visualisation: https://vmunhangane.github.io/Theses_bertopic/network/