Authors:
- Diana Spahieva (developer)
- Ege Özol (product owner)
- Jannes Hollander (developer)
- Kyriakos Koukiadakis (developer)
- Livia Popper (Scrum master & developer)
- Niels Gaastra (developer)
You can see our pipeline in this picture:
Here you can find the Power BI dashboard.
Here is a link to a Google Drive folder where we have stored three models and the input data needed for the pipeline.
-
paragraph_splitting.py
- Here, we split the content of the documents into paragraphs based on their semantic meaning. For lemmatizing and tokenizing the words, we used spaCy's extensive ("lg", meaning large) Dutch model "nl_core_news_lg" and its POS tags, which gives better results. (A short sketch of this step follows this module's description.)
- Input: "data.json"
- Output: "documents_split_into_paragraphs.csv"
-
actor_mapping_to_paragraphs.py
- For this module, we used the NER tags from spaCy's implementation provided by JoinSeven to map the actors and organizations in each paragraph. We also used the abbreviations list provided by JoinSeven to map actors that had been mentioned under slightly different names to a single entity. (A short sketch follows this module's description.)
- Inputs:
- "org_abbreviations.json"
- "data.json"
- "family_names_in_the_netherlands_with_natural_name.csv"
- "documents_split_into_paragraphs.csv"
- Output: "paragraphs_split_actors_organizations.csv"
-
- In this module, we combine the dates of each document and paragraph, the actors and organizations, and the initial data from JoinSeven. Afterwards, we keep only the paragraphs that contain at least one actor, so that every remaining paragraph is meaningful for our network. (A short sketch of the filtering step follows this module's description.)
- Inputs:
- "paragraphs_split_actors_organizations.csv"
- "data.json"
- Output: "actors_organizations_processed_data.csv"
-
Bertopic_modelling.ipynb
- This is our main topic modelling notebook. We use it to train our BERTopic models, again using spaCy's extensive ("lg", meaning large) Dutch model "nl_core_news_lg" and its POS tags for lemmatization and tokenization. We fine-tuned the lightweight pre-trained sentence transformer "all-MiniLM-L6-v2" as well as the roughly five times larger and best-performing "all-mpnet-base-v2" model from Hugging Face. The notebook also provides a topic reduction algorithm that can reduce the number of topics according to the user's preferences. (A short sketch of the training step follows this module's description.)
- Input: "actors_organizations_processed_data.csv"
- Outputs:
- BERTopic model: model_nneighbors15_ncomponents5_cluster_size120_unigram
- "newDF_nneighbors15_ncomponents5_cluster_size120_unigram.csv"
-
- This module contains functions for preparing the output data of the topic modelling notebook for our bipartite visualization in the Power BI dashboard. (A short sketch follows this module's description.)
- Input: "newDF_nneighbors15_ncomponents5_cluster_size120_unigram.csv"
- Outputs:
- "bipart_actors_per_par.csv"
- "bipart_organizations_per_par.csv"
-
bipartite_network_projection.py
- This module contains functions for preparing the bipartite data for our projection of the bipartite network visualization in the Power BI dashboard. (A short sketch follows this module's description.)
- Inputs:
- "bipart_actors_per_par.csv"
- "bipart_organizations_per_par.csv"
- Outputs:
- "projection_actors_per_par.csv"
- "projection_organizations_per_par.csv"
Side note: The whole pipeline took approximately six and a half hours to run on our setup, a laptop with an Intel i7-11800H CPU and an RTX 3050 Ti GPU. If you want to drastically improve the running time, you might consider using a lighter version of spaCy's lemmatizer, such as "nl_core_news_sm", for both paragraph_splitting.py and Bertopic_modelling.ipynb. Furthermore, you might consider using an even lighter pre-trained sentence transformer than "all-MiniLM-L6-v2" in Bertopic_modelling.ipynb, such as "paraphrase-albert-small-v2", which is available at https://www.sbert.net/docs/pretrained_models.html?highlight=sentencetransformer%20korea. However, these tweaks will change, and will probably deteriorate, the performance of the topic models. (A short sketch of these substitutions follows below.)
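The substitutions mentioned above would amount to swapping the model names where they are loaded; the exact speed gains and quality loss will vary with your hardware and data:

```python
# Illustrative only: lighter-weight substitutions for faster runs.
# Requires: python -m spacy download nl_core_news_sm
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("nl_core_news_sm")                                   # instead of "nl_core_news_lg"
embedding_model = SentenceTransformer("paraphrase-albert-small-v2")   # instead of "all-MiniLM-L6-v2"
```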