The proliferation of fake and manipulated news poses a significant threat to the integrity of information dissemination, particularly within the context of Russian media. This paper underscores the critical need for robust assessment frameworks to identify and mitigate the impact of such misinformation. By utilizing a comprehensive dataset, we aim to explore patterns and similarities in fake news narratives propagated by Russian sources.

Through clustering techniques, we analyze the dissemination strategies and thematic consistencies of these false narratives, and compare these clusters with geopolitical regions affected by Russian influence. Such analysis is vital for developing countermeasures and informing policy-making to combat misinformation. Additionally, we create an interactive visualization tool to map the spread of fake news, providing a dynamic representation of data over various time periods and thematic attributes. This visualization enhances our understanding of the data and helps validate our findings.

Furthermore, we evaluate the clusters' relevance by examining their correlation with real-world events and their potential utility in predictive modeling. This study not only highlights the pervasive nature of Russian fake news but also contributes to the development of effective strategies to counteract the dissemination of manipulated information.
Inspired by Simon Sinek's famous TEDx talk "How Great Leaders Inspire Action", we set out to answer three questions:
1. WHY? To master our skills and contribute to developments that strengthen Ukraine's position in the information war.
2. WHAT? Develop a model capable of identifying fakes in news and social media.
3. HOW? Apply modern ML algorithms to fake and propaganda texts.
- Pavlyshenko, B. (2023). Analysis of Disinformation and Fake News Detection Using Fine-Tuned Large Language Model.
- Strubytskyi, R., Shakhovska, N. (2023). Method and models for sentiment analysis and hidden propaganda finding.
- Oliinyk, V., Vysotska, V., Burov, Y., Mykich, K., Basto-Fernandes, V. (2020). Propaganda Detection in Text Data Based on NLP and Machine Learning.
- Zeng, Y., Ding, X., Zhao, Y., Li, X., Zhang, J., Yao, C., Liu, T., Qin, B. (2024). RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict.
The dataset used for training and testing the model consists of fact-checked claims, labeled as either 'Supported' (0) or 'Refuted' (1). The data is preprocessed to remove any rows with missing labels, and the labels are mapped to numeric values for compatibility with machine learning algorithms. The dataset is then split into training and testing sets using an 80-20 split ratio to evaluate the model's performance.
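The steps above can be sketched as follows; note that the column names (`claim`, `label`) and the toy rows are illustrative assumptions, not the dataset's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the fact-checked claims dataset; the column names
# "claim" and "label" are assumptions, not the dataset's real schema.
df = pd.DataFrame({
    "claim": ["Kyiv is the capital of Ukraine", "The moon is made of cheese",
              "Water boils at 100 C at sea level", "The earth is flat"],
    "label": ["Supported", "Refuted", "Supported", "Refuted"],
})

# Drop rows with missing labels, then map labels to numeric values
df = df.dropna(subset=["label"])
df["label"] = df["label"].map({"Supported": 0, "Refuted": 1})

# 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df["claim"], df["label"], test_size=0.2, random_state=42
)
```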
Text data is transformed into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. This method converts the raw text into a matrix of TF-IDF features, which reflects the importance of a word in a document relative to the entire dataset. The `TfidfVectorizer` from the `scikit-learn` library is used for this purpose.
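A small illustration of the vectorization step, using a toy corpus rather than the real dataset; the key point is that the vectorizer is fitted on training texts only and reused for unseen text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus standing in for the training claims
train_texts = [
    "fake news spreads fast",
    "verified report from the field",
    "fast fake claims spread online",
]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)

# Each row is a document, each column a vocabulary term weighted by TF-IDF
print(X_train_tfidf.shape)

# Unseen text is transformed with the already-fitted vocabulary
X_new = vectorizer.transform(["fake report online"])
```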
A LightGBM classifier (`LGBMClassifier`) is trained on the TF-IDF-transformed training data. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, known for its efficiency and high performance on large datasets.
To find the best hyperparameters for the LightGBM model, we use Optuna, a hyperparameter optimization framework. Optuna conducts a series of trials to maximize the model's accuracy on the validation set by tuning parameters such as learning rate, number of leaves, and regularization terms.
The optimized LightGBM model is evaluated on the test set to assess its accuracy. The accuracy scores for both the training and testing sets are calculated and displayed, providing insights into the model's performance and potential overfitting.
The trained LightGBM model and the fitted TF-IDF vectorizer are saved using `joblib`. A Gradio interface is created to deploy the model as a web application, allowing users to input text and receive predictions on whether the text is 'Fake' or 'Not Fake'. The model and vectorizer are loaded in the Gradio app, and the input text is transformed and classified in real time.
The following key Python libraries are used in this implementation:

- `pandas` and `numpy` for data manipulation and numerical operations.
- `scikit-learn` for TF-IDF vectorization, model training, and evaluation.
- `lightgbm` for implementing the LightGBM classifier.
- `optuna` for hyperparameter optimization.
- `joblib` for saving and loading the model and vectorizer.
- `gradio` for creating the web interface to deploy the model.
This methodology ensures a systematic approach to preprocessing, model training, evaluation, and deployment, resulting in an efficient and user-friendly fake news classification tool.
- Fake and real news dataset on kaggle.com
- Propaganda Diary. Vox Ukraine
- Statistical GIS Boundary Files for London
- RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict
To run our notebooks, you should have a folder structure similar to:
```
.
├── data                                 # Folder with data
├── models                               # Folder with models
├── gradio                               # Folder with code for Gradio deployment
├── notebooks                            # Folder with all notebooks
│   ├── data_extractor.ipynb             # Notebook for extracting data from the archive
│   ├── data_visualization.ipynb         # Notebook with data EDA and topic modeling
│   ├── hyperparams_tuning_optuna.ipynb  # Notebook with best model tuning
│   ├── model_deployment.ipynb           # Notebook with deployment to a Gradio temporary URL
│   └── model_selection.ipynb            # Notebook with best model selection
├── .gitignore
├── params.txt                           # Best model params
├── environment.yml                      # Conda environment file
└── README.md
```