/Prigozhin_Audio_Files

Analysis of Yevgeny Prigozhin's audio files on Telegram using the GPT-3.5-turbo API. Part of a data journalism coursework at Imperial College London.

Primary LanguagePython

Analysing Prigozhin's audio messages with LLMs

This repository contains the code to load and analyse over 400 audio messages published by Yevgeny Prigozhin on Telegram channels directly associated with the Wagner group. The transcription of the messages has been curated by the creator of the dataset Giorgio Comai (more information here).
Considering the significant noise present in these transcriptions, especially due to spelling mistakes and the use of both Russian and Ukrainian names for geographical locations, it is difficult to rigorously query the dataset.
An easier way consists in prompting a Large Language Model (LLM), carefully asking it to extract useful information (such as sentiment and geographic locations mentioned) message by message. Because of the limited size of the dataset, performing the analysis with the GPT-3.5-turbo API won't cost more than 20 cents. For such a complex task we can expect the model to make mistakes, requiring some manual cleaning.
The src folder contains functions used for processing the original .csv and subsequent intermediate results, in addition to helper functions used to interact with the OpenAI API or to work on geographical data. The analyses folder contains the main.py file used to perform the LLM analysis, in addition to plot_hist.py and plot_map.py necessary to reproduce the plots in outputs. Lastly, reports contains the $\LaTeX$ file needed to reproduce the .pdf article.