/usc-tg-24-us-election


A billion Telegram messages about the 2024 US presidential election

Releases

v2 (12-02-24)

v1 (11-01-24)

The collection will continue at least until the end of the year and the dataset will be updated every month.

Feel free to reach out with any questions!

Data usage agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license and cite the following manuscript:

Leonardo Blas, Luca Luceri, and Emilio Ferrara. Unearthing a Billion Telegram Posts about the 2024 U.S. Presidential Election: Development of a Public Dataset. 2024. https://doi.org/10.48550/arXiv.2410.23638.

Instructions

If you have files like scraped_part_* (some distributions may instead feature a single scraped.tar.zst), combine them like:

cat scraped_part_* > scraped.tar.zst

Once you have scraped.tar.zst, decompress like:

tar --use-compress-program=unzstd -xvf scraped.tar.zst

As mentioned in the paper, some Telegram objects in the SQLite databases were JSON-serialized, UTF-8 encoded, and zlib compressed; a version of this dataset in which all zlib-compressed entries are decompressed may consume thrice as much space. If storage is a concern, it is recommended to decompress and analyze Telegram objects at runtime. If you still wish to decompress a .db file, you can use decompress.py.
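As a minimal sketch of runtime decompression, the helper below reverses the three steps described above (zlib decompression, UTF-8 decoding, JSON parsing) for a single value. The function name `decode_telegram_object` is hypothetical, and the table and column names that hold compressed values are not given here, so the example builds its own sample blob rather than querying a .db file. It is not taken from decompress.py.

```python
import json
import zlib


def decode_telegram_object(blob):
    """Hypothetical helper: zlib-decompress, UTF-8 decode, then JSON-parse a blob."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Build a sample blob the same way the stored objects are described:
# JSON-serialized, UTF-8 encoded, zlib compressed.
sample_object = {"id": 1, "message": "example post"}
sample_blob = zlib.compress(json.dumps(sample_object).encode("utf-8"))

decoded = decode_telegram_object(sample_blob)
print(decoded["message"])
```

Applying a helper like this to each value as it is read from SQLite keeps the on-disk database compressed, at the cost of some CPU time per row.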

Top chats

The N top chats, ranked by unique incoming edge count (a proxy metric for influence), can be determined using chats.db and get_top_chats.py. The default N is 500, but this can be changed within the script.
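To illustrate the ranking metric, the sketch below counts, for each chat, how many distinct chats point to it and returns the N highest. It assumes edges are available as (source, target) pairs; how edges are stored in chats.db is not described here, and the function name `top_chats_by_unique_incoming_edges` is hypothetical. It is an illustration of the metric, not the code in get_top_chats.py.

```python
from collections import defaultdict


def top_chats_by_unique_incoming_edges(edges, n=500):
    """Rank chats by the number of distinct source chats pointing to them."""
    sources_by_target = defaultdict(set)
    for source, target in edges:
        # A set ignores repeated edges from the same source chat.
        sources_by_target[target].add(source)

    # Sort by descending unique-source count; break ties by chat identifier.
    ranked = sorted(
        sources_by_target.items(),
        key=lambda item: (-len(item[1]), str(item[0])),
    )
    return [(chat, len(sources)) for chat, sources in ranked[:n]]


# Toy edge list with made-up chat identifiers.
example_edges = [("a", "c"), ("b", "c"), ("a", "c"), ("a", "b")]
print(top_chats_by_unique_incoming_edges(example_edges, n=2))
```

Counting distinct sources rather than raw edges means a chat that is referenced many times by one other chat does not outrank a chat that is referenced once each by many chats.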

Acknowledgements

To all torrent seeders worldwide, thank you for mirroring the dataset! It's greatly appreciated!
Additionally, we thank AcademicTorrents.com for making our data available worldwide!