PersianTelegramData

Persian Telegram data gathered from 8 July 2021 to 22 July 2021.

This dataset contains six columns:
• context: the text of the message that was sent
• sender_username: ID of the Telegram channel
• sender_name: name of the Telegram channel
• keywords: list of keywords detected in the context
• hashtags: hashtags used in the context
• send_time: time the message was sent, as a UTC datetime
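
As a rough illustration of the columns above, one row could be modelled as follows. This is only a sketch: the Python types, the ISO 8601 timestamp format, the leading "#" on hashtags, and the example values are all assumptions made for illustration, not something the dataset guarantees.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class TelegramMessage:
    """One dataset row; field names follow the column list above."""
    context: str
    sender_username: str
    sender_name: str
    keywords: List[str]
    hashtags: List[str]
    send_time: datetime


def parse_send_time(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed format) and return it in UTC."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # send_time is documented as UTC, so naive values are labelled as UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Hypothetical row, invented for illustration only.
example = TelegramMessage(
    context="متن نمونه #خبر",
    sender_username="example_channel",
    sender_name="Example Channel",
    keywords=["متن", "نمونه"],
    hashtags=["#خبر"],
    send_time=parse_send_time("2021-07-08 12:30:00"),
)
```

Keeping send_time timezone-aware avoids accidental mixing with local times when filtering by the collection window.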

How To Detect Keywords:
We use BERT (a contextualized word embedding model based on the Transformer) to convert words into meaningful vectors. The candidate words with the highest cosine similarity to the context are chosen as keywords. To do this, we preprocess the context and extract candidate words. Preprocessing consists of these steps:

• Normalizing the context using the Hazm library
• Tokenizing
• Using a POS tagger to find verbs in the context
• Detecting stop words (neither a word nor its stemmed form may be a stop word)

Any word that is not a verb, a stop word, or a number can be a keyword.
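
The filtering and ranking described above can be sketched roughly as follows. This is not the code used to build the dataset. It assumes tokens and POS tags have already been produced (for example by Hazm), that verbs carry a tag such as "V" or "VERB", and that some `embed` function returns a vector per word (BERT in the original setup). The function names, the tag set, the toy English tokens, and the two-dimensional vectors are all hypothetical stand-ins.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Sequence, Set

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_words(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    stop_words: Set[str],
    stem: Callable[[str], str],
    verb_tags: FrozenSet[str] = frozenset({"V", "VERB"}),  # assumed tag names
) -> List[str]:
    """Keep tokens that are not verbs, stop words (raw or stemmed), or numbers."""
    candidates: List[str] = []
    for token, tag in zip(tokens, pos_tags):
        if tag in verb_tags:
            continue
        if token in stop_words or stem(token) in stop_words:
            continue
        if token.isdigit():  # digit-only tokens, including Persian digits
            continue
        if token not in candidates:  # drop repeats, keep first-seen order
            candidates.append(token)
    return candidates


def top_keywords(
    context_vector: Vector,
    candidates: Sequence[str],
    embed: Callable[[str], Vector],
    k: int = 5,
) -> List[str]:
    """Return the k candidates most similar to the context vector."""
    scored = [(cosine_similarity(embed(w), context_vector), w) for w in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]


# Toy data, invented for illustration; real input would be Persian text.
tokens = ["market", "rose", "the", "12", "prices"]
pos_tags = ["N", "V", "DET", "NUM", "N"]
toy_vectors: Dict[str, Vector] = {"market": [1.0, 0.0], "prices": [0.6, 0.8]}

candidates = candidate_words(tokens, pos_tags, {"the"}, stem=lambda w: w)
keywords = top_keywords([1.0, 0.2], candidates, embed=toy_vectors.__getitem__, k=1)
```

Passing `embed` and `stem` as parameters keeps the sketch independent of any particular model or stemmer, so a BERT encoder and the Hazm stemmer could be plugged in without changing the ranking logic.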

rominaoji/PersianTelegramData
