Persian Telegram Data gathered from 8 July 2021 to 22 July 2021
This dataset contains six columns:
• context: the text of the message that was sent
• sender_username: ID of the Telegram channel
• sender_name: name of the Telegram channel
• keywords: list of keywords detected in the context
• hashtags: hashtags used in the context
• send_time: time the message was sent, as a UTC datetime
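For orientation, the six columns can be pictured as one typed record per message. This is only a sketch: the class name `TelegramMessage`, the Python types (lists of strings for keywords and hashtags, a timezone-aware `datetime` for send_time), and every value in the example row are assumptions made for illustration, not taken from the dataset files.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import List


@dataclass
class TelegramMessage:
    """One row of the dataset (hypothetical type; field names follow the columns)."""
    context: str            # text of the message
    sender_username: str    # ID of the Telegram channel
    sender_name: str        # name of the Telegram channel
    keywords: List[str]     # assumed to be a list of strings
    hashtags: List[str]     # assumed to be a list of strings
    send_time: datetime     # assumed to be a timezone-aware UTC datetime


# Invented placeholder row, NOT real data from the dataset.
example = TelegramMessage(
    context="placeholder message text #news",
    sender_username="example_channel",
    sender_name="Example Channel",
    keywords=["placeholder"],
    hashtags=["#news"],
    send_time=datetime(2021, 7, 8, 12, 0, tzinfo=timezone.utc),
)

column_names = [f.name for f in fields(TelegramMessage)]
print(column_names)
```

How the columns are actually serialized (for example, whether the keyword list is stored as a string) depends on the file format, so a loader may need an extra parsing step before building such records.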
How Keywords Are Detected:
We use BERT (a contextualized word embedding model based on the Transformer) to convert words into meaningful vectors. The candidate words whose vectors have the highest cosine similarity to the vector of the context are selected as keywords. To do this, we first preprocess the context and extract candidate words from it.
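The ranking step can be sketched in a few lines. The example below is a minimal illustration, not the original pipeline: the tiny 3-dimensional vectors stand in for real BERT embeddings, the function names `cosine_similarity` and `rank_keywords` are hypothetical, and the `top_n` cutoff is an assumption, since the description does not say how many keywords are kept.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def rank_keywords(context_vec, candidate_vecs, top_n=5):
    """Return the top_n candidate words most similar to the context vector."""
    scored = [
        (word, cosine_similarity(vec, context_vec))
        for word, vec in candidate_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


# Toy vectors standing in for BERT embeddings (assumption for illustration).
context_vec = [1.0, 0.0, 1.0]
candidate_vecs = {
    "alpha": [1.0, 0.0, 1.0],  # same direction as the context
    "beta": [1.0, 1.0, 0.0],   # partially aligned with the context
    "gamma": [0.0, 1.0, 0.0],  # orthogonal to the context
}

top = rank_keywords(context_vec, candidate_vecs, top_n=2)
print(top)
```

In a real run, the context vector and the candidate vectors would both come from the same BERT model so that their similarities are comparable.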
Preprocessing consists of these steps:
- Normalizing the context using the Hazm library
- Tokenizing
- Using a POS tagger to find the verbs in the context
- Detecting stop words (neither a word nor its stemmed form may be a stop word)
Any word that is not a verb, a stop word, or a number can be a keyword.
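The candidate filter described above can be sketched as a small function. This is an illustration under stated assumptions: it takes tokens that have already been normalized, tokenized, and POS-tagged, the function name `candidate_words` is hypothetical, the verb tag `"V"` is assumed and may differ between taggers, and the toy English tokens and toy stemmer only stand in for Persian text and a real stemmer.

```python
def candidate_words(tagged_tokens, stop_words, stem=lambda w: w, verb_tag="V"):
    """Keep words that are not verbs, stop words, or numbers.

    tagged_tokens: list of (word, pos_tag) pairs from a POS tagger.
    stop_words:    set of stop words.
    stem:          stemming function (identity by default; an assumption).
    verb_tag:      tag the tagger uses for verbs (assumed to be "V").
    """
    candidates = []
    for word, tag in tagged_tokens:
        if tag == verb_tag:
            continue  # verbs cannot be keywords
        if word in stop_words or stem(word) in stop_words:
            continue  # neither the word nor its stem may be a stop word
        if word.isdigit():
            continue  # simplified number check; str.isdigit also accepts Persian digits
        candidates.append(word)
    return candidates


# Toy input, invented for illustration only.
tagged = [("cats", "N"), ("run", "V"), ("the", "DET"), ("2021", "NUM"), ("city", "N")]
stop_words = {"the", "cat"}


def toy_stem(word):
    """Very rough stand-in for a stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


result = candidate_words(tagged, stop_words, stem=toy_stem)
print(result)
```

Here "cats" is dropped because its stem "cat" is a stop word, which is the case the stemmed-form rule is meant to catch.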