PersianTelegramData

Persian Telegram Data gathered from 8 July 2021 to 22 July 2021

This dataset contains six columns:
context: the text of the message
sender_username: the username (ID) of the Telegram channel
sender_name: the display name of the Telegram channel
keywords: the list of keywords detected in the context
hashtags: the hashtags used in the context
send_time: the time the message was sent, as a UTC datetime
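
As a rough illustration of the column layout, the sketch below parses one made-up row into typed Python values. The CSV format, the `|` separator for list-valued columns, the timestamp format, and the sample values are all assumptions for the example; this README does not specify how the data is stored.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample: one header line and one invented row.
# The real file name, delimiter, and list encoding are not given in this README.
SAMPLE = (
    "context,sender_username,sender_name,keywords,hashtags,send_time\n"
    "hello world #news,example_channel,Example Channel,hello|world,#news,"
    "2021-07-08 12:30:00\n"
)


def split_list(field, sep="|"):
    """Split a list-valued field; an empty field gives an empty list."""
    return [item for item in field.split(sep) if item]


def parse_row(row):
    """Convert the raw string fields of one row into typed values."""
    sent = datetime.strptime(row["send_time"], "%Y-%m-%d %H:%M:%S")
    return {
        "context": row["context"],
        "sender_username": row["sender_username"],
        "sender_name": row["sender_name"],
        "keywords": split_list(row["keywords"]),
        "hashtags": split_list(row["hashtags"]),
        # send_time is documented as UTC, so attach the UTC timezone explicitly.
        "send_time": sent.replace(tzinfo=timezone.utc),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Attaching the timezone explicitly keeps later comparisons with other timezone-aware timestamps from failing.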

How To Detect Keywords:
We use BERT (a contextualized word embedding model based on the Transformer) to convert words into meaningful vectors. The candidate words with the highest cosine similarity to the context are chosen as keywords. To do this, we preprocess the context and extract candidate words. Preprocessing consists of these steps:

  1. Normalizing the context using the Hazm library
  2. Tokenizing
  3. Using a POS tagger to find verbs in the context
  4. Detecting stop words (neither a word nor its stemmed form may be a stop word)

Words that are not verbs, stop words, or numbers are keyword candidates.
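
The filtering and ranking described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the tokens, POS tags, and stems would come from Hazm and the vectors from BERT, whereas here they are small hand-written placeholders. The tag value `"V"`, the function names, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_candidate(token, pos_tag, stem, stop_words):
    """A token is a candidate if it is not a verb, a stop word, or a number."""
    if pos_tag == "V":  # assumed tag for verbs; depends on the tagger used
        return False
    if token in stop_words or stem in stop_words:
        return False
    if token.isdigit():
        return False
    return True


def top_keywords(context_vec, word_vecs, k=1):
    """Return the k candidate words most similar to the context vector."""
    ranked = sorted(
        word_vecs,
        key=lambda w: cosine_similarity(word_vecs[w], context_vec),
        reverse=True,
    )
    return ranked[:k]


# Toy input: (token, POS tag, stem) triples standing in for Hazm output.
tagged = [
    ("دولت", "N", "دولت"),
    ("اعلام", "N", "اعلام"),
    ("کرد", "V", "کرد"),
    ("و", "CONJ", "و"),
    ("2021", "NUM", "2021"),
]
stop_words = {"و"}

candidates = [t for t, pos, stem in tagged if is_candidate(t, pos, stem, stop_words)]

# Toy 2-dimensional vectors standing in for BERT embeddings.
vectors = {"دولت": [0.9, 0.1], "اعلام": [0.2, 0.8]}
context_vec = [1.0, 0.2]

keywords = top_keywords(context_vec, {w: vectors[w] for w in candidates}, k=1)
```

In this toy run the verb, the stop word, and the number are filtered out, and the remaining candidate whose vector points closest to the context vector is returned as the keyword.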