Grouping and Ranking Twitter posts
Group copies of the same Twitter post that appear many times because of retweets and comments
Rank these groups of Twitter posts by popularity
Constraints:
- Fields in input data - tweet, created_at, author's followers_count, author's screen_name
- No other data from twitter.com can be used
Output:
- Popularity rank score for each group
- Output file - groups are saved in rank order
- Output file columns - group_no, no_of_followers, no_of_tweets, engagement_time_in_seconds, max_date, min_date, no_of_tweets_scaled, no_of_followers_scaled, engagement_time_in_seconds_scaled, rank_score, tweets
- rank_score = no_of_tweets_scaled + no_of_followers_scaled + engagement_time_in_seconds_scaled
Note: Output is printed to the console and saved as a CSV file
Note: The Python code is PEP 8 compliant
Python 2.7
Main Libraries Used -
- pandas
- fuzzywuzzy
- ijson
- scikit-learn
$ git clone https://github.com/ayushaggar/twitter_post_rank.git
$ cd twitter_post_rank
$ pip install -r requirements.txt
For Output - Task 1 and Task 2
$ python model.py
Processing techniques used -
- Lower Case - convert all tweets to lower case
- Date format - convert created_at to date format
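The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the code in model.py: the helper name `preprocess` is hypothetical, and the `created_at` layout (for example "Wed Oct 10 20:19:24 +0000 2018", always with a +0000 offset) is an assumption about the input data.

```python
from datetime import datetime

# Assumed layout of created_at; the "+0000" offset is matched literally.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def preprocess(record):
    """Return a copy of one input record with the tweet lower-cased
    and created_at parsed into a datetime object."""
    cleaned = dict(record)
    cleaned["tweet"] = record["tweet"].lower()
    cleaned["created_at"] = datetime.strptime(
        record["created_at"], CREATED_AT_FORMAT)
    return cleaned


example = preprocess({
    "tweet": "RT @ This Is A Task Number 1",
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "followers_count": 120,
    "screen_name": "someone",
})
```

If the real data carries other offsets or another date layout, the format string would need to change accordingly.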
Sentence Similarity -
Similarity between tweets is computed with partial_ratio from fuzzywuzzy, which tolerates small typos. It uses Levenshtein distance, which is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
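To make the definition concrete, here is a small pure-Python sketch of the Levenshtein distance. It only illustrates the metric; it is not how fuzzywuzzy computes partial_ratio internally, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


typo_distance = levenshtein("this is a task number 1",
                            "this is a task number 1!")
classic_distance = levenshtein("kitten", "sitting")
```

Here `typo_distance` is 1 (one inserted character) and `classic_distance` is 3, so a trailing "!" barely changes the score, which is why small typos do not split a group.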
Cases considered -
Tweets are put in the same group when -
- Only one or two retweets of a post are present in the provided data and the original tweet is not.
- Something is added before the post, such as RT @ or MT @ or both.
- They differ only in capitalization, punctuation, or white-spacing.
Word order must be the same for two tweets to match.
These are different groups
RT @ this is a task number 1
RT @ this is a number 1 task
These are the same group
RT @ this is a task number 1
RT @ this is a task number 1!
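The cases above can be sketched with a simple normalization step. This is a simplified illustration under stated assumptions: it groups by exact match after normalization instead of the fuzzywuzzy partial_ratio comparison the project uses, and the names `normalize`, `group_tweets` and the prefix pattern are hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of leading retweet markers: "RT @user:", "MT @", and so on,
# possibly repeated. Applied to lower-cased text.
PREFIX = re.compile(r"^(?:(?:rt|mt)\s*@\w*:?\s*)+")


def normalize(tweet):
    """Lower-case, drop leading RT @ / MT @ markers, remove punctuation
    and collapse white-space. Word order is left untouched."""
    text = tweet.lower().strip()
    text = PREFIX.sub("", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def group_tweets(tweets):
    """Group tweets whose normalized text is identical."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[normalize(tweet)].append(tweet)
    return list(groups.values())


groups = group_tweets([
    "RT @ this is a task number 1",
    "RT @ this is a number 1 task",
    "RT @ this is a task number 1!",
])
```

With the three example tweets this yields two groups: the first and third tweets together, and the reordered tweet on its own.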
Scaling - The features are scaled to a common range before they are summed into the rank score.
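A minimal sketch of scaling followed by the rank_score sum is shown below. Min-max scaling to the range 0 to 1 is an assumption here; the project uses scikit-learn for scaling, and this text does not say which scaler. The function names are hypothetical, and plain Python is used so the example has no dependencies.

```python
FEATURES = ("no_of_tweets", "no_of_followers", "engagement_time_in_seconds")


def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)  # all values equal: nothing to separate
    return [(v - low) / float(high - low) for v in values]


def rank_groups(groups):
    """Add the *_scaled columns and rank_score to each group and
    return the groups sorted from most to least popular."""
    ranked = [dict(g) for g in groups]
    for feature in FEATURES:
        scaled = min_max_scale([g[feature] for g in ranked])
        for g, value in zip(ranked, scaled):
            g[feature + "_scaled"] = value
    for g in ranked:
        g["rank_score"] = sum(g[f + "_scaled"] for f in FEATURES)
    ranked.sort(key=lambda g: g["rank_score"], reverse=True)
    return ranked


ranked = rank_groups([
    {"group_no": 1, "no_of_tweets": 1, "no_of_followers": 100,
     "engagement_time_in_seconds": 0},
    {"group_no": 2, "no_of_tweets": 3, "no_of_followers": 300,
     "engagement_time_in_seconds": 3600},
    {"group_no": 3, "no_of_tweets": 2, "no_of_followers": 200,
     "engagement_time_in_seconds": 1800},
])
```

Because every scaled feature lies between 0 and 1, each one contributes equally to rank_score, whose value then lies between 0 and 3.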
Different features extracted -
- total no_of_followers
- no_of_tweets
- engagement_time_in_seconds
- max_date of tweet in that group
- min_date of tweet in that group
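The feature list above can be sketched for a single group as follows. Two points are assumptions rather than facts from the project: total no_of_followers is taken as the sum of followers_count over the tweets in the group, and engagement_time_in_seconds is taken as max_date minus min_date. The function name `group_features` is hypothetical.

```python
from datetime import datetime


def group_features(tweets):
    """Compute the per-group features from a list of tweet records.
    Each record needs a datetime created_at and a followers_count."""
    dates = [t["created_at"] for t in tweets]
    max_date, min_date = max(dates), min(dates)
    return {
        # Assumption: followers are summed over every tweet in the group.
        "no_of_followers": sum(t["followers_count"] for t in tweets),
        "no_of_tweets": len(tweets),
        # Assumption: engagement time spans first to last tweet.
        "engagement_time_in_seconds": (max_date - min_date).total_seconds(),
        "max_date": max_date,
        "min_date": min_date,
    }


features = group_features([
    {"created_at": datetime(2018, 10, 10, 20, 0, 0), "followers_count": 100},
    {"created_at": datetime(2018, 10, 12, 20, 0, 0), "followers_count": 250},
])
```

For this two-tweet group the result has 350 followers, 2 tweets and an engagement time of 172800 seconds (two days).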
Popularity - What makes a group popular? It could be the group with the highest number of followers, the group with the highest number of retweets in the provided data, or a tweet that is still being retweeted two days later. Popularity depends mainly on the user profile, so a combination of all three is used. The problem gets complicated quickly.