twitter_post_rank: A Python repository from scharger

Objective

Grouping and Ranking Twitter posts

Grouping same Twitter post appearing many times due to retweets and comments

Ranking these groups of Twitter posts by popularity

Constraints :

Parameters in input data - tweet, created_at, author's followers_count, author's screen_name
No other data from twitter.com can be used

Output

Popularity Rank score for each group
Output file - groups are saved rank wise
Output file parameter - group_no, no_of_followers, no_of_tweets, engagement_time_in_seconds, max_date, min_date, no_of_tweets_scaled, no_of_followers_scaled, engagement_time_in_seconds_scaled, rank_score, tweets
rank_score = no_of_tweets_scaled + no_of_followers_scaled + engagement_time_in_seconds_scaled

Note: Output is shown as print and saved as csv file

Note: Python code is pep8 compliant

Tools use

Python 2.7

Main Libraries Used -

pandas
fuzzywuzzy
ijson
scikit-learn

Installing

$ git clone https://github.com/ayushaggar/twitter_post_rank.git
$ cd twitter_post_rank
$ pip install -r requirements.txt

For Output - Task 1 and Tak 2

$ python model.py

Various Steps in approach are -

Processing techniques used -
1. Lower Case - convert all tweets to lower case
2. Date format - convert created_at to date format
Sentence Similarity -

It is find by considering small typos by using partial_ratio in fuzzywuzzy. It uses Levenshtein distance which is a is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
Cases considered -
1. It groups -
  
  If 2 or 1 retweet post is there but not original tweet in data provided.
  
  If anything is added before post like RT @ or MT @ or both.
  
  Capitalization, punctuation, or white-spacing is taken care of
2. Position of words should be fixed.
3. These are different group
  
  RT @ this is a task number 1
  
  RT @ this is a number 1 task
4. These are same group
  
  RT @ this is a task number 1
  
  RT @ this is a task number 1!
Scaling - To get rank score scaling is used to bring all features at same scale.

Different features extracted -
1. total no_of_followers
2. no_of_tweets
3. engagement_time_in_seconds
4. max_date of tweet in that group
5. min_date of tweet in that group
Popularity - So what will be popular group? One which has highest number of followers, group which have hughest number of retweet in data provided or the tweet which is retweet after two days also? Popularity mainly depend on user profile. So a combination of all is chosen. The problem gets complicated pretty quickly.

scharger/twitter_post_rank

Objective

Tools use

Installing

Various Steps in approach are -