
How to find out who's popular for a particular group of Twitter users such as the Hacker News community.


Twitter Most Followed

Rationale

How to Find Out Who's Popular on Twitter. And why there's no point in doing it

It’s easy if you consider the whole Twitterverse. You just look at the number of followers, and you’ll get @katyperry, @justinbieber, and @BarackObama. No surprise there, right?

But what if you want to focus on a particular group of Twitter users? Let’s take the Hacker News community. Which accounts do HNers follow the most? This is not a trivial exercise and we need a different approach, but if you’re an HNer, the result will be just as predictable.

Read the whole story here: https://medium.com/@ducu/how-to-find-out-whos-popular-on-twitter-d659884fd942

Approach

Let's take the Hacker News community as our target group.

There is a Twitter account named @newsyc20 (Hacker News 20 - "Tweeting Hacker News stories as soon as they reach 20 points."). We treat this account as our source, and the followers of @newsyc20 as the HNers, our target group. To find the accounts most followed by our target group members, we get the complete list of accounts each member is following (aka friends), then we aggregate all those lists, counting the occurrences of each friend.

As you might expect, the most followed account will be @newsyc20 itself, because all the members are following it. But who is in 2nd and 3rd place? Who is in the top 100 most followed? This is what we're going to find out by running the routine below, an excerpt from main.py:

# Step 1: Identify the source
source_name = 'newsyc20' # target group source
source_id, source_data = load_user_data(screen_name=source_name)

# Step 2: Load target group members
followers = load_followers(source_id) # target group

# Step 3: Load friends of target group members
for follower_id in followers:
	load_friends(user_id=follower_id)

# Step 4: Aggregate friends into top most followed
aggregate_friends() # count friend occurrences
top_most_followed(100) # display results

Requirements

The Python script in this repo uses the Twitter REST API to get the data, and Redis to store and aggregate it. To use the Twitter API you need an existing application and some access tokens.

There are a couple of performance issues, though, when dealing with big data sets.

Twitter API Rate Limits

We have 13.3K members in our HNers target group. In order to load the friends for each of these members (step 3), we're calling the Twitter API friends/ids method. This method is rate limited to 15 calls per 15-minute window per token, which averages out to one call per minute. We have to perform about 15.3K calls, more than one per member, since one call returns at most 5000 items and members with more friends need several calls. The problem is that with a single access token, it would take about 10 days and 16 hours to get all this data.
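
The arithmetic behind that estimate can be sketched as follows (Python 3). The constants come from the paragraph above; the function names are made up for illustration, and the model is idealized: it ignores network latency and assumes calls are spread evenly over the tokens.

```python
import math

CALLS_PER_WINDOW = 15   # friends/ids calls allowed per window, per token
WINDOW_MINUTES = 15     # length of one rate-limit window
IDS_PER_CALL = 5000     # maximum ids returned by one call

def calls_for_member(friend_count):
    """Calls needed to page through one member's friend ids (at least one)."""
    return max(1, math.ceil(friend_count / IDS_PER_CALL))

def retrieval_minutes(total_calls, tokens=1):
    """Minutes needed for total_calls, spread evenly over the given tokens."""
    calls_per_minute = tokens * CALLS_PER_WINDOW / WINDOW_MINUTES
    return total_calls / calls_per_minute

minutes = retrieval_minutes(15300)
days, remainder = divmod(minutes, 24 * 60)
print("%d days %d hours" % (days, remainder // 60))  # 10 days 15 hours
```

Since "about 15.3K" is a rounded figure, the result lands within an hour of the estimate quoted above. Calling retrieval_minutes(15300, tokens=48) gives roughly 319 minutes, a little over 5 hours.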

Tweepy is the preferred Twitter API client for Python, but the current release works with a single access token. I created a fork that extends Tweepy so it rotates transparently through several access tokens in round-robin fashion: https://github.com/svven/tweepy. Using about four dozen tokens, the overall retrieval time dropped to 5 hours (2 hours of work time, 3 hours of sleep). With about a hundred tokens added to RateLimitHandler, you would get maximum efficiency out of a single Tweepy API object. See how it's done in get_api() from twitter.py.
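
The round-robin idea can be illustrated with a small standalone sketch (Python 3). This is not the RateLimitHandler from the fork and does not use Tweepy at all; the class RoundRobinTokens, its parameters, and the token names are assumptions made up for this example. Time is passed in explicitly as seconds so the behaviour is easy to follow.

```python
import itertools

class RoundRobinTokens:
    """Hypothetical token pool that hands out tokens in rotation.

    Each token may be used calls_per_window times per window_seconds.
    """

    def __init__(self, tokens, calls_per_window=15, window_seconds=900):
        self.tokens = list(tokens)
        self.calls_per_window = calls_per_window
        self.window_seconds = window_seconds
        self._cycle = itertools.cycle(self.tokens)
        self._used = {t: 0 for t in self.tokens}
        self._window_start = {t: 0.0 for t in self.tokens}

    def acquire(self, now):
        """Return (token, 0.0), or (None, seconds_to_wait) if all are spent."""
        for _ in range(len(self.tokens)):
            token = next(self._cycle)
            # Start a fresh window for this token once the old one expired
            if now - self._window_start[token] >= self.window_seconds:
                self._window_start[token] = now
                self._used[token] = 0
            if self._used[token] < self.calls_per_window:
                self._used[token] += 1
                return token, 0.0
        # Every token is exhausted: report the time until the earliest reset
        wait = min(self._window_start[t] + self.window_seconds - now
                   for t in self.tokens)
        return None, wait

# Two tokens with two calls each per window, to keep the demo short
pool = RoundRobinTokens(["a", "b"], calls_per_window=2)
order = [pool.acquire(now=0)[0] for _ in range(4)]
exhausted = pool.acquire(now=0)
after_reset = pool.acquire(now=900)
print(order, exhausted, after_reset)
```

In this sketch the caller sleeps for the returned wait time once all tokens are spent, which is where the "sleep" portion of the retrieval time comes from.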

Redis ZUNIONSTORE

After storing all this data, we have about 12.4K simple sets of friend ids in Redis, one set for each of our target group members. We are almost 1K sets short because that many accounts are protected, so we cannot get their friends from the Twitter API. There's an average of 1.3K items per set, ranging from 1 to a maximum of 8.6M items, for a total of 16.2M items.

Aggregating all these sets can be easily done using the ZUNIONSTORE command with the default weight of 1. See the RedisStorage.set_most_followed() method in storage.py. The problem is that for this workload, ZUNIONSTORE took more than 1 hour to execute on my 4GB machine. That was surprisingly slow, given that I was running a recent stable release of Redis, version 2.8.9.
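
As a rough illustration of that aggregation (not the actual set_most_followed() implementation), the sketch below issues ZUNIONSTORE and ZREVRANGE through any client exposing redis-py style zunionstore and zrevrange methods. The key names and the InMemoryStandIn class are assumptions made up for this example; the stand-in only mimics the two calls used here so the sketch runs without a Redis server.

```python
def top_most_followed(client, friend_set_keys, dest_key, n):
    """Union the friend sets into a sorted set and return the top n entries."""
    # Plain sets passed to ZUNIONSTORE count each member with score 1, and
    # the default SUM aggregate adds those scores up per member.
    client.zunionstore(dest_key, friend_set_keys)
    return client.zrevrange(dest_key, 0, n - 1, withscores=True)

class InMemoryStandIn:
    """Demo-only stand-in for a Redis client (hypothetical, not redis-py)."""

    def __init__(self, sets):
        self.sets = sets      # key -> set of members
        self.zsets = {}       # key -> {member: score}

    def zunionstore(self, dest, keys):
        scores = {}
        for key in keys:
            for member in self.sets.get(key, ()):
                scores[member] = scores.get(member, 0.0) + 1.0
        self.zsets[dest] = scores
        return len(scores)

    def zrevrange(self, name, start, end, withscores=False):
        # Highest score first; handles non-negative indexes only
        ranked = sorted(self.zsets[name].items(),
                        key=lambda pair: (pair[1], pair[0]), reverse=True)
        page = ranked[start:end + 1]
        return page if withscores else [member for member, _ in page]

# Hypothetical key layout: one set of friend ids per target group member
sets = {
    "friends:101": {"1", "2", "3"},
    "friends:102": {"1", "3"},
    "friends:103": {"1"},
}
client = InMemoryStandIn(sets)
top = top_most_followed(client, sorted(sets), "most_followed", 2)
print(top)  # [('1', 3.0), ('3', 2.0)]
```

With a real server you would pass a redis-py client such as redis.StrictRedis() instead of the stand-in.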

It turned out that a performance patch for this command was recently added, but it is only available in the beta 8 release of Redis 3.0.0. You can read about it in the release notes. After installing this release, running ZUNIONSTORE on the same data set took less than 2 minutes.


To conclude, in order to run this exercise for a big data set, make sure you have a bunch of access tokens that you can use in config.py, and install Redis 3.0.0 beta 8. Then run pip install -r requirements.txt in your virtualenv so you have the following packages:

hiredis==0.1.4
redis==2.10.1
git+https://github.com/svven/tweepy.git#egg=tweepy

Credits

Thanks to Jeff Miller (@JeffMiller) for @newsyc20. It's one of the best Hacker News Twitter bots. Jeff actually did a similar analysis on the Hacker News community, but with a slightly different approach.

Many thanks also to Josiah Carlson (@dr_josiah) for his valuable support on Redis related issues.

Results

Finally, here are the top 100 accounts most followed by the Hacker News community.

The Followers and Friends columns show the total counts.
Popularity equals the number of followers from within our HNers target group only. The results are ranked by this value. Protected Twitter accounts were not considered, which explains the difference between the popularity of @newsyc20 (12476) and its followers count (13377).

I created a Twitter list with this top 100 for your convenience. You can subscribe to it here - https://twitter.com/ducu/lists/hners-most-followed

Drop me a line if you want more data (I have the complete ranking of the accounts HNers follow, beyond the top 100) or if you need any help running this exercise. You can easily change the starting source: just replace 'newsyc20' in main.py with any other Twitter handle, and find out the results for yourself.

Cheers, @ducu

Rank Popularity Followers Friends Name (@twitter)
1 12476 13377 0 Hacker News 20 (@newsyc20)
2 5266 3781036 872 TechCrunch (@TechCrunch)
3 4600 17099119 165 Bill Gates (@BillGates)
4 3921 8836866 411 A Googler (@google)
5 3890 3115526 72 WIRED (@WIRED)
6 3562 31793932 131 Twitter (@twitter)
7 3488 4289712 2773 Mashable (@mashable)
8 3410 151838 2197 The Hacker News (@TheHackersNews)
9 3219 45567454 648558 Barack Obama (@BarackObama)
10 2926 1436383 39 Lifehacker (@lifehacker)
11 2894 12847085 985 The New York Times (@nytimes)
12 2774 1789255 1234 Tim O'Reilly (@timoreilly)
13 2666 5679068 111 The Economist (@TheEconomist)
14 2652 2618138 1195 Jack (@jack)
15 2615 182848 107 Paul Graham (@paulg)
16 2604 864374 36 Elon Musk (@elonmusk)
17 2528 1161272 1164 The Next Web (@TheNextWeb)
18 2490 44628570 823 YouTube (@YouTube)
19 2464 18354871 108 CNN Breaking News (@cnnbrk)
20 2451 340889 184 GitHub (@github)
21 2429 3112388 299 TED Talks (@TEDTalks)
22 2414 25807 0 Hacker News Bot (@hackernewsbot)
23 2410 4893760 914 Wall Street Journal (@WSJ)
24 2362 713304 134 Ars Technica (@arstechnica)
25 2361 36302 332 Hacker News Network (@ThisIsHNN)
26 2353 7330317 226 NASA (@NASA)
27 2351 1471424 566 Kevin Rose (@kevinrose)
28 2338 650732 331 marissamayer (@marissamayer)
29 2309 852540 207 Eric Schmidt (@ericschmidt)
30 2300 169518 111 Y Combinator (@ycombinator)
31 2272 3654098 102 Dropbox (@Dropbox)
32 2160 327636 1445 VentureBeat (@VentureBeat)
33 2116 410004 43534 Robert Scoble (@Scobleizer)
34 2098 1261958 3892 Fast Company (@FastCompany)
35 2058 55299 120 Household Hacker (@householdhacker)
36 2049 4364135 3848 Richard Branson (@richardbranson)
37 2046 1420148 2634 ReadWrite (@RWW)
38 2032 1315334 16407 Forbes Tech News (@ForbesTech)
39 2027 1051259 95 Engadget (@engadget)
40 2006 34642373 17 Instagram (@instagram)
41 1999 348795 897 Fred Wilson (@fredwilson)
42 1987 1027214 78 Gizmodo (@Gizmodo)
43 1984 1738702 1677 Ev Williams (@ev)
44 1981 1648398 181 Harvard Biz Review (@HarvardBiz)
45 1965 2343719 1 WikiLeaks (@wikileaks)
46 1956 673992 108 Medium (@Medium)
47 1947 2182676 624 Biz Stone (@biz)
48 1939 10964049 3 BBC Breaking News (@BBCBreaking)
49 1912 233408 13254 Dave McClure (@davemcclure)
50 1905 1011590 137 Google Developers (@googledevs)
51 1893 599434 585 Walt Mossberg (@waltmossberg)
52 1892 525973 115 The Verge (@verge)
53 1881 6846181 495 Breaking News (@BreakingNews)
54 1878 3506893 4727 Forbes (@Forbes)
55 1865 6112103 12 The Onion (@TheOnion)
56 1852 1393171 1482 Om Malik (@om)
57 1850 11566965 1 Conan O'Brien (@ConanOBrien)
58 1848 80316 1 Hacker News (@newsycombinator)
59 1840 13892218 89 Facebook (@facebook)
60 1810 251664 26 Gigaom (@gigaom)
61 1799 2249638 23233 Guardian Tech (@guardiantech)
62 1797 139903 951 Chris Dixon (@cdixon)
63 1793 13275 3390 Hacker Fantastic (@hackerfantastic)
64 1789 13746311 975 CNN (@CNN)
65 1777 212578 3 Techmeme (@Techmeme)
66 1777 4833004 1042 Reuters Top News (@Reuters)
67 1741 6083682 1614005 Hootsuite (@hootsuite)
68 1740 5930546 26 Android (@Android)
69 1707 9264066 0 Dalai Lama (@DalaiLama)
70 1701 180266 731 Eric Ries (@ericries)
71 1696 188864 1707 Michael Arrington (@arrington)
72 1684 6112812 772 TIME.com (@TIME)
73 1664 216829 1918 500 Startups (@500Startups)
74 1618 7118939 61 BBC News (World) (@BBCWorld)
75 1613 3398866 268 The New Yorker (@NewYorker)
76 1605 142123 160 Jeff Atwood (@codinghorror)
77 1601 157463 4070 Marc Andreessen (@pmarca)
78 1601 213855 435 Reid Hoffman (@reidhoffman)
79 1583 1587947 533 Chris Anderson (@TEDchris)
80 1560 5232083 1165 Microsoft (@Microsoft)
81 1560 964180 1013 Kara Swisher (@karaswisher)
82 1556 1222127 490 dick costolo (@dickc)
83 1556 1466482 799 Chris Sacca (@sacca)
84 1554 6826374 1 Stephen Colbert (@StephenAtHome)
85 1552 808219 1159 Smashing Magazine (@smashingmag)
86 1552 119710 184 DHH (@dhh)
87 1545 2312490 44 Neil deGrasse Tyson (@neiltyson)
88 1529 170710 694 Mark Suster (@msuster)
89 1528 1306755 873 Anonymous (@YourAnonNews)
90 1527 2356613 766 Mark Cuban (@mcuban)
91 1523 4373502 85 Google Chrome (@googlechrome)
92 1522 119780 3 Venture Hacks (@venturehacks)
93 1506 149725 998 MG Siegler (@parislemon)
94 1503 158058 3044 John Resig (@jeresig)
95 1496 6188 0 Hacker News 100 (@newsyc100)
96 1484 15775 2 News.YC (@HackerNews)
97 1468 4981 0 Hacker News 50 (@newsyc50)
98 1463 258575 199 Google Ventures (@GoogleVentures)
99 1451 366656 370 Matt Cutts (@mattcutts)
100 1451 4504186 5558 Huffington Post (@HuffingtonPost)