My scrapers + data + analysis for PyConCanada2015 Keynote

(sorry github)

Frequency of libraries in requirements.txt files in Github Python repositories

This was done by scraping 10k+ Python repositories on Github that contain a requirements.txt file. This file is commonly used to store dependencies of the repository.
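As a rough illustration of how package names might be pulled out of a requirements.txt file, here is a minimal sketch. The function name `parse_requirements` and the regular expression are assumptions made for this example, not the scraper actually used for the talk, and it ignores many corners of the requirements file format.

```python
import re

# A package name is taken to be the leading run of letters, digits, dots,
# underscores and hyphens; anything after it (==1.2, >=2.0, [extras]) is ignored.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def parse_requirements(text):
    """Return the set of lowercased package names found in a requirements.txt body."""
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blank lines, pip options (-r, -e, ...) and direct URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(0).lower())
    return names
```

Each repository then becomes one set of names, which is the shape the co-occurrence counting needs.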


It's clear that the majority of Python repositories on Github are web development related, or that web developers are the most likely to include a proper requirements.txt file in their repositories.

Relationships between libraries

Using the data in `requirements.txt` files, we can find common co-occurrences of libraries. For example, it's not hard to imagine that whenever django is a requirement, so is psycopg2. In fact, in the dataset I had, 41% of all django apps also included psycopg2. These relationships can be mined using a simple algorithm called the apriori algorithm. Its history goes back to large department stores that were interested in what products were commonly bought together. The naive solution, comparing all possible pairs, results in a quadratic algorithm - and if you have thousands of products, this becomes inefficient quickly. The apriori algorithm intelligently cuts through this massive space by only considering combinations whose members are already frequent on their own.
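To make the pruning idea concrete, here is a minimal sketch of the apriori idea restricted to pairs. The name `frequent_pairs` and the `min_support` threshold are assumptions for this example; the full algorithm extends the same step to larger itemsets, and this is not necessarily the implementation used for the results here.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: count} for pairs that appear in at least min_support baskets."""
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: a pair can only be frequent if both of its members are frequent,
    # so candidate pairs are built from the surviving items alone.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= min_support}
```

Rare libraries are dropped after the first pass, so the second pass never has to count the pairs that contain them.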

Here are the other common libraries paired with django:

Confidence is defined as:

confidence = P(ending_with | starting_with)
           = P(starting_with and ending_with) / P(starting_with)
           = #{requirements.txt files with both} / #{requirements.txt files with starting_with}

starting_with  ending_with      starting_with_occurrences  confidence      occurrences  ending_with_occurrences
django         requests         2714                       0.243920412675  662          2463
django         wheel            2714                       0.22402358143   608          1649
django         six              2714                       0.245394252027  666          1985
django         psycopg2         2714                       0.411569638909  1117         1573
django         gunicorn         2714                       0.320191599116  869          1531
django         dj-database-url  2714                       0.263448784083  715          728
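The confidence column can be reproduced directly from counts. The sketch below assumes each repository has already been reduced to a set of package names; the function name `confidence` is hypothetical.

```python
def confidence(requirement_sets, starting_with, ending_with):
    """Estimate P(ending_with | starting_with) from a list of sets of package names."""
    with_start = [reqs for reqs in requirement_sets if starting_with in reqs]
    if not with_start:
        return 0.0  # the rule is undefined when starting_with never occurs
    with_both = sum(1 for reqs in with_start if ending_with in reqs)
    return with_both / len(with_start)
```

With the counts from the table, 1117 / 2714 gives the 0.41 confidence reported for django and psycopg2.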

Here are the results for other libraries including some metrics to sort on. To read more about these metrics, see this link.

Now, let's recommend libraries based on these relationships

So, if we know a user installed django, we can perhaps recommend that they also install psycopg2 (according to the above, we would be right 41% of the time). We can turn these co-occurrences into a very simple recommendation algorithm for Python libraries! So I've gone ahead and done that.
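A recommender of this kind can be as small as a lookup over the mined rules. The sketch below is an illustration only, not pipp's actual implementation: `RULES` holds the django rows from the table above, and `recommend` and `top_n` are hypothetical names.

```python
# (starting_with, ending_with, confidence) rows copied from the django table.
RULES = [
    ("django", "requests", 0.243920412675),
    ("django", "wheel", 0.22402358143),
    ("django", "six", 0.245394252027),
    ("django", "psycopg2", 0.411569638909),
    ("django", "gunicorn", 0.320191599116),
    ("django", "dj-database-url", 0.263448784083),
]


def recommend(package, rules, top_n=3):
    """Return up to top_n packages most often required alongside `package`."""
    candidates = [(conf, other) for start, other, conf in rules if start == package]
    # Highest confidence first.
    return [other for conf, other in sorted(candidates, reverse=True)[:top_n]]
```

For example, `recommend("django", RULES, top_n=2)` returns `["psycopg2", "gunicorn"]`, the two highest-confidence rows.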

pipp: one of the ps stands for personalized!

Yes, that's right - we can bring you library recommendations right to the command line. Try it out!

pip install pipp

$ pipp install jsmin
Requirement already satisfied (use --upgrade to upgrade): jsmin in /Users/camerondavidson-pilon/.virtualenvs/data/lib/python2.7/site-packages
pipp: Other users who installed jsmin also installed cssmin

Command line too nerdy for you? How about recommendations on PyPI?


Network force-layout of libraries in requirements.txt files in Github Python repositories

This is biased, as some libraries have their own requirements. For example, Pandas depends on Numpy, so it would be less common to have both Pandas and Numpy in a requirements.txt file.


The Plural of Anecdote is Data!

I've often heard techies say the plural of anecdote is not data. I see where they are coming from, however I think they are being shortsighted. Whereas one or two occurrences of something is not enough evidence to prove a fact, it is evidence of something interesting. And if you have a tool to quickly confirm or deny further occurrences of this anecdote, then yes, you have data. For example, how often do you see links to StackOverflow questions in code? I have seen it before, and I wondered, how common is this?

Using one of the greatest anecdote validation tools, Search, we can validate this idea:


Great, now let's start scraping. Here are the most common questions linked in Python code:

stackoverflow.com/questions/19622133    1173
stackoverflow.com/questions/279237       887
stackoverflow.com/questions/5658622      320
stackoverflow.com/questions/22019341     134
stackoverflow.com/questions/35817        117
stackoverflow.com/questions/1769332       89
stackoverflow.com/questions/377017        86
stackoverflow.com/questions/1189781       73
stackoverflow.com/questions/4124220       70
stackoverflow.com/questions/701802        66

Let's investigate the first one. It's a very specific question about Windows and ctypes - not a common problem in the first place. If we search for just that url on Github, we see it's all from the same file, windows_support.py. Investigating those repos with the url, we see that 1. not only is this from code inside Python 2.7, but 2. people are including all of Python 2.7 in their Github repos!
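Counts like the ones listed above could be produced by scanning source files for question URLs. This is a minimal sketch under that assumption; the regular expression and the name `count_question_links` are illustrative, not the scraper used for the talk.

```python
import re
from collections import Counter

# Capture the numeric question id; any trailing slug after the id is ignored.
SO_QUESTION = re.compile(r"stackoverflow\.com/questions/(\d+)")


def count_question_links(sources):
    """Count StackOverflow question links across an iterable of source-code strings."""
    counts = Counter()
    for source in sources:
        for question_id in SO_QUESTION.findall(source):
            counts["stackoverflow.com/questions/" + question_id] += 1
    return counts
```

Counting every occurrence, rather than once per repository, is what lets a single vendored file such as windows_support.py dominate the ranking.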

Most controversial Python StackOverflow answer

StackOverflow has become the most popular forum for developers to ask, answer and importantly promote or demote content. StackOverflow does something even more incredible: they expose all their interaction data (questions, answers, views, votes) through a public query interface. Using this, we can compute, what is the most controversial Python answer?

To do this, we will use the following algorithm: find the answer whose share of upvotes, U/(U+D), is close to 0.5, and which also has lots of votes. The former requirement is a good definition of "controversial", and the latter requirement protects us against answers with trivial counts (ex: 1 upvote and 1 downvote). Think of it as a balancing act between how evenly split the votes are and how confident we are that this answer is indeed the most controversial. The following query accomplishes this (based on a similar equation in this post):

declare @VoteStats table (parentid int, id int, U float, D float)

insert @VoteStats
SELECT
  a.parentid,
  a.id,
  -- upvotes (VoteTypeID = 2) and downvotes (VoteTypeID = 3), each with 1 added
  CAST(SUM(case when (VoteTypeID = 2) then 1. else 0. end) + 1. as float) as U,
  CAST(SUM(case when (VoteTypeID = 3) then 1. else 0. end) + 1. as float) as D
FROM Posts q
JOIN PostTags qt
  ON qt.postid = q.ID
JOIN Tags T
  ON T.Id = qt.TagId
JOIN Posts a
  ON q.id = a.parentid
JOIN Votes
  ON Votes.PostId = a.Id
WHERE TagName  = 'python'
   and a.PostTypeID = 2 -- these are answers
Group BY a.id, a.parentid

set nocount off

SELECT TOP 100
 parentid,
 -- link to the answer, built from its id
 'http://stackoverflow.com/questions/' + CAST(id as varchar(20)) as url,
 U, D,
 ABS(0.5 - U/(U+D) - 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) +
   ABS(0.5 - U/(U+D) + 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) as Score
FROM @VoteStats
ORDER BY Score -- smallest score first

Running this produces the following table (as of Oct. 24, 2015):

parentid  url                                         U    D   Score
1641219   http://stackoverflow.com/questions/1641305  100  58  0.267581687129904
366980    http://stackoverflow.com/questions/367082   55   29  0.360985397926758
904928    http://stackoverflow.com/questions/904941   44   40  0.379197639329681
1641219   http://stackoverflow.com/questions/1945699  49   23  0.382002382488145
734368    http://stackoverflow.com/questions/734910   48   30  0.38315203605798
7479442   http://stackoverflow.com/questions/7479473  46   23  0.394405318873308
620367    http://stackoverflow.com/questions/620397   42   24  0.411383595098925
969285    http://stackoverflow.com/questions/969324   49   20  0.420289855072464
1566266   http://stackoverflow.com/questions/1566285  39   24  0.424918292799399

The closer the score is to 0, the more controversial it is. Take a look at the answers' comments to see debates about why the answer is controversial.
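The Score expression is easier to read outside of SQL. The sketch below mirrors the formula in the query term by term; the function name `controversy_score` and the parameter `z` (the 3.5 in the query) are names chosen for this example. The square-root term is the standard deviation of a Beta(U, D) distribution, so the score measures how far the band p ± z·sd sits from 0.5.

```python
import math


def controversy_score(u, d, z=3.5):
    """Score from the query above; u and d already include the +1 added in SQL."""
    total = u + d
    p = u / total                                           # share of upvotes
    sd = math.sqrt(u * d / (total * total * (total + 1)))   # std. dev. of Beta(u, d)
    return abs(0.5 - p - z * sd) + abs(0.5 - p + z * sd)
```

When 0.5 lies inside the band, the score reduces to the band's width, 2·z·sd, which shrinks as votes accumulate. When 0.5 lies outside, it reduces to 2·|0.5 - p|. Plugging in U = 100, D = 58 from the first row reproduces its score of about 0.2676.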

2-Spaces vs 4-Spaces

Let's not argue: let's look at the empirical data. I looked at over 23 thousand Python repos and computed what the most common indenting practice was in each repo. The results were quite in favor of 4-spaces: 88% of repos used 4-spaces, and only 7% of repos used 2-spaces. What about the remaining 5%? Well, some repos used 8-spaces, and some used 1-space! Examples: https://github.com/aqt01/UnderWaterWorld uses 8-spaces, and https://github.com/sanglech/CSC326 uses 1-space.
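One way to estimate a file's indentation width is to look at how many spaces are added after each line that opens a block. The heuristic below is an assumption made for illustration (the name `indent_width` is hypothetical) and is not necessarily how the numbers above were computed. It ignores tabs and block openers followed by a trailing comment.

```python
from collections import Counter


def indent_width(source):
    """Return the most common number of spaces added after a line ending in ':'."""
    steps = Counter()
    previous = None  # (indent, opens_block) of the last non-blank, non-comment line
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(stripped)
        if previous is not None and previous[1] and indent > previous[0]:
            steps[indent - previous[0]] += 1
        previous = (indent, stripped.rstrip().endswith(":"))
    return steps.most_common(1)[0][0] if steps else None
```

Taking the most common value per file, and then per repo, gives one indentation style per repository to tally.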


What is the most popular testing framework?

Passing through the tens of thousands of repos, I looked for imports of the most popular testing libraries: pytest, unittest, nose and testify. Here were the results:

package   count  percent of total
None      22162  86%
unittest  3032   12%
nose      379    1.5%
pytest    293    1%
testify   4      ~0%
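Detecting which framework a file imports can be sketched with a regular expression. A regex is used here rather than the `ast` module so that Python 2 sources can still be scanned. The names `FRAMEWORKS`, `IMPORT_RE` and `frameworks_used` are assumptions for this example, not the code behind the table.

```python
import re

FRAMEWORKS = ("pytest", "unittest", "nose", "testify")

# Top-level module of each `import x` / `from x.y import z` line.
# Only the first module of `import a, b` is seen by this pattern.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)


def frameworks_used(source):
    """Return the set of known testing frameworks imported in a source string."""
    top_level = set(IMPORT_RE.findall(source))
    return {name for name in FRAMEWORKS if name in top_level}
```

A repo would land in the "None" row when this returns an empty set for every file it contains.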


What about using Python for functional programming?

If you are going to use Python for functional programming, or semi-functional programming, you're probably going to be using libraries like functools, itertools, toolz and others. How many Python repos use this style of programming? The data shows about 15% of repos do this.

How often do we disobey flat is better than nested?

from com.sun.org.apache.xerces.internal.impl.io import \

(from here)

Is this ugly or beautiful? Python says it's ugly - after all, flat is better than nested. How often do we break this? For this, I looked at the maximum import nest in each repo. Here's the breakdown:
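Here "import nest" is taken to mean the number of dotted components in the imported module path. That definition, like the names `IMPORT_RE` and `max_import_depth`, is an assumption of this sketch.

```python
import re

# Dotted module path following `import` or `from` at the start of a line.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)


def max_import_depth(source):
    """Return the largest number of dotted components in any imported module path."""
    depth = 0
    for module in IMPORT_RE.findall(source):
        # Leading dots of relative imports produce empty parts, which are dropped.
        parts = [part for part in module.split(".") if part]
        depth = max(depth, len(parts))
    return depth
```

Under this definition, the com.sun.org.apache.xerces.internal.impl.io import shown above has a depth of 8.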


Topic modelling Python source code using LDA

What happens when we apply a topic modelling algorithm, like Latent Dirichlet Allocation, to hundreds of thousands of Python source code files? To be clear: this is not something you usually do! Topic modelling is meant for articles and reviews: human-readable text. Python code, on the other hand, is full of keywords in illogical order and words repeated over and over again, and developers use odd acronyms and abbreviations for all their variables! But let's try it anyway.

After training LDA on the repos and libraries I downloaded, I came out with these topics. For example, we can see the topic:

python, version, package, author, setup, description, language, copyright, packages, license

obviously this is the setup.py topic

test, equal, case, tests, foo, unittest, equals, suite, result, expected

this is the testing topic,

grid, color, plot, plt, label, step, data, width, ax, size

the matplotlib plotting topic.

See if you can find others in the output above.
