/data-challenge-2014

Bundled-Edge Views of the Github Event Graph


Motivation

Github Data Challenge 2014

Results

Bundled-Edge Views of the Github Event Graph

Methodology and Source Code

At the heart of this plot are the SQL queries (in BigQuery's dialect) sent to Google BigQuery. The queries come in two types: a model query and a state query. The model query collects the data needed to build a Markov matrix by counting transitions between sequential events. The state query computes a census of the most recent events (i.e. events not followed by another event). These two result sets are then "munged" by munger.py, and a cluster-detection algorithm groups the events. The results are gathered into a single JSON structure (example: results.json). The front end retrieves the results via AJAX and generates the illustrations using D3.js.
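The transition-counting step above can be sketched as follows. This is a minimal illustration of building a row-normalized Markov matrix from sequences of events, not the actual code in munger.py; the function name and input shape are assumptions.

```python
from collections import defaultdict

def build_markov_matrix(event_sequences):
    """Count transitions between consecutive events and row-normalize.

    `event_sequences` is a list of per-actor event lists, e.g.
    [["PushEvent", "IssuesEvent", "PushEvent"], ...]. The event names
    mirror GitHub's event types; the function name is illustrative.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in event_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1

    # Collect every state that appears as a source or a target.
    states = sorted(set(counts) | {b for row in counts.values() for b in row})

    # Normalize each row so outgoing probabilities sum to 1.
    matrix = {}
    for a in states:
        total = sum(counts[a].values())
        matrix[a] = {b: (counts[a][b] / total if total else 0.0)
                     for b in states}
    return states, matrix
```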

Dependencies

This application uses MCL. The source is contained in the external/ directory. Install as follows:

  • tar xfz mcl-latest.tar.gz
  • cd mcl-14-137
  • ./configure --prefix=`pwd`
  • make
  • make install

MCL will then be installed to external/mcl-14-137. If you install to a different path, then change the MCL_BIN string found in mclinterface.py.
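mclinterface.py drives the mcl binary through its command-line interface. A sketch of what such a wrapper can look like; the to_abc and run_mcl names are illustrative (not the code in mclinterface.py), while --abc and -I are standard mcl flags:

```python
import os
import subprocess
import tempfile

# Default install path, as in mclinterface.py; adjust if you installed elsewhere.
MCL_BIN = "external/mcl-14-137/bin/mcl"

def to_abc(edges):
    """Serialize weighted edges into MCL's label ("ABC") input format:
    one 'source<TAB>target<TAB>weight' line per edge."""
    return "".join("%s\t%s\t%s\n" % (a, b, w) for a, b, w in edges)

def run_mcl(edges, inflation=2.0, mcl_bin=MCL_BIN):
    """Run mcl on an edge list and return clusters, one list of node
    labels per output line."""
    with tempfile.NamedTemporaryFile("w", suffix=".abc", delete=False) as f:
        f.write(to_abc(edges))
        in_path = f.name
    out_path = in_path + ".clusters"
    try:
        subprocess.check_call([mcl_bin, in_path, "--abc",
                               "-I", str(inflation), "-o", out_path])
        with open(out_path) as f:
            return [line.split() for line in f]
    finally:
        os.remove(in_path)
        if os.path.exists(out_path):
            os.remove(out_path)
```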

NumPy/SciPy are also required. If you're on OS X Mavericks, the ScipySuperpack is a convenient way to install them.

Authorization of APIs

This application uses Google Bigquery. You'll need to supply authenticated credentials:

  • Log into Google Developer Console
  • Navigate to the project list
  • Create a new project. (Or use one you may have previously created :)
  • Enable the BigQuery API: Select Project -> APIs and Auth -> APIs -> BigQuery
  • Generate a client_secrets JSON: APIs and Auth -> Credentials -> Create New Client ID
  • Download the generated JSON and save as client_secrets.json to the root of this project.
  • When you run the app a browser window will open and request authorization.
  • Authorize it.
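Once client_secrets.json is in place, a quick sanity check before the first run can save a confusing OAuth failure. This helper is not part of the project; the key names follow the standard client_secrets.json layout for installed applications.

```python
import json
import os

def check_client_secrets(path="client_secrets.json"):
    """Verify the downloaded credentials file exists and contains the
    fields the OAuth flow will need. Returns the client_id on success."""
    if not os.path.exists(path):
        raise SystemExit(
            "Missing %s -- download it from the Developer Console." % path)
    with open(path) as f:
        data = json.load(f)
    # Installed (desktop) apps use the "installed" key; web apps use "web".
    kind = "installed" if "installed" in data else "web"
    missing = {"client_id", "client_secret"} - set(data.get(kind, {}))
    if missing:
        raise SystemExit(
            "client_secrets.json lacks: %s" % ", ".join(sorted(missing)))
    return data[kind]["client_id"]
```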

Data Collection and Munging

main.py is where interested readers should begin. It is invoked as follows:

python main.py -i identifier -q bigquery-id model:model.sql state:state.sql

  • -i [setId] This identifies the set. The query results will be stored in a folder named data/[setId]. If no query is specified using -q, then the most recent queries in the [setId] folder will be (re)munged.
  • -q [projectId] [name:sql name2:sql2] ... projectId is a BigQuery project number (ex: 'spark-mark-911'). The [name:sql] entries specify sql files and the id to use when storing the results. Each of the sql files will be sent to BigQuery, and the responses recorded under data/[setId]/[name]. The munger will subsequently process the responses to produce results.json.
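The flag handling described above could be parsed like this. A sketch using argparse, not main.py's actual option-handling code; the helper name is illustrative.

```python
import argparse

def parse_args(argv):
    """Parse main.py-style arguments: -i <setId> and an optional
    -q <projectId> name:sqlfile [name2:sqlfile2 ...]."""
    parser = argparse.ArgumentParser(
        description="Collect and munge BigQuery results")
    parser.add_argument("-i", dest="set_id", required=True,
                        help="result-set id; output goes to data/<setId>/")
    parser.add_argument("-q", dest="query", nargs="+", default=None,
                        metavar="ARG",
                        help="BigQuery project id, then name:sqlfile pairs")
    args = parser.parse_args(argv)
    if args.query:
        project_id, pairs = args.query[0], args.query[1:]
        # Map each query name to its SQL file, e.g. {"model": "model.sql"}.
        queries = dict(p.split(":", 1) for p in pairs)
        return args.set_id, project_id, queries
    # No -q given: re-munge the most recent queries for this set.
    return args.set_id, None, {}
```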

Use the Scripts

collect.sh demonstrates the use of main.py. It is the same script used to generate the results shown on this page. If you wish to watch it operate:

./collect.sh [projectId] You'll need to specify the projectId obtained from your Google developer console.

deploy.sh generates the presentation pages and writes them to the specified directory. It assumes that collect.sh has completed successfully. The generated site should be served through a webserver because the results.json files are loaded via AJAX. If you do not have a local webserver, Node's http-server or Python's SimpleHTTPServer are easy and recommended:

  • ./deploy.sh deployed/path/
  • http-server deployed/path/ or cd deployed/path && python -m SimpleHTTPServer 8080
  • Navigate to http://localhost:8080

Citations

  • Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

Contact

abrhie@gmail.com

Addendum

These images show the evolution from very hairball, to less hairball, to combed hairball.

[Images: first, second, third]

More Features

Please see the dev branch's README.md for additional features developed after submission. The conclusion of this competition shares poetic cohomology with a certain video of a polar bear and a can of condensed milk (video no longer available).