
Getting programming language statistics for machine learning

Primary language: Python

GitHub skills dataset creator

This software is available to the public.

Copyright (C) 2013 - WikiTeams contributors

This project parses skill statistics (derived from pull requests) for a large number of the "most active GitHub users", using the GitHub API (with CoffeeScript), Google BigQuery, or both. The data is later used for machine learning, to predict which project a hypothetical new GitHub user would like to work on.

Branches: head, pygithub

Merged to head (as of 19.09.2013); pygithub is a dev branch.

Data already collected:
users and their pull requests.csv

It is the basis for big data analysis and machine learning.

It holds all pull requests in the format: repository name, skill (language) count, skill (language) name, user


It holds the most active GitHub users (by paulmillr's criteria) and their three most frequently used languages


It also holds their repos


It holds only their logins
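
As an illustration of how the pull request file described above could be turned into per-user skill counts, here is a minimal sketch. It assumes the four-column order given for that file (repository name, count, language, user); the function name `load_user_skills` and the sample rows (counts and languages) are made up for the example and are not taken from the project.

```python
import csv
import io
from collections import defaultdict


def load_user_skills(csv_file):
    """Aggregate pull request rows into {user: {language: count}}.

    Assumed row layout: repository name, count, language, user.
    """
    skills = defaultdict(lambda: defaultdict(int))
    for row in csv.reader(csv_file, skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        _repo, count, language, user = (field.strip() for field in row)
        if not count.isdigit():
            continue  # skip a header row, if the file has one
        skills[user][language] += int(count)
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {user: dict(langs) for user, langs in skills.items()}


# Made-up sample rows; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "linux_kernel3, 5, C, fabpot\n"
    "swap_unix, 2, C, fabpot\n"
    "swap_unix, 1, Shell, fabpot\n"
)
user_skills = load_user_skills(sample)
print(user_skills)
```

The result is the `{user: {Skills:experience}}` shape: each user maps to a dictionary of languages and contribution counts.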

Google BigQuery

It works by querying the GitHub timeline on Google BigQuery for the fields repository_name, count(payload_pull_request_head_repo_language), payload_pull_request_head_repo_language and payload_pull_request_head_user_login, and grouping the results by payload_pull_request_head_user_login, payload_pull_request_head_repo_language, repository_name.
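
The query described above might look roughly like the string built below. This is a sketch, not the project's actual query: the table reference is a placeholder to be replaced with the real timeline table, and the `skill_count` alias and the `build_timeline_query` helper are hypothetical names introduced for the example.

```python
# Fields to group by, in the order listed in the description above.
GROUP_FIELDS = [
    "payload_pull_request_head_user_login",
    "payload_pull_request_head_repo_language",
    "repository_name",
]


def build_timeline_query(table):
    """Build the SQL text; `table` is whatever table holds the GitHub timeline."""
    return (
        "SELECT repository_name, "
        # The alias skill_count is an assumption made for readability.
        "COUNT(payload_pull_request_head_repo_language) AS skill_count, "
        "payload_pull_request_head_repo_language, "
        "payload_pull_request_head_user_login "
        f"FROM {table} "
        "GROUP BY " + ", ".join(GROUP_FIELDS)
    )


# Placeholder table reference, not a real dataset name.
query = build_timeline_query("[your_project:your_dataset.timeline]")
print(query)
```

Each result row then carries a repository, a language, a user, and how many pull request records that user has for that language in that repository.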

Input for the iterate.py script

CSV file in format:




etc. (plain username)


CSV file in format:

username, repo


fabpot, linux_kernel3

fabpot, swap_unix
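
Reading either input format could be sketched as follows. This is only an illustration of the two layouts described above, not the actual code of iterate.py; the function name `read_user_repos` is hypothetical, and a row with only a username is returned with `None` as its repo.

```python
import csv
import io


def read_user_repos(csv_file):
    """Return a list of (username, repo) pairs; repo is None for plain usernames."""
    pairs = []
    for row in csv.reader(csv_file, skipinitialspace=True):
        fields = [field.strip() for field in row if field.strip()]
        if not fields:
            continue  # skip blank lines between entries
        username = fields[0]
        repo = fields[1] if len(fields) > 1 else None
        pairs.append((username, repo))
    return pairs


# The two example rows from above, separated by a blank line.
sample = io.StringIO("fabpot, linux_kernel3\n\nfabpot, swap_unix\n")
pairs = read_user_repos(sample)
print(pairs)
```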

Learning function

This is the data fed to the learner during the learning phase:

{user: {Skills:experience}, label} , {repository: {Skills:statistics}, label}

that is, a set of users and their contributions to repositories, characterized by language and intensity (the number of contributions).

Later, hypothetical users are supplied on standard input:

{user: {Skills:experience}}

For every such user, the output should be a repository that they would probably enjoy working on (one that already exists in the dataset from the learning phase).
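
To make the input and output shapes concrete, here is a small sketch of the prediction step. The learning algorithm is not specified here, so this example substitutes plain cosine similarity between a user's skill counts and each repository's language statistics as a stand-in. The function names and all numbers are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {language: count} dictionaries."""
    dot = sum(a[lang] * b.get(lang, 0) for lang in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


def recommend(user_skills, repo_stats):
    """Pick the known repository whose language profile best matches the user."""
    return max(repo_stats, key=lambda repo: cosine(user_skills, repo_stats[repo]))


# {repository: {Skills:statistics}} from the learning phase (made-up numbers).
repo_stats = {
    "linux_kernel3": {"C": 40, "Shell": 5},
    "swap_unix": {"Python": 12, "Shell": 3},
}

# {user: {Skills:experience}} for a hypothetical new user (made-up numbers).
new_user = {"Python": 8, "Shell": 1}

best = recommend(new_user, repo_stats)
print(best)
```

A trained classifier would replace `recommend` in practice; the point of the sketch is only that the output is always one of the repositories seen during the learning phase.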