GitHub LDA is a library that applies topic modeling to GitHub repositories to improve repository recommendations.
wget https://github.s3.amazonaws.com/data/download.zip
unzip download.zip
mkdir repo_dir
github_lda clone -i download/repos.txt -o repo_dir [-p 4]
There are around 120,000 repositories to download, so this will take a VERY long time and use a huge amount of disk space (up to 1TB). Use the -p option to set the number of clones to run in parallel. To avoid the limit on the number of subdirectories per directory on some *nix filesystems, the repositories are split by default into 13 subdirectories as follows:
repo_dir
|---0
|   |---1
|   |---2
|   |---...
|   `---9999
|---1
|   |---10000
|   |---10001
|   |---...
.
|   `---119999
`---12
    |---120000
    |---120001
    |---...
    `---123344
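The layout above suggests that each repository lands in the subdirectory numbered by its ID divided by 10,000 (integer division). The following is a minimal Python sketch of that mapping, not github_lda's actual code: it assumes the numeric directory names are repository IDs, and `shard_path` and `BUCKET_SIZE` are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath

# Assumption inferred from the tree above: 10,000 repositories per subdirectory.
BUCKET_SIZE = 10_000


def shard_path(repo_dir: str, repo_id: int, bucket_size: int = BUCKET_SIZE) -> str:
    """Return the path where a repository with the given ID would be stored."""
    if repo_id < 0:
        raise ValueError("repo_id must be non-negative")
    bucket = repo_id // bucket_size  # 0 for IDs 0..9999, 1 for 10000..19999, ...
    return str(PurePosixPath(repo_dir) / str(bucket) / str(repo_id))


if __name__ == "__main__":
    for repo_id in (1, 9999, 10000, 123344):
        print(shard_path("repo_dir", repo_id))
```

Capping each subdirectory at a fixed number of entries keeps every directory well under the subdirectory limits of older filesystems.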
mkdir term_freq_dir
github_lda calctf -i repo_dir -o term_freq_dir [--stopwords=/path/to/stopwords] [--lang=ruby,javascript] [--process=1]
You can limit the repositories of interest by using the --lang option. By default, term frequencies for source files of all programming languages will be calculated. Refer here for the list of available language options. You can also specify the number of processors to run on by using the --process option.
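To make the term-frequency step concrete, here is a rough Python sketch of counting terms in a single source file while skipping stopwords. It is an illustration under stated assumptions, not the tokenizer that calctf uses: the identifier-style regular expression, the lowercasing, and the function name `term_frequencies` are all hypothetical choices.

```python
import re
from collections import Counter

# Assumption: a "term" is an identifier-like token; calctf may tokenize differently.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def term_frequencies(source: str, stopwords=frozenset()) -> Counter:
    """Count how often each non-stopword term occurs in a source string."""
    tokens = (tok.lower() for tok in TOKEN_RE.findall(source))
    return Counter(tok for tok in tokens if tok not in stopwords)


if __name__ == "__main__":
    code = "def add(a, b): return a + b"
    print(term_frequencies(code, stopwords={"def", "return"}))
```

A stopword list is useful for source code because language keywords appear in almost every file and would otherwise dominate the topics.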
Generate mult.dat, user.dat, item.dat, and vocab.dat in the specified directory:
mkdir data
github_lda generate --tf term_freq_dir -i download/data.txt -o data
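For orientation, the LDA-C input format used for mult.dat stores one document per line: the number of unique terms, followed by `term_id:count` pairs. The Python sketch below shows how per-repository term frequencies could be turned into such lines. The sorted-vocabulary indexing and the helper names `build_vocab` and `to_mult_line` are assumptions for illustration; the generate command may assign term IDs differently.

```python
def build_vocab(docs):
    """Assign each distinct term a 0-based index, in sorted order (an assumption)."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: idx for idx, term in enumerate(terms)}


def to_mult_line(doc, vocab):
    """Format one document as '<unique terms> <id>:<count> <id>:<count> ...'."""
    pairs = sorted((vocab[term], count) for term, count in doc.items() if count > 0)
    return " ".join([str(len(pairs))] + [f"{idx}:{count}" for idx, count in pairs])


if __name__ == "__main__":
    docs = [{"add": 1, "a": 2, "b": 2}, {"a": 1, "parse": 3}]
    vocab = build_vocab(docs)
    for doc in docs:
        print(to_mult_line(doc, vocab))
```

In this scheme, line N of the vocabulary file would hold the term whose ID is N (counting from zero), which is how topic-term indices are mapped back to words.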
mkdir lda-result
lda est 0.1 100 settings.txt data/mult.dat random lda-result
ctr --user data/user.dat --item data/item.dat --mult data/mult.dat \
--theta_init lda-result/final.gamma --beta_init lda-result/final.beta
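In the collaborative topic model of Wang and Blei, the predicted preference of a user for an item is the inner product of the user's latent vector and the item's latent vector. The Python sketch below shows how recommendations could be ranked once those vectors are available. It assumes the vectors have already been loaded into plain lists, and the names `score` and `recommend` are hypothetical; it is not part of ctr or github_lda.

```python
def score(user_vec, item_vec):
    """Predicted preference: inner product of user and item latent vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def recommend(user_vec, item_vecs, already_seen=(), top_k=5):
    """Return the IDs of the top_k highest-scoring items the user has not seen."""
    seen = set(already_seen)
    scored = [
        (score(user_vec, vec), item_id)
        for item_id, vec in enumerate(item_vecs)
        if item_id not in seen
    ]
    # Highest score first; ties are broken by the lower item ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_k]]


if __name__ == "__main__":
    user = [1.0, 0.0]
    items = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    print(recommend(user, items, already_seen=[0], top_k=2))
```

Excluding repositories the user already watches keeps the ranking focused on new suggestions.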
- Wikipedia article on LDA
- Official GitHub Blog post: The 2009 GitHub Contest
- Official GitHub Blog post: About the GitHub Contest
Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In Proc. of KDD'11.