GitHub currently offers two ways to explore users and repositories:
- Direct search by search term, which matches names and tags.
- A recommender system under the 'Explore' tab, which gives suggestions to a user based on their usage of the service.

However, there is no way to search for connected entities, e.g., to find repositories or users that are highly related to each other.
The goal of this project is to build a GitHub repository search/recommender system that allows exploring connected repositories and people by leveraging the underlying graph structure of the repository database.
We decided to build graph node embeddings (`repo2vec` and `user2vec`) for the entire GitHub database using PyTorch-BigGraph (PBG). On top of the embedding representation, we built a query tool with a ranking engine.
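As an illustration of how a ranking over such embeddings can work, here is a minimal sketch that orders entities by cosine similarity to a query entity. The similarity measure, the function names (`cosine`, `rank_neighbors`), and the toy vectors are assumptions made for this example; the project's actual ranking engine may use a different metric and data layout.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_neighbors(query, embeddings, top_k=3):
    """Return the top_k (name, score) pairs most similar to `query`."""
    q = embeddings[query]
    scored = [
        (name, cosine(q, vec))
        for name, vec in embeddings.items()
        if name != query  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical values); repos and users share one vector space.
embeddings = {
    "repo:a/web-framework": [0.9, 0.1, 0.0],
    "repo:b/http-server": [0.8, 0.2, 0.1],
    "repo:c/ml-toolkit": [0.0, 0.2, 0.9],
    "user:alice": [0.85, 0.15, 0.05],
}

for name, score in rank_neighbors("repo:a/web-framework", embeddings, top_k=2):
    print(f"{name}\t{score:.3f}")
```

Because repositories and users live in the same space in this sketch, a single query can return both kinds of entities. For a database-scale number of nodes, a brute-force scan like this would likely need to be replaced with an approximate nearest-neighbor index.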
Data: http://ghtorrent.org/downloads.html
- Rename `resources/config.template.json` to `resources/config.json` and fill in your info;
- Download the SQL dump you like (here we use `2019-06-01`) into the `data/` folder (run the `db_download.sh` script in a terminal);
- Run the `project_notebook.ipynb` notebook;
- See `tb/README.md` for more info about launching TensorBoard with prepared embeddings and metadata (Docker-based, but it can be run without Docker if needed);
- Modify the code any way you like to find new insights and share them with us!
Visualizations with different kinds of tensors (embeddings) are available in TensorBoard: http://hel.sergibro.me:8002/#projector [hope not to forget to update if it moves]

Hints:
- open it from a desktop browser (it fetches hundreds of MB for larger tensors, and the computations are done on the client side!);
- for a better visual experience, run t-SNE instead of PCA for `500-1K` iterations on large tensors, with perplexity `5-15` and learning rate set to `1` (from our experience); for smaller tensors you can experiment more because there are fewer computations (at the cost of fewer data points);
- you may choose the feature to color by (language for repos, type for users, etc.)
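If you want to load your own embeddings into the projector, one common route is a pair of tab-separated files: one with the vectors and one with metadata, where the metadata columns become the "color by" options. The sketch below is only an assumption about how such files could be prepared; the helper name `to_projector_tsv` and the sample data are hypothetical, and the preparation actually used in this project is described in `tb/README.md`.

```python
import csv
import io


def to_projector_tsv(names, vectors, metadata, columns):
    """Build (vectors_tsv, metadata_tsv) strings for an embedding projector.

    A metadata file with more than one column needs a header row,
    so the header is always written here.
    """
    vec_buf = io.StringIO()
    meta_buf = io.StringIO()
    vec_writer = csv.writer(vec_buf, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")

    meta_writer.writerow(["name"] + columns)
    for name, vec in zip(names, vectors):
        vec_writer.writerow(vec)  # one embedding per line
        meta_writer.writerow([name] + [metadata[name][c] for c in columns])
    return vec_buf.getvalue(), meta_buf.getvalue()


# Hypothetical sample: two repositories with a "language" feature.
names = ["repo:a", "repo:b"]
vectors = [[0.1, 0.2], [0.3, 0.4]]
metadata = {
    "repo:a": {"language": "Python"},
    "repo:b": {"language": "Go"},
}

vectors_tsv, metadata_tsv = to_projector_tsv(names, vectors, metadata, ["language"])
print(vectors_tsv)
print(metadata_tsv)
```

The rows of the two files must stay in the same order, since the projector matches vectors to metadata by line position.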
Contacts: https://t.me/sergibro