In this homework, we will dive deeper into using both Git and GitHub, the web-based service built on top of Git. We will be exploring graphs to better understand project activity and will iteratively produce node-link layouts of GitHub data.
Similar to Homework 1, we won't start by building the actual visualization right away. Instead, we are going to build up some background knowledge that will be useful for the final visualization.
Design Studio - We recommend you start by reading the entirety of the homework. Problem 3 is a design problem and will be addressed during the design studio the week following the release of this homework. However, you can start sketching before and use those sketches for this homework, and of course for the design studio.
Homework 1 Feedback - Please fill out the feedback form for Homework 1. We would like to hear what you found engaging and to gauge the difficulty; your feedback allows us to balance tasks better.
- Refresh your GitHub skills by looking at Lab 1.
- If you're not familiar with them, review graph data structures and their various design variations, as well as the lecture held on Feb. 14.
- Read the D3 chapters from Scott Murray's book.
Answer the questions in a file problem_1_answers.md.
As you may already know, GitHub tracks two types of data: the underlying Git revision data, such as forks or commits, and GitHub-specific "social" data, such as user avatars, favorite projects, and following/followers, among others.
GitHub provides two ways to access both revision and social data:
- The GitHub website, containing accessible information for use by humans.
- The GitHub API with raw data, for structured access by machines through programming languages.
Note that not all social data are shared by GitHub, mostly for privacy reasons.
To access its raw data, GitHub provides a public Application Programming Interface (API). An API is a common interface that lets web sites (or software frameworks and libraries) allow and control limited access to their databases. It is usually well documented (see the GitHub API), and you can use the API to query the database directly in a browser, just by supplying the correct URL. Here is an example of using the GitHub API to query the Django repository.
A program can query the API in two ways:
- By first downloading the data from the API, and then using it offline (changes on the servers won't be reflected in the downloaded data from that moment on).
- By calling the API every time the program runs (the data will be up-to-date at each API call).
In both cases, you'll need to find the URL that corresponds to your query using the GitHub API.
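As a minimal sketch of how such query URLs can be assembled, the helpers below build the repository, branches, and commits URLs for one repository. The function names are hypothetical, and the branch name in the usage line is only a placeholder. The sketch assumes the commits endpoint accepts the sha and per_page query parameters, with per_page capped at 100.

```javascript
// Hypothetical helpers (not part of the homework files) that build GitHub API URLs.
const API_ROOT = "https://api.github.com";

// URL of the repository itself, e.g. https://api.github.com/repos/django/django
function repoUrl(owner, repo) {
  return API_ROOT + "/repos/" + encodeURIComponent(owner) + "/" + encodeURIComponent(repo);
}

// URL listing the branches of a repository
function branchesUrl(owner, repo) {
  return repoUrl(owner, repo) + "/branches";
}

// URL listing the commits reachable from one branch.
// Assumption: `sha` selects the branch and `per_page` is limited to 100.
function commitsUrl(owner, repo, branch, perPage) {
  const query = "sha=" + encodeURIComponent(branch) + "&per_page=" + Math.min(perPage, 100);
  return repoUrl(owner, repo) + "/commits?" + query;
}

// "some-branch" is a placeholder; use a branch name returned by branchesUrl
console.log(commitsUrl("django", "django", "some-branch", 50));
```

The same URLs work for both approaches: pass them to curl to download a file once, or to d3.json to query the API on every page load.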
For instance, you can download data from GitHub with the curl command-line tool, using the syntax shown below. In case of a security error, use the -k parameter or download the file another way (e.g., with your browser).
curl https://api.github.com/repos/django/django -o repos_django.json
You should see output similar to the following (partial) example. The output given is in JSON, a common and convenient file format that works well with D3. To understand JSON formatting, read the JSON documentation.
{
"id": 4164482,
"name": "django",
"full_name": "django/django",
"owner": {
"login": "django",
"id": 27804,
"avatar_url": "https://gravatar.com/avatar/fd542381031aa84dca86628ece84fc07",
"gravatar_id": "fd542381031aa84dca86628ece84fc07",
"url": "https://api.github.com/users/django",
To learn more about the GitHub API, read the related GitHub API documentation.
As an alternative to downloading data, you can load the URL from the GitHub API. GitHub allows Cross-Origin Resource Sharing (CORS), which means you can directly call the API from JavaScript as follows:
d3.json("https://api.github.com/search/repositories?q=visualization+language:javascript&sort=stars&order=desc", function(data) {
})
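This should produce JSON output like the example shown above, although this search query returns a list of matching repositories rather than a single repository.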
Warning: GitHub enforces a request limit of 60 unauthenticated requests per hour. If you are consistently hitting that limit, you can raise it to 5000 by authenticating. The rate limit should not pose a problem, but it is your responsibility to ensure you don't get locked out hours before the assignment is due.
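To keep an eye on the limit, you can inspect the rate-limit information that GitHub sends along with API responses. The sketch below assumes the response headers X-RateLimit-Remaining and X-RateLimit-Reset (the reset time in seconds since the epoch) and assumes they have been copied into a plain object with lower-cased names. The function name rateLimitStatus is hypothetical.

```javascript
// Hypothetical helper: summarizes GitHub's rate-limit headers.
// Assumption: `headers` is a plain object with lower-cased header names,
// and `nowMs` is the current time in milliseconds (e.g. Date.now()).
function rateLimitStatus(headers, nowMs) {
  const remaining = parseInt(headers["x-ratelimit-remaining"], 10);
  const resetMs = parseInt(headers["x-ratelimit-reset"], 10) * 1000; // epoch seconds -> ms
  const waitSeconds = Math.max(0, Math.ceil((resetMs - nowMs) / 1000));
  return {
    remaining: remaining,
    exhausted: remaining === 0,
    // only report a waiting time when no requests are left
    waitSeconds: remaining === 0 ? waitSeconds : 0
  };
}

// Example with made-up header values
const status = rateLimitStatus(
  { "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1000" },
  940000
);
console.log(status.exhausted, status.waitSeconds);
```

If the limit is exhausted, falling back to a previously downloaded JSON file avoids being blocked until the reset time.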
Start by picking a GitHub repository you know. You should pick a medium-sized public repository, maybe one you have made contributions to. You can take a look at the Caleydo repository as an example. You should not pick a large, popular repository (we'll be looking at those in Problem 3).
The GitHub website provides a user-friendly way to look at the data. On GitHub, each user has a profile page (example). Carefully explore this page and its interactive features. As you may have noticed, the page features a central calendar map showing the user's activity. Below is a capture of such a visualization:
Open a GitHub profile and use your browser's developer tools to locate the div with the id of contributions-calendar. Its first child div has a data-url attribute, which points to the raw data used by the visualization. It is a univariate time series and below is a sample of the data:
[["2013/02/03",0],["2013/02/04",0],["2013/02/05",0],["2013/02/06",0],["2013/02/07",0],["2013/02/08",0],["2013/02/09",0],["2013/02/10",0]
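The sample above is a list of [date, count] pairs. A minimal sketch of turning such pairs into objects that are easier to bind to a visualization is shown below. The function names and the field names date and count are hypothetical, and dates are interpreted as UTC by assumption.

```javascript
// Hypothetical helper: converts [["2013/02/03", 0], ...] into
// [{date: Date, count: number}, ...]. Dates are treated as UTC (an assumption).
function parseCalendar(rows) {
  return rows.map(function (row) {
    const parts = row[0].split("/").map(Number); // [year, month, day]
    return {
      date: new Date(Date.UTC(parts[0], parts[1] - 1, parts[2])), // months are 0-based
      count: row[1]
    };
  });
}

// Sum of all daily contribution counts
function totalContributions(days) {
  return days.reduce(function (sum, d) { return sum + d.count; }, 0);
}

// First three entries of the sample above
const sample = [["2013/02/03", 0], ["2013/02/04", 0], ["2013/02/05", 0]];
const days = parseCalendar(sample);
console.log(days.length, totalContributions(days));
```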
Take a look at other visualizations from GitHub's website:
- Contributors to a repository
- Commits Activity
- Code Frequency
- Punch card
- Pulse of a repository
For each of the 5 visualizations listed above plus the calendar map, answer the following questions:
- Who is the audience? (e.g. project manager, contributor, project user, visitor, etc.)
- What data have been used? How can you get the data using the GitHub API? (Note that it can be the combination of multiple queries and their processing).
- Those visualizations are updated over time. What happens if suddenly a contributor pushes many commits in a short time interval? How would you address this particular issue?
GitHub's website features a visualization called the GitHub Network Graph Visualizer (example). It aims to visualize the often complex interactions within the repositories, showing the relationships between individual commits and branches. The graph construction is a bit complex, but can be simplified as follows:
- Each node is a commit, connected to its successor, using an arrow link.
- Commits are spatially grouped by users, and colored by branch.
- Links connecting the nodes try to prevent crossing as best as they can.
Here is a screen capture of the Network Graph Visualizer:
Looking at the network graph, answer the same questions as above, plus:
- What is the role of interaction for this visualization? Would a static graph have been sufficient?
- What happens if many new developers suddenly join the project and push commits for the first time? How would you preserve the graph's readability in such a situation?
Use the file simple_graph.html as a template for your code. After completing the problems, put your resulting code in a new file called simple_graph_answer.html.
We are now going to use the GitHub API to create a graph data structure of commit data and produce a similar visualization to the Network Graph Visualizer.
We have provided you with several basic graph visualization functions in the file simple_graph.html. One of them is a basic force-directed layout, generated from a graph constructed with random nodes and link connections. Such a layout is essentially a physics simulation, in which node positions are determined by iteratively trying to satisfy specified parameters (e.g., gravity, friction, etc.). See some examples.
We provided other layout and visual encoding functions (see this gist):
- Layouts:
- Random (each point has a random position in a bounded area)
- Self-organized using a force-directed layout
- Radial from a pie layout, sorted by category
- Linear (projection of nodes on an axis)
- Retinal variables
- Filling color of the nodes
- Size of the nodes
Let's re-construct the GitHub Network Graph Visualizer using simple_graph.html as follows:
- Choose any repository, as long as it meets the following three conditions: (1) it contains commits from at least two users and has at least two branches, (2) it contains at least 30 commits from different users and branches, and (3) it is public. Because querying the API may result in partial data (e.g., not all commits are visible), it is perfectly fine if there are missing nodes and connections. In general, the diversity of the commits is more important than the quantity. For this problem you may use only a subset of the commits, between 30 and 100 (but you can select more if you want).
- Use d3.json to fetch commit data (example, see documentation) to use as the input dataset for the graph. You can find all the branches of a repository, or you can directly query the commits by branch using the URL, as demonstrated in the linked examples. If you experience delays or hit call limits, download the commits and use them as an offline JSON file. You may have to call the API several times, so make sure that you create the visualization only when all the data has been retrieved. Here is a discussion on how to do that. For Problem 2 you do not need to include forks, but you may use them if you want to.
- Populate the provided graph data structure ({nodes:[], links:[]}) with commit data. Each node represents a commit with a unique id, and each link points to the parents, which is an attribute of the commit. Note that a commit may have no parents or multiple parents (e.g., if the commit is a merge of multiple branches). Add all metadata of a commit to the node. Be careful: some attributes are keywords reserved for the layout function (e.g., x and y) and you can't use them as variable names for metadata.
- The GitHub Network Graph Visualizer layout is a linear layout. Extend the provided linear layout with two scales to display commits on the axis. The first scale should be index-based and use equal intervals between nodes. The second scale should use time scales, where the position reflects absolute time. Add radio buttons with labels to switch from one to the other.
- Add SVG markers to show the link direction (example) and add labels for branches.
- Color the nodes by branch, and when hovering over nodes, emphasize the current branch and show some details about the node itself (e.g., a tooltip with the node characteristics).
- As you may have noticed, links connecting commits are not straight lines in the GitHub network graph. Because SVG <line> elements are not a flexible way to draw curves, switch to using <path> elements. You'll be able to add control points to shape the curve as you wish and choose the right interpolation function. Here is an example of the use of control points:
line([{x:d.source.x, y:d.source.y}, {x:d.source.x, y:d.source.y+offset_y}, {x:d.target.x, y:d.target.y+offset_y}, {x:d.target.x, y:d.target.y}])
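To make the role of the control points concrete, here is a minimal sketch in plain JavaScript of what a line generator with straight (linear) segments produces for them. The functions controlPoints and toPathData are hypothetical stand-ins. In your code you would more likely configure D3's own line generator and pick an interpolation mode, which this sketch does not reproduce.

```javascript
// Hypothetical helper: the four control points from the example above.
// `link.source` and `link.target` are assumed to carry x and y positions.
function controlPoints(link, offsetY) {
  return [
    { x: link.source.x, y: link.source.y },
    { x: link.source.x, y: link.source.y + offsetY },
    { x: link.target.x, y: link.target.y + offsetY },
    { x: link.target.x, y: link.target.y }
  ];
}

// Joins the points with straight segments into an SVG path "d" string:
// "M" moves to the first point, "L" draws a line to each following point.
function toPathData(points) {
  return points.map(function (p, i) {
    return (i === 0 ? "M" : "L") + p.x + "," + p.y;
  }).join(" ");
}

// Example with made-up coordinates
const link = { source: { x: 10, y: 50 }, target: { x: 90, y: 50 } };
console.log(toPathData(controlPoints(link, 20)));
// prints "M10,50 L10,70 L90,70 L90,50"
```

The resulting string can be assigned to the d attribute of a <path> element. A smoother interpolation would replace the straight segments with curves through the same control points.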
The following three screenshots show the results you may expect.
Regular index-based interval:
Time scale with hovered node (note: the mouse pointer is not visible, but it hovers over the only node for user D):
Control points to improve link readability:
Naturally, you are free to change the aesthetics and use alternative node color, links and text styling.
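The graph-population and scale steps listed above can be sketched as follows. This is only an illustration under stated assumptions: each commit object is assumed to have the fields sha, commit.author.name, commit.author.date, commit.message, and parents (a list of objects with a sha). The names buildGraph, indexPosition, and timePosition, and the node field names, are hypothetical. A commit reachable from several branches is simply assigned to the first branch processed, which is a simplification.

```javascript
// Assumption: `commitsByBranch` maps a branch name to the array of commits
// downloaded for that branch.
function buildGraph(commitsByBranch) {
  const graph = { nodes: [], links: [] };
  const indexBySha = new Map();

  Object.keys(commitsByBranch).forEach(function (branch) {
    commitsByBranch[branch].forEach(function (c) {
      if (indexBySha.has(c.sha)) return; // already added via another branch
      indexBySha.set(c.sha, graph.nodes.length);
      graph.nodes.push({
        sha: c.sha, // unique id of the commit
        branch: branch,
        author: c.commit.author.name,
        date: new Date(c.commit.author.date),
        message: c.commit.message,
        parentShas: c.parents.map(function (p) { return p.sha; })
      });
    });
  });

  // Each link points from a commit to one of its parents (node indices).
  graph.nodes.forEach(function (node, i) {
    node.parentShas.forEach(function (sha) {
      // parents outside the downloaded subset are skipped
      if (indexBySha.has(sha)) {
        graph.links.push({ source: i, target: indexBySha.get(sha) });
      }
    });
  });
  return graph;
}

// Index-based scale: equal intervals between nodes along an axis of `width`.
function indexPosition(i, count, width) {
  return count > 1 ? (i / (count - 1)) * width : width / 2;
}

// Time-based scale: position proportional to absolute time.
function timePosition(date, minDate, maxDate, width) {
  const span = maxDate - minDate;
  return span > 0 ? ((date - minDate) / span) * width : width / 2;
}
```

Note that the node metadata avoids the names x and y, which stay reserved for the layout. In the actual page, D3's linear and time scales would normally replace the two position functions.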
Answer the questions in a file problem_3_answers.md, and include your sketch as a PDF file problem_3_sketch.pdf.
Given your previous design critiques, your experience with the previous graph visualization implementation and the reading of the article cited below (Lee et al., 2006), answer the following questions:
- Which graph-related tasks does an ideal GitHub Network Graph need to address?
- Get back to the GitHub network visualization you implemented and test it with the following projects on GitHub: D3, jQuery and Bootstrap. There's a lot more data, but the interaction patterns of users are also very different. What do you notice about the three repositories?
- How does this impact your graph?
- How would you improve your visualization to address issues with the larger and more complex data?
Lee, B., Plaisant, C., Parr, C. S., Fekete, J. D., & Henry, N. (2006, May). Task taxonomy for graph visualization. In Proceedings of the 2006 AVI workshop on Beyond time and errors: novel evaluation methods for information visualization (pp. 1-5). ACM. (pdf)
Drawing on your observations from graphing different types of GitHub networks in the previous part, and from the reading, sketch an alternate design for the GitHub Network graph. Your work should rely on the same source data (i.e., commit history), but be creative: the focus of your visualization is up to you. You are, for example, welcome to aggregate nodes or use additional GitHub data.
Attach to this homework a picture/scan of your sketch, as well as a paragraph explaining the design decisions you made and how they address the limits of the GitHub Network Graph Visualizer you previously identified.
Answer the questions in a file problem_4_answers.md and provide your implementation in problem_4.html. You may include external JavaScript or stylesheets.
You will now implement the sketch you've previously designed. Even though the sketching may have been done as a group, you have to implement it yourself.
To get full credit for this problem you need to address the following:
- Implement the sketch as an interactive visualization with D3.
- You are not expected to provide an exhaustive collection of features. Instead, focus on a few carefully designed features. We expect something roughly comparable in complexity to these two examples.
- Note that you are free to use D3 layouts. You are also free to use any D3 example from the gallery as a starting point, but you should significantly improve the example.
- Briefly explain your technical choices. Your final visualization may differ significantly from your original sketch or only implement a subset of it, but ensure you document the tradeoffs you faced and the reasoning behind your choices.
Also, depending on your implementation, you may have to change the graph data structure to a nested one, such as:
{
"name": "parent",
"children": [
{
"name": "sub-parent",
"children": [
{
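One possible way to obtain a name/children structure like the one above from a flat list of commit nodes is to group the commits, for instance by author. The sketch below is only an illustration: the function nestByAuthor and the node fields author and sha are hypothetical, and grouping by author is just one of many possible hierarchies.

```javascript
// Hypothetical helper: groups flat commit nodes by author into a
// {name, children} hierarchy (root -> authors -> commits).
function nestByAuthor(nodes, rootName) {
  const groups = new Map(); // author name -> list of leaf nodes
  nodes.forEach(function (node) {
    if (!groups.has(node.author)) groups.set(node.author, []);
    groups.get(node.author).push({ name: node.sha });
  });
  return {
    name: rootName,
    // a Map iterates as [key, value] pairs in insertion order
    children: Array.from(groups, function (entry) {
      return { name: entry[0], children: entry[1] };
    })
  };
}

// Example with made-up commits
const tree = nestByAuthor(
  [
    { sha: "a1", author: "ann" },
    { sha: "b2", author: "bob" },
    { sha: "c3", author: "ann" }
  ],
  "repository"
);
console.log(JSON.stringify(tree, null, 2));
```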
You will be credited 0.5 bonus points for exceptionally original or novel graph designs and for thorough and clear code.