In this lab you'll practice your clustering and visualization skills by investigating Stack Overflow! Specifically, the dataset you'll be investigating covers tags on Stack Overflow. With it, you should be able to explore some of the related technologies currently in use by developers.
In this lab you will:
- Make visualizations of clusters and gain insights about how the clusters have formed
Load the data from the 'stack-overflow-tag-network/stack_network_links.csv' file. For now, simply load the file as a standard pandas DataFrame.
# Your code here
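As a rough illustration of the loading step, the sketch below reads a tiny in-memory CSV instead of the real file. The column names (source, target, value) and the rows are assumptions made for this example, so check them against the actual file before relying on them.

```python
import io

import pandas as pd

# Toy stand-in for stack_network_links.csv. The column names and rows
# are assumptions for illustration, not values copied from the real file.
toy_csv = io.StringIO(
    "source,target,value\n"
    "python,pandas,30.0\n"
    "python,django,45.5\n"
    "html,css,120.0\n"
)

# pd.read_csv accepts a file path or any file-like object.
links = pd.read_csv(toy_csv)
print(links.head())
print(links.shape)
```

For the lab itself, the same call would take the path 'stack-overflow-tag-network/stack_network_links.csv' in place of toy_csv.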
Transform the dataset from a pandas DataFrame into a NetworkX graph.
# Your code here
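One way to make this conversion is nx.from_pandas_edgelist, sketched here on a toy edge list. The column names source, target, and value are assumptions for this example and may need adjusting to match the real DataFrame.

```python
import networkx as nx
import pandas as pd

# Toy edge list; the column names are assumed, not taken from the lab data.
links = pd.DataFrame(
    {
        "source": ["python", "python", "html"],
        "target": ["pandas", "django", "css"],
        "value": [30.0, 45.5, 120.0],
    }
)

# Each row becomes an undirected edge; "value" is kept as an edge attribute.
G = nx.from_pandas_edgelist(links, source="source", target="target", edge_attr="value")

print(G.number_of_nodes(), G.number_of_edges())
print(G["python"]["pandas"]["value"])
```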
Next, create an initial visualization of the network.
# Your code here
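A first look at a network can be as simple as a single nx.draw() call. The sketch below uses a small made-up graph rather than the lab data, and draws onto an explicit Matplotlib axes so the figure size is easy to control.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Small made-up graph standing in for the tag network.
G = nx.Graph(
    [("python", "pandas"), ("python", "django"), ("html", "css"), ("django", "html")]
)

fig, ax = plt.subplots(figsize=(6, 6))

# With no pos argument, nx.draw() computes a spring layout itself,
# so node positions change from run to run.
nx.draw(G, ax=ax, with_labels=True)
```

In a notebook the figure is displayed when the cell finishes; in a script, a call to plt.show() would be needed.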
Begin to explore the impact of using different values of k in k-clique clustering.
# Your code here
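To see how k changes the result, it can help to try k-clique clustering on a graph small enough to reason about by hand. The sketch below assumes a toy graph made of two 4-node cliques joined by one bridge edge, and uses NetworkX's k_clique_communities.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph (an assumption for illustration): two 4-node cliques
# connected by a single bridge edge d-e.
G = nx.compose(
    nx.complete_graph(["a", "b", "c", "d"]),
    nx.complete_graph(["e", "f", "g", "h"]),
)
G.add_edge("d", "e")

counts = {}
for k in range(2, 6):
    # k_clique_communities yields frozensets of nodes; k must be at least 2.
    communities = [sorted(c) for c in k_clique_communities(G, k)]
    counts[k] = len(communities)
    print(f"k={k}: {len(communities)} communities -> {communities}")
```

Small k merges everything that is connected at all, while a k larger than the biggest clique finds no communities. Nodes that belong to no k-clique are left out of the result entirely.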
Level-Up: Experiment with different nx.draw() settings. See the nx.draw() documentation for a full list. Some recommended settings that you've previewed include the position parameter pos, with_labels=True, node_color, alpha, node_size, font_weight, and font_size. Note that nx.spring_layout(G) is particularly useful for laying out a well-formed network. With this, you can pass in parameters for the relative edge distance via k and set a random seed via seed to have reproducible results, as in nx.spring_layout(G, k=2.66, seed=10). For more details, see the spring_layout documentation.
# Your code here
# Your code here
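The drawing settings mentioned above can be combined in one call. This is a minimal sketch on a toy graph, not the lab data. The palette, node size, and the choice to color a node by the first community that contains it are all assumptions made for the example.

```python
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: two 4-node cliques joined by one bridge edge.
G = nx.compose(
    nx.complete_graph(["a", "b", "c", "d"]),
    nx.complete_graph(["e", "f", "g", "h"]),
)
G.add_edge("d", "e")

communities = list(k_clique_communities(G, 3))
palette = ["tab:blue", "tab:orange", "tab:green", "tab:red"]

# k-clique communities can overlap and can leave nodes out, so color each
# node by the first community that contains it and fall back to gray.
node_colors = []
for node in G.nodes():
    idx = next((i for i, c in enumerate(communities) if node in c), None)
    node_colors.append("lightgray" if idx is None else palette[idx % len(palette)])

# A fixed seed makes the layout reproducible; k controls node spacing.
pos = nx.spring_layout(G, k=2.66, seed=10)

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw(
    G,
    pos=pos,
    ax=ax,
    with_labels=True,
    node_color=node_colors,
    alpha=0.8,
    node_size=600,
    font_weight="bold",
    font_size=10,
)
```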
Recluster the network using the Girvan-Newman algorithm. Remember that this will give you a list of cluster lists corresponding to the clusters that form from removing the top edges, typically those with the highest edge betweenness.
# Your code here
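For reference, NetworkX ships a girvan_newman generator that yields one partition per split. The sketch below runs it on a toy graph (two 4-node cliques joined by a bridge, an assumption for illustration) to show the shape of the output.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Toy graph: two 4-node cliques joined by one bridge edge.
G = nx.compose(
    nx.complete_graph(["a", "b", "c", "d"]),
    nx.complete_graph(["e", "f", "g", "h"]),
)
G.add_edge("d", "e")

# Each item is a tuple of node sets. By default the edge with the highest
# betweenness is removed repeatedly until the graph splits once more.
partitions = list(girvan_newman(G))

for step, partition in enumerate(partitions, start=1):
    print(f"split {step}: {len(partition)} clusters -> {[sorted(c) for c in partition]}")
```

The first partition separates the two cliques at the bridge, and each later partition has exactly one more cluster than the one before, ending with every node on its own.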
Now that you have an idea of how splintered the network becomes based on the number of edges removed, you'll want to examine some of the subsequent groups that gradually break apart. Since the network is quite complex to start with, using subplots is not a great option; each subplot would be too small to read accurately. Create a visualization function plot_girvan_newman(G, clusters) which takes a NetworkX graph object as well as one of the clusters from the output of the Girvan-Newman algorithm above and plots the network with a unique color for each cluster.
Level-Up: Experiment with different nx.draw() settings. See the nx.draw() documentation for a full list. Some recommended settings that you've previewed include the position parameter pos, with_labels=True, node_color, alpha, node_size, font_weight, and font_size. Note that nx.spring_layout(G) is particularly useful for laying out a well-formed network. With this, you can pass in parameters for the relative edge distance via k and set a random seed via seed to have reproducible results, as in nx.spring_layout(G, k=2.66, seed=10). For more details, see the spring_layout documentation.
def plot_girvan_newman(G, clusters):
    # Your code here
    pass
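One possible shape for such a function is sketched below. It is only an illustration under stated assumptions: the optional seed parameter, the tab20 colormap, and the styling values are choices made for this example, and with more than 20 clusters the colors repeat.

```python
import matplotlib.pyplot as plt
import networkx as nx


def plot_girvan_newman(G, clusters, seed=10):
    """Draw G with one color per cluster and return the Matplotlib axes.

    `clusters` is an iterable of node collections, for example one
    partition produced by Girvan-Newman. `seed` is an extra, hypothetical
    parameter added here so the layout is reproducible.
    """
    cmap = plt.get_cmap("tab20")  # 20 distinct colors; cycles beyond that
    color_of = {}
    for i, cluster in enumerate(clusters):
        for node in cluster:
            color_of[node] = cmap(i % cmap.N)

    # Colors must follow the node order that nx.draw() uses.
    node_colors = [color_of[node] for node in G.nodes()]
    pos = nx.spring_layout(G, seed=seed)

    fig, ax = plt.subplots(figsize=(8, 8))
    nx.draw(G, pos=pos, ax=ax, node_color=node_colors,
            with_labels=True, alpha=0.8, font_size=8)
    ax.set_title(f"{len(clusters)} clusters")
    return ax


# Usage on a toy graph with a hand-written two-cluster partition.
G = nx.compose(
    nx.complete_graph(["a", "b", "c", "d"]),
    nx.complete_graph(["e", "f", "g", "h"]),
)
G.add_edge("d", "e")
ax = plot_girvan_newman(G, [{"a", "b", "c", "d"}, {"e", "f", "g", "h"}])
```

Because the layout seed is fixed, calling the function on successive partitions keeps every node in the same place, which makes it easier to see where a cluster breaks apart.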
Use your function to visualize the various clusters that form throughout the Girvan-Newman algorithm as you remove more and more edges from the network.
# Your code here
Create a visual to help yourself understand the rate at which clusters of this network formed versus the number of edges removed.
Level-Up: Based on your graphic, what would you predict is an appropriate number of clusters?
# Your code here
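To relate cluster counts to the exact number of edges removed, one option is to run the edge-removal loop by hand and record the number of connected components after each removal. The sketch below does this on a toy graph. The helper name components_vs_edges_removed is hypothetical, and recomputing betweenness after every removal is slow on large graphs.

```python
import matplotlib.pyplot as plt
import networkx as nx


def components_vs_edges_removed(G):
    """Return counts where counts[i] is the number of connected
    components after removing the i highest-betweenness edges, one at a
    time, recomputing betweenness after each removal."""
    H = G.copy()  # leave the caller's graph untouched
    counts = [nx.number_connected_components(H)]
    while H.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(H)
        edge = max(betweenness, key=betweenness.get)
        H.remove_edge(*edge)
        counts.append(nx.number_connected_components(H))
    return counts


# Toy graph: two 4-node cliques joined by one bridge edge.
G = nx.compose(
    nx.complete_graph(["a", "b", "c", "d"]),
    nx.complete_graph(["e", "f", "g", "h"]),
)
G.add_edge("d", "e")

counts = components_vs_edges_removed(G)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(range(len(counts)), counts, marker="o")
ax.set_xlabel("Number of edges removed")
ax.set_ylabel("Number of clusters")
```

Long flat stretches in such a curve mark partitions that survive many edge removals, which is one informal cue for a reasonable number of clusters.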
Now that you have generated various clusters within the overall network, which do you think is the most appropriate or informative?
# Your code/response here
In this lab you practiced using the k-clique and Girvan-Newman methods for clustering. Additionally, you may have gotten a better sense of some of the current technological landscape. As you can start to see, network clustering provides you with powerful tools to further subset large networks into smaller constituencies, allowing you to dig deeper into their particular characteristics.