/dsc-network-clustering-lab


Network Clustering - Lab

Introduction

In this lab, you'll practice your clustering and visualization skills by investigating Stack Overflow. Specifically, the dataset you'll be working with describes tags on Stack Overflow. With it, you should be able to explore groups of related technologies currently in use by developers.

Objectives

In this lab you will:

  • Make visualizations of clusters and gain insights about how the clusters have formed

Load the Dataset

Load the data from the 'stack-overflow-tag-network/stack_network_links.csv' file. For now, simply load the file as a standard pandas DataFrame.
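As a minimal sketch of this step, the snippet below reads a small in-memory stand-in for the CSV so that it runs without the data file. The column names source, target and value, and the toy rows, are assumptions made for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Toy stand-in for the real file. The column names and values here are
# assumptions for illustration only, not rows from the actual dataset.
sample_csv = io.StringIO(
    "source,target,value\n"
    "html,css,1.0\n"
    "css,javascript,2.0\n"
    "python,django,3.0\n"
)

# In the lab you would pass the file path instead of the in-memory buffer:
# df = pd.read_csv('stack-overflow-tag-network/stack_network_links.csv')
df = pd.read_csv(sample_csv)
print(df.head())
```

Calling df.head() and df.info() on the real file is a quick way to confirm which columns hold the two tag names and the edge weight before building the graph.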

# Your code here

Transform the Dataset into a Network Graph using NetworkX

Transform the dataset from a Pandas DataFrame into a NetworkX graph.
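One possible way to do this is nx.from_pandas_edgelist(), which builds a graph from a table of edges. The sketch below uses a toy DataFrame; the column names source, target and value are assumptions, so substitute whatever your loaded DataFrame actually contains.

```python
import networkx as nx
import pandas as pd

# Toy edge table. Column names and values are illustrative assumptions.
links = pd.DataFrame(
    {
        "source": ["html", "css", "python"],
        "target": ["css", "javascript", "django"],
        "value": [1.0, 2.0, 3.0],
    }
)

# Each row becomes one undirected edge; the "value" column is stored
# as an edge attribute of the same name.
G_links = nx.from_pandas_edgelist(
    links, source="source", target="target", edge_attr="value"
)

print(G_links.number_of_nodes(), "nodes,", G_links.number_of_edges(), "edges")
```

Storing the weight column as an edge attribute keeps it available later, for example to scale edge widths in a drawing.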

# Your code here

Create an Initial Graph Visualization

Next, create an initial visualization of the network.

# Your code here

Perform an Initial Clustering using k-clique Clustering

Begin to explore the impact of using different values of k.
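To get a feel for how k changes the result, the sketch below runs NetworkX's k_clique_communities() on a small toy graph: two 4-node cliques joined by a single bridge edge. The toy graph is an assumption chosen for illustration; in the lab you would pass your tag network instead.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: nodes 0-3 form one clique, nodes 4-7 another, and a single
# bridge edge (3, 4) connects them.
G_toy = nx.barbell_graph(4, 0)

# Count the communities found for several values of k.
communities_per_k = {}
for k in range(2, 6):
    communities = [sorted(c) for c in k_clique_communities(G_toy, k)]
    communities_per_k[k] = communities
    print(f"k={k}: {len(communities)} communities -> {communities}")
```

On this toy graph, k=2 merges everything into one community because any two edges sharing a node are adjacent, k=3 and k=4 separate the two cliques, and k=5 finds nothing because no clique is large enough. Expect a similar pattern on the real network: larger k gives fewer, tighter communities and leaves more nodes unassigned.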

# Your code here

Visualize the Clusters Produced by the k-clique Algorithm

Level-Up: Experiment with different nx.draw() settings. See the nx.draw() documentation for a full list. Some recommended settings that you've previewed include the position parameter pos, with_labels=True, node_color, alpha, node_size, font_weight and font_size. Note that nx.spring_layout(G) is particularly useful for laying out a well-formed network. With it, you can set the preferred distance between nodes via k and pass a seed to get reproducible results, as in nx.spring_layout(G, k=2.66, seed=10). For more details, see the spring_layout documentation.
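The settings above can be combined roughly as follows. This is only a sketch on a toy graph, not the lab's reference solution: the color palette, the gray fallback for nodes outside every community, and the non-interactive Agg backend are all assumptions made so the snippet runs anywhere.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: two 4-node cliques joined by one bridge edge.
G_demo = nx.barbell_graph(4, 0)
demo_communities = list(k_clique_communities(G_demo, 3))

# Assign one color per community. k-clique communities may overlap or
# leave nodes out, so the first community wins and leftovers turn gray.
palette = ["tab:blue", "tab:orange", "tab:green", "tab:red"]
node_to_color = {}
for i, community in enumerate(demo_communities):
    for node in community:
        node_to_color.setdefault(node, palette[i % len(palette)])
colors = [node_to_color.get(node, "lightgray") for node in G_demo.nodes()]

# The same seed gives the same layout on every run.
pos = nx.spring_layout(G_demo, k=2.66, seed=10)
pos_again = nx.spring_layout(G_demo, k=2.66, seed=10)

fig = plt.figure(figsize=(6, 4))
nx.draw(
    G_demo,
    pos=pos,
    with_labels=True,
    node_color=colors,
    alpha=0.8,
    node_size=600,
    font_weight="bold",
    font_size=10,
)
plt.close(fig)  # in a notebook, call plt.show() instead
```

The colors list must follow the order of G_demo.nodes(), since nx.draw() matches colors to nodes by position.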

# Your code here
# Your code here

Perform an Alternative Clustering Using the Girvan-Newman Algorithm

Recluster the network using the Girvan-Newman algorithm. Remember that this will give you a list of cluster lists corresponding to the clusters that form from removing the top $n$ edges according to some metric, typically edge betweenness.
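As a rough illustration, NetworkX ships a girvan_newman() generator that could be used here. It yields one partition per level, each with one more community than the last, rather than one per removed edge. The toy graph and the choice to keep only the first three levels are assumptions for the sketch.

```python
from itertools import islice

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Toy graph: two 4-node cliques joined by one bridge edge.
G_gn = nx.barbell_graph(4, 0)

# girvan_newman() is lazy, so take only the first few levels.
# Each level is converted to a list of sorted node lists for readability.
levels = [
    [sorted(community) for community in partition]
    for partition in islice(girvan_newman(G_gn), 3)
]

for depth, partition in enumerate(levels, start=1):
    print(f"level {depth}: {len(partition)} clusters -> {partition}")
```

On a large network each level recomputes edge betweenness, so limiting the number of levels with islice keeps the run time manageable.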

# Your code here

Create a Visualization Wrapper

Now that you have an idea of how splintered the network becomes based on the number of edges removed, you'll want to examine some of the subsequent groups that gradually break apart. Since the network is quite complex to start with, using subplots is not a great option; each subplot would be too small to read accurately. Create a visualization function plot_girvan_newman(G, clusters) which takes a NetworkX graph object as well as one of the cluster lists from the output of the Girvan-Newman algorithm above and plots the network with a unique color for each cluster.

Level-Up: Experiment with different nx.draw() settings. See the nx.draw() documentation for a full list. Some recommended settings that you've previewed include the position parameter pos, with_labels=True, node_color, alpha, node_size, font_weight and font_size. Note that nx.spring_layout(G) is particularly useful for laying out a well-formed network. With it, you can set the preferred distance between nodes via k and pass a seed to get reproducible results, as in nx.spring_layout(G, k=2.66, seed=10). For more details, see the spring_layout documentation.

def plot_girvan_newman(G, clusters):
    # Your code here 
    pass

Visualize the Various Clusters that Form Throughout the Girvan-Newman Algorithm

Use your function to visualize the various clusters that form throughout the Girvan-Newman algorithm as you remove more and more edges from the network.

# Your code here

Cluster Decay Rate

Create a visual to help yourself understand how the number of clusters in this network grows as more edges are removed.
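One way to get the numbers behind such a plot is to replay the edge-removal process by hand and count connected components after each removal. The helper name clusters_vs_edges_removed is hypothetical, and the sketch assumes edge betweenness as the removal metric; it is an illustration on a toy graph, not the lab's reference solution.

```python
import networkx as nx


def clusters_vs_edges_removed(G):
    """Return cluster counts after removing 0, 1, 2, ... edges.

    Hypothetical helper: repeatedly removes the edge with the highest
    betweenness and records the number of connected components.
    """
    H = G.copy()  # work on a copy so the original graph is untouched
    counts = [nx.number_connected_components(H)]
    while H.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(H)
        edge = max(betweenness, key=betweenness.get)
        H.remove_edge(*edge)
        counts.append(nx.number_connected_components(H))
    return counts


# Toy graph: two 4-node cliques joined by one bridge edge (13 edges).
G_decay = nx.barbell_graph(4, 0)
cluster_counts = clusters_vs_edges_removed(G_decay)
print(cluster_counts)
```

Plotting cluster_counts against its index, for example with plt.plot(cluster_counts), gives clusters on the y-axis and edges removed on the x-axis. Long flat stretches mark partitions that survive many removals. Recomputing betweenness after every removal is slow on large graphs, so you may want to stop the loop early on the real network.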

Level-Up: Based on your graphic, what would you predict is an appropriate number of clusters?

# Your code here

Choose a Clustering

Now that you have generated various clusters within the overall network, which do you think is the most appropriate or informative?

# Your code/response here

Summary

In this lab, you practiced using the k-clique and Girvan-Newman methods for clustering. You may also have gotten a better sense of the current technological landscape. As you can start to see, network clustering gives you powerful tools for splitting large networks into smaller constituencies, allowing you to dig deeper into their particular characteristics.