Crypto_Clustering: A Jupyter Notebook repository from StrawhatRA

Clustering Crypto

Cryptocurrencies coins by Worldspectrum | Free License

Background

You are a Senior Manager on the Advisory Services team at a Big Four firm. One of your most important clients, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio to its customers; however, they are lost in the immense universe of cryptocurrencies. They ask you to help them make sense of it all by generating a report of what cryptocurrencies are available on the trading market and how they can be grouped using classification.

In this homework assignment, you will put your new unsupervised learning and Amazon SageMaker skills into action by clustering cryptocurrencies and creating plots to present your results.

You are asked to accomplish the following main tasks:

Data Preprocessing: Prepare data for dimension reduction with PCA and clustering using K-Means.
Reducing Data Dimensions Using PCA: Reduce data dimension using the PCA algorithm from sklearn.
Clustering Cryptocurrencies Using K-Means: Predict clusters for the cryptocurrency data using the KMeans algorithm from sklearn.
Visualizing Results: Create some plots and data tables to present your results.
Optional Challenge: Deploy your notebook to Amazon SageMaker.

Files

crypto_clustering.ipynb

Instructions

Data Preprocessing

In this section, you will load the information about cryptocurrencies and perform data preprocessing tasks. You can choose one of the following methods to load the data:

Using the provided CSV file, create a Path object and read the file data directly into a DataFrame named crypto_df using pd.read_csv().
Using the requests library, retrieve the necessary data from the following CryptoCompare API endpoint: https://min-api.cryptocompare.com/data/all/coinlist. HINT: You will need to use the 'Data' key from the JSON response, then transpose the DataFrame. Name your DataFrame crypto_df.
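The two loading options can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the helper names load_from_csv and coinlist_to_frame are hypothetical, the CSV is assumed to keep the coin ticker in its first column, and a small hand-made payload stands in for the live API response so the example runs offline.

```python
from pathlib import Path

import pandas as pd


def load_from_csv(csv_path):
    """Read the provided CSV into a DataFrame.

    Assumption: the first CSV column holds the coin ticker and is used as the index.
    """
    return pd.read_csv(Path(csv_path), index_col=0)


def coinlist_to_frame(payload):
    """Build a DataFrame from a coin-list style JSON payload.

    The coins sit under the 'Data' key as {ticker: {field: value}}, so the
    frame is transposed to get one row per coin.
    """
    return pd.DataFrame(payload["Data"]).T


# Hypothetical two-coin payload shaped like the API response (not real API output).
sample_payload = {
    "Data": {
        "BTC": {"CoinName": "Bitcoin", "Algorithm": "SHA-256", "IsTrading": True},
        "ETH": {"CoinName": "Ethereum", "Algorithm": "Ethash", "IsTrading": True},
    }
}
crypto_df = coinlist_to_frame(sample_payload)
print(crypto_df)

# Live call (needs network access and the requests package):
# import requests
# url = "https://min-api.cryptocompare.com/data/all/coinlist"
# crypto_df = coinlist_to_frame(requests.get(url).json())
```

Without the transpose, each coin would be a column rather than a row, which is why the hint asks for it.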

With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.

Keep only the necessary columns: 'CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','TotalCoinSupply'
Keep only the cryptocurrencies that are trading.
Keep only the cryptocurrencies with a working algorithm.
Remove the IsTrading column.
Remove all cryptocurrencies with at least one null value.
Remove all cryptocurrencies that have no coins mined.
Drop all rows where there are 'N/A' text values.
Store the names of all cryptocurrencies in a DataFrame named coins_name, using crypto_df.index as the index for this new DataFrame.
Remove the CoinName column.
Create dummy variables for all the text features, and store the resulting data in a DataFrame named X.
Use the StandardScaler from sklearn to standardize all the data of the X DataFrame. Remember, this is important prior to using PCA and K-Means algorithms.
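The preprocessing steps above might look like the following sketch. The six sample rows are hypothetical and exist only to exercise each filter. Two choices here are assumptions rather than part of the instructions: "N/A" is treated as the marker for a missing algorithm, and TotalCoinSupply is converted to a number in case it loads as text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; real data comes from the CSV file or the API.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"],
        "Algorithm": ["Scrypt", "X11", "N/A", "SHA-256", "Scrypt", "X11"],
        "IsTrading": [True, True, True, False, True, True],
        "ProofType": ["PoW", "PoS", "PoW", "PoW", "PoW/PoS", "PoW"],
        "TotalCoinsMined": [4.2e7, 1.0e9, 5.0e6, 1.8e7, 0.0, None],
        "TotalCoinSupply": ["42000000", "2000000000", "21000000", "21000000", "1000", "500"],
        "ExtraColumn": [1, 2, 3, 4, 5, 6],
    },
    index=["ALP", "BET", "GAM", "DEL", "EPS", "ZET"],
)

# Keep only the necessary columns.
keep = ["CoinName", "Algorithm", "IsTrading", "ProofType", "TotalCoinsMined", "TotalCoinSupply"]
crypto_df = crypto_df[keep]

# Keep only coins that are trading; "== True" also works if the column has object dtype.
crypto_df = crypto_df[crypto_df["IsTrading"] == True]
# Assumption: a coin without a working algorithm is marked "N/A".
crypto_df = crypto_df[crypto_df["Algorithm"] != "N/A"]
crypto_df = crypto_df.drop(columns="IsTrading")

# Drop rows with nulls, rows with no coins mined, and rows holding "N/A" text.
crypto_df = crypto_df.dropna()
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]
crypto_df = crypto_df[~crypto_df.isin(["N/A"]).any(axis=1)]

# Assumption: TotalCoinSupply may load as text, so convert it to a number.
crypto_df = crypto_df.assign(TotalCoinSupply=pd.to_numeric(crypto_df["TotalCoinSupply"]))

# Store the coin names, then remove them from the feature data.
coins_name = crypto_df[["CoinName"]].copy()
crypto_df = crypto_df.drop(columns="CoinName")

# One dummy column per category of each text feature, then standardize.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

print(X.columns.tolist())
print(X_scaled.shape)
```

Standardizing matters because PCA and K-Means both work on distances and variances, so a column measured in billions of coins would otherwise dominate the dummy columns.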

Reducing Data Dimensions Using PCA

Use the PCA algorithm from sklearn to reduce the dimensions of the X DataFrame down to three principal components.

Once you have reduced the data dimensions, create a DataFrame named pcs_df with the column names "PC 1", "PC 2", and "PC 3"; use crypto_df.index as the index for this new DataFrame.
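A minimal sketch of the reduction step, assuming a standardized feature matrix is already available. Random numbers and made-up tickers stand in for the real X and crypto_df.index here, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the preprocessed X DataFrame (20 coins, 6 features).
rng = np.random.default_rng(0)
index = [f"COIN{i}" for i in range(20)]  # plays the role of crypto_df.index
X = pd.DataFrame(
    rng.normal(size=(20, 6)),
    index=index,
    columns=[f"feature_{i}" for i in range(6)],
)
X_scaled = StandardScaler().fit_transform(X)

# Reduce to three principal components.
pca = PCA(n_components=3)
pcs = pca.fit_transform(X_scaled)

# Wrap the components in a DataFrame that keeps the original index.
pcs_df = pd.DataFrame(pcs, columns=["PC 1", "PC 2", "PC 3"], index=index)
print(pcs_df.head())

# Share of the total variance retained by the three components.
explained = pca.explained_variance_ratio_.sum()
print(explained)
```

Checking explained_variance_ratio_ is optional, but it gives a quick sense of how much information the three components keep.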

You should have a DataFrame like the following:

Clustering Cryptocurrencies Using K-Means

In this section, you will use the KMeans algorithm from sklearn to cluster the cryptocurrencies using the PCA data.

Perform the following tasks:

Create an Elbow Curve to find the best value for k using the pcs_df DataFrame.
Once you define the best value for k, run the KMeans algorithm to predict the k clusters for the cryptocurrency data. Use pcs_df to run the KMeans algorithm.
Create a new DataFrame named clustered_df that includes the following columns: "Algorithm", "ProofType", "TotalCoinsMined", "TotalCoinSupply", "PC 1", "PC 2", "PC 3", "CoinName", "Class". You should maintain the index of the crypto_df DataFrame, as shown below.
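The three tasks above could be sketched as follows. All data here is hypothetical: three synthetic blobs stand in for pcs_df, and small placeholder frames stand in for crypto_df and coins_name. The range of k values (1 to 10) and the choice of best_k = 3 are assumptions that fit this toy data, not values taken from the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical PCA output: three well-separated blobs of ten coins each.
rng = np.random.default_rng(1)
index = [f"COIN{i}" for i in range(30)]
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(10, 3)) for c in centers])
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"], index=index)

# Elbow curve data: fit KMeans for each k and record the inertia.
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(pcs_df)
    inertia.append(model.inertia_)
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)  # plot k against inertia and look for the bend

# Assumption: the elbow for this toy data is at k = 3.
best_k = 3
model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
classes = model.fit_predict(pcs_df)

# Hypothetical stand-ins for the preprocessed crypto_df and coins_name.
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt"] * 30,
        "ProofType": ["PoW"] * 30,
        "TotalCoinsMined": rng.uniform(1e6, 1e9, size=30),
        "TotalCoinSupply": rng.uniform(1e6, 1e10, size=30),
    },
    index=index,
)
coins_name = pd.DataFrame({"CoinName": [f"Coin {i}" for i in range(30)]}, index=index)

# Join on the shared index, then append the predicted class.
clustered_df = pd.concat([crypto_df, pcs_df, coins_name], axis=1)
clustered_df["Class"] = classes
print(clustered_df.head())
```

Because all three frames share the same index, pd.concat with axis=1 lines the rows up without any explicit merge key.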

Visualizing Results

In this section, you will create some data visualizations to present the final results. Perform the following tasks:

Create a 3D scatter plot using Plotly Express to plot the clusters using the clustered_df DataFrame. You should include the following parameters on the plot: hover_name="CoinName" and hover_data=["Algorithm"] to show this additional info on each data point.
Use hvplot.table to create a data table with all the current tradable cryptocurrencies. The table should have the following columns: "CoinName", "Algorithm", "ProofType", "TotalCoinSupply", "TotalCoinsMined", "Class"
Create a scatter plot using hvplot.scatter, to present the clustered data about cryptocurrencies having x="TotalCoinsMined" and y="TotalCoinSupply" to contrast the number of available coins versus the total number of mined coins. Use the hover_cols=["CoinName"] parameter to include the cryptocurrency name on each data point.

Optional Challenge

For the challenge section, you have to upload your Jupyter notebook to Amazon SageMaker and deploy it.

The hvplot and Plotly Express libraries are not included in the built-in Anaconda environments, so for this challenge section, you should use the altair library instead.