A fictitious investment bank, Accountability Accounting, is exploring a new cryptocurrency investment portfolio for its clients. The purpose of this project is to report on the cryptocurrencies on the trading market and group them into a classification system for the new investment.
Because the cryptocurrencies carry no predefined labels, unsupervised learning is used: the data is preprocessed so it can be fit to machine learning models, a clustering algorithm groups the cryptocurrencies, and the results are visualized.
Import Modules
# Initial imports
import pandas as pd
import hvplot.pandas
from pathlib import Path
import plotly.express as px
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
The original dataset includes the column "IsTrading", which tells us whether the cryptocurrency is currently being traded.
We want to include only cryptocurrencies that are currently active, so we keep only the records where "IsTrading" is True.
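A minimal sketch of this step, assuming the raw CSV is named crypto_data.csv and is loaded into a DataFrame called crypto_df (both names are assumptions for illustration):

```python
# Load the raw data (file and DataFrame names are assumptions)
crypto_df = pd.read_csv(Path("crypto_data.csv"), index_col=0)

# Keep only the cryptocurrencies that are actively traded
crypto_df = crypto_df[crypto_df["IsTrading"] == True]
```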
Next, after removing any rows with null values, we want to keep only the records of cryptocurrencies whose coins have been mined. Here I used the .loc method to keep the records where the "TotalCoinsMined" column value is greater than 0.
no_null_df = crypto_df.dropna()
no_null_df = no_null_df.loc[no_null_df['TotalCoinsMined'] > 0]
Then, after dropping the "IsTrading" column, we create a new dataframe that holds only "CoinName", which we will come back to later. Next, we drop "CoinName" from the original dataframe since it will not be used in the clustering algorithm.
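A sketch of these two steps, reusing the assumed DataFrame names from above:

```python
# Drop IsTrading now that the filtering is done
crypto_df = no_null_df.drop(columns=["IsTrading"])

# Keep CoinName in its own DataFrame so the names can be attached
# to the cluster results later (coin_names_df is an assumed name)
coin_names_df = crypto_df[["CoinName"]]

# Remove CoinName from the feature set used for clustering
crypto_df = crypto_df.drop(columns=["CoinName"])
```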
We are left with four columns, two of which hold string-type data: "Algorithm" and "ProofType". For these two columns, we use the pandas get_dummies() function to convert the values to numeric indicator variables.
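For example (the name X for the encoded features is an assumption):

```python
# One-hot encode the two string columns into numeric indicator columns
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"])
```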
from sklearn.preprocessing import StandardScaler
Now that all the columns hold numeric values, the final preprocessing step is to standardize the data with StandardScaler from scikit-learn so it can be fit to machine learning models. StandardScaler rescales each column to zero mean and unit variance, so the values look more or less like standard normally distributed data.
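A minimal sketch, continuing with the encoded DataFrame X from the previous step:

```python
# Rescale every column to zero mean and unit variance
crypto_scaled = StandardScaler().fit_transform(X)
```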
from sklearn.decomposition import PCA
Instantiate PCA
pca = PCA(n_components = 3)
Previously, get_dummies() took the 2 string-type columns and created 96 columns of numeric indicator variables. The PCA process now reduces the resulting 98 columns down to 3 principal components: the three main dimensions of variation that retain most of the information in the original dataset. Reducing 98 variables to 3 components speeds up the machine learning algorithms, which slow down when the number of input features is too high.
Run the model on the data:
crypto_pca = pca.fit_transform(crypto_scaled)
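As an optional sanity check, explained_variance_ratio_ reports how much of the original variance the three components retain:

```python
# Fraction of the total variance captured by the 3 components
print(pca.explained_variance_ratio_.sum())
```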
Create a new dataframe with the 3 principal components.
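A sketch of that step (the pcs_df name and the component column labels are assumptions):

```python
# Collect the three principal components, keeping the original index
# so each row can be matched back to its coin
pcs_df = pd.DataFrame(
    crypto_pca,
    columns=["PC 1", "PC 2", "PC 3"],
    index=crypto_df.index,
)
```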
from sklearn.cluster import KMeans
- Find the best value for K using an elbow curve (see the sketch after this list).
- Run the K-means algorithm on the principal components data with a K value of 4 (n_clusters = 4), the point where the direction of the curve shifts.
- Plot a 3D scatter with the clusters.
- Plot a 2D scatter with "TotalCoinsMined" and "TotalCoinSupply".
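A sketch of these steps under the assumptions above (pcs_df holds the principal components, crypto_df the filtered features, and coin_names_df the coin names):

```python
# Elbow curve: fit K-means for K = 1..10 and record the inertia
inertia = []
k_values = list(range(1, 11))
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(pcs_df)
    inertia.append(km.inertia_)

elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
elbow_df.hvplot.line(x="k", y="inertia", title="Elbow Curve")

# Final model with K = 4, where the curve's slope levels off
model = KMeans(n_clusters=4, random_state=0)
pcs_df["Class"] = model.fit_predict(pcs_df)

# 3D scatter of the clusters, with CoinName shown on hover
fig = px.scatter_3d(
    pcs_df.join(coin_names_df),
    x="PC 1", y="PC 2", z="PC 3",
    color="Class", hover_name="CoinName",
)
fig.show()

# 2D scatter of mined coins vs. total supply; MinMaxScaler brings the
# two columns to a comparable [0, 1] range (casting to float is an
# assumption about the raw column types)
plot_cols = ["TotalCoinsMined", "TotalCoinSupply"]
scaled = MinMaxScaler().fit_transform(crypto_df[plot_cols].astype(float))
plot_df = pd.DataFrame(scaled, columns=plot_cols, index=crypto_df.index)
plot_df["Class"] = pcs_df["Class"]
plot_df.hvplot.scatter(x="TotalCoinsMined", y="TotalCoinSupply", by="Class")
```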
End of Project