# multivariate-weather-data-clustering for HESC branch

## Download

There are four ways to download and manage the MWDC package:

1. Use GitHub Desktop (recommended).

2. Use the command line:

 

Note: Because the repository is private, the command line method is not recommended.

3. Download the .zip file and use it.

4. On Google Colab, use the command below.

## Installation

#### 1. On PC

To install the package you need to create an environment using [pip](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

##### Conda environment setup
```bash
conda create -n mwdc pandas numpy xarray netCDF4 matplotlib scikit-learn scipy dask
conda activate mwdc
```

After that, clone this repository and run the `setup.py` script inside it.

```bash
cd multivariate-weather-data-clustering
python setup.py install
```

Note: On macOS, use `python3 setup.py install` instead.

#### 2. On Google Colab

After cloning the repository, run the commands below to install it.

```
%cd multivariate-weather-data-clustering
!python setup.py install
```

## Usage

To use the functions, import them from MWDC. Modules can be imported either separately or all together.

```python
from mwdc import *

## or ##

from mwdc.preprocessing import preprocessing
from mwdc.evaluation import st_evaluation
from mwdc.visualization import visualization
```

Example:

```python
trans_data = preprocessing.datatransformation(data)
```

## Modules Documentation

#### preprocessing

| Function | Description |
| --- | --- |
| `transformddaily()` | Transformation function for daily data |
| `transformdmock()` | Transformation function for mock data |
| `transformqm()` | Variable for quarter map |
| `datatransformation()` | See the note below\* |
| `datanormalization()` | Takes the transformed pandas DataFrame as input |
| `null_fill()` | Imputes NaN values across variables |
| `pca1()` | `data` is the input data; `n` is the number of components |
| `pcacomponents()` | Shows a suitable number of PCA components by computing the cumulative variance |
| `data_preprocessing()` | Transforms the xarray input data into a 2D NumPy array |

\*Note: `datatransformation()` transforms the xarray dataset into a pandas DataFrame in which the "time" dimension becomes the index and each ("latitude", "longitude") pair becomes a column for each variable.
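The reshaping described in the note can be illustrated with plain NumPy and pandas. This is only a minimal sketch, not the package's implementation: the helper `grid_to_frame`, the variable name `t2m`, and the `name(lat,lon)` column naming are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def grid_to_frame(values, times, lats, lons, name):
    """Flatten a (time, lat, lon) array into a time-indexed DataFrame.

    Hypothetical helper: one column per (latitude, longitude) pair.
    """
    n_time = values.shape[0]
    # Row-major reshape: longitude varies fastest within each latitude.
    flat = values.reshape(n_time, -1)
    # Column labels are built in the same (lat, lon) order as the reshape.
    columns = [f"{name}({lat},{lon})" for lat in lats for lon in lons]
    return pd.DataFrame(flat, index=pd.Index(times, name="time"), columns=columns)


# Toy grid: 3 time steps, 2 latitudes, 2 longitudes.
times = pd.date_range("2020-01-01", periods=3, freq="D")
lats = [30.0, 31.0]
lons = [-90.0, -89.0]
temperature = np.arange(12, dtype=float).reshape(3, 2, 2)

frame = grid_to_frame(temperature, times, lats, lons, "t2m")
print(frame)
```

With several variables, one frame per variable could be built this way and joined column-wise with `pd.concat(frames, axis=1)`.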

#### clustering

- DBSCAN

| Function | Description |
| --- | --- |
| `dbscanreal(x, eps1=0.5, min=5)` | `x` is the input data, `eps1` is epsilon, `min` is the minimum number of samples |
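To show what the epsilon and minimum-samples parameters control, here is a small sketch that calls scikit-learn's `DBSCAN` directly with the same default values. It assumes `dbscanreal()` behaves like a thin wrapper around that estimator, which the description above suggests but does not confirm. The toy data is made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight blobs plus one far-away point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(30, 2))
outlier = np.array([[20.0, 20.0]])
x = np.vstack([blob_a, blob_b, outlier])

# eps is the neighbourhood radius; min_samples is the number of points
# required within that radius for a point to count as a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(x)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, labels[-1])
```

A larger epsilon merges nearby groups, while a larger minimum sample count turns sparse regions into noise.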

- Agglomerative Clustering

| Function | Description |
| --- | --- |
| `st_agglomerative(data, n, K, p=7, affinity, linkage)` | `n` = PCA components, `K` = number of clusters, `p` = `truncate_mode` |

- Kmeans

| Function | Description |
| --- | --- |
| `Kmeans(n_cluster).fit(xarray_data, PCA=(boolean), pass_trans_data=(boolean))` | \* |
| `Kmeans(n_cluster).evaluate(z, PCA=(boolean), pass_trans_data=(boolean))` | \*\* |

\* `fit()` fits the K-means model to the data that is passed to it. It accepts the following parameters:

  1. `xarray_data`: string with the name of the original xarray file.
  2. `PCA` (bool): whether PCA should be applied. The default value is True.
  3. `pass_trans_data` (bool): whether saved data should be passed. If False, the data is transformed on the fly. The default value is True.

\*\* `evaluate()` evaluates the data points and assigns them to clusters. It accepts the following parameters:

  1. `z`: string with the name of the original xarray file.
  2. `PCA` (bool): whether PCA should be applied. The default value is True.
  3. `pass_trans_data` (bool): whether saved data should be passed. If False, the data is transformed on the fly. The default value is True.
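The fit-then-assign flow, with PCA as an optional step, can be sketched with scikit-learn. This is not the package's `Kmeans` class: the helpers `fit_kmeans` and `assign_clusters`, the `use_pca` flag, and the toy data are assumptions made to illustrate the idea on an already transformed 2D array.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def fit_kmeans(data, n_clusters, use_pca=True, n_components=2):
    """Fit K-means, optionally on PCA-reduced features (hypothetical helper)."""
    reducer = PCA(n_components=n_components).fit(data) if use_pca else None
    features = reducer.transform(data) if reducer is not None else data
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    return model, reducer


def assign_clusters(model, reducer, data):
    """Assign rows to clusters using the same reduction as at fit time."""
    features = reducer.transform(data) if reducer is not None else data
    return model.predict(features)


# Toy "transformed" data: 3 groups of 20 time steps, 6 grid-cell columns each.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(20, 6)) for center in (0.0, 10.0, 20.0)
])

model, reducer = fit_kmeans(data, n_clusters=3, use_pca=True)
labels = assign_clusters(model, reducer, data)
print(labels)
```

Keeping the fitted PCA object next to the model ensures that new data is projected the same way before cluster assignment.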

#### evaluation

| Function | Parameters |
| --- | --- |
| `st_rmse()` | `input`, `formed_clusters` |
| `st_corr()` | `input`, `formed_clusters` |
| `st_calinski()` | `input`, `formed_clusters` |
| `davies_bouldin()` | `input`, `formed_clusters` |
| `compute_silhouette_score()` | `X`, `labels`, `transformation=False`, `*`, `metric="euclidean"`, `sample_size=None`, `random_state=None`, `**kwds` |
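For orientation, the sketch below computes three of these cluster-quality scores with scikit-learn's own metric functions on toy data. It assumes the package functions report comparable quantities. Their exact behaviour, including the `transformation` flag, is not shown here.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy data: two well-separated groups of 40 samples with 3 features each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(40, 3)),
    rng.normal(loc=10.0, scale=0.5, size=(40, 3)),
])
labels = np.repeat([0, 1], 40)

scores = {
    # Silhouette: ranges from -1 to 1, higher means better separated clusters.
    "silhouette": silhouette_score(X, labels, metric="euclidean"),
    # Davies-Bouldin: lower is better, 0 is the minimum.
    "davies_bouldin": davies_bouldin_score(X, labels),
    # Calinski-Harabasz: higher is better.
    "calinski_harabasz": calinski_harabasz_score(X, labels),
}
print(scores)
```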

#### visualization

| Function | Parameters |
| --- | --- |
| `visualization()` | `data_file`, `cluster_filename`, `coast_file` |
| `make_Csv_cluster()` | `label`, `name` |

Parameters that `visualization()` accepts:

  1. `data_file` is the `.nc` file containing the raw, unprocessed data.
     - Example: `data_file = 'path/data.nc'`
  2. `cluster_filename` is the CSV file that contains `clusterid` and `time_step`, that is, which cluster each date belongs to.
     - Example: `cluster_filename = 'path/clusters.csv'`
  3. `coast_file` is the file that describes how the coastline should look in the result.
     - Example: `'path/coast.txt'`

Parameters that `make_Csv_cluster()` accepts:

  1. `label` contains the cluster IDs.
  2. `name` is the name of the file that will be generated, e.g. `'test.csv'`.
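A cluster file with `clusterid` and `time_step` columns could be produced along these lines. This is a hedged sketch, not the code behind `make_Csv_cluster()`: the helper `clusters_to_csv`, its `times` argument, and the column order are assumptions for illustration.

```python
import io

import pandas as pd


def clusters_to_csv(labels, times, target):
    """Write one row per time step with its cluster ID (hypothetical helper)."""
    table = pd.DataFrame({"time_step": times, "clusterid": labels})
    table.to_csv(target, index=False)
    return table


# Write to an in-memory buffer here; a path such as 'test.csv' works the same way.
buffer = io.StringIO()
times = pd.date_range("2020-01-01", periods=4, freq="D").strftime("%Y-%m-%d")
clusters_to_csv([0, 1, 1, 0], times, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```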