HDBSCAN Suite

Clustering suite designed to run HDBSCAN on a dataset multiple times with multiple parameters and perform statistical analysis on the results.

Description

In each run, HDBSCAN is performed on the dataset once for every parameter value. The results of each run are accumulated and stored in that run's respective directory.
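
The run loop described above can be sketched roughly as follows. This is a minimal illustration and not the suite's actual code: `run_suite`, `dummy_cluster`, the `run_N` directory names, and the `labels.json` file are all hypothetical, and a stand-in clustering function is used so the example runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path


def run_suite(data, param_values, runs, cluster_fn, out_dir):
    """Call cluster_fn once per (run, parameter value); store results per run.

    All names and the on-disk layout are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    for run in range(1, runs + 1):
        run_dir = out_dir / "run_{}".format(run)
        run_dir.mkdir(parents=True, exist_ok=True)
        results = {}
        for value in param_values:
            # With the hdbscan package installed, this call could instead be
            # something like:
            #   hdbscan.HDBSCAN(min_cluster_size=value).fit_predict(data)
            results[str(value)] = cluster_fn(data, value)
        # Accumulate the labels for every parameter value of this run.
        with open(str(run_dir / "labels.json"), "w") as fh:
            json.dump(results, fh)
    return out_dir


def dummy_cluster(data, value):
    # Placeholder only: assigns points to `value` buckets by index.
    return [i % value for i in range(len(data))]


if __name__ == "__main__":
    points = [[0.0, 0.1], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]]
    with tempfile.TemporaryDirectory() as tmp:
        run_suite(points, [2, 3], runs=2, cluster_fn=dummy_cluster, out_dir=tmp)
        print(sorted(p.name for p in Path(tmp).iterdir()))  # ['run_1', 'run_2']
```

Passing the clustering step in as a function keeps the loop structure visible without depending on how the suite itself calls HDBSCAN.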

Prerequisites

  • Python 3.5+
  • pip 10.0+

Installation

Navigate to the desired directory and create a virtual environment. IMPORTANT: Make sure python3 is set as the interpreter. Environments are isolated from the system site-packages by default, so no extra flag is needed.

virtualenv -p python3 [desired environment name]

Navigate to the environment directory and activate the environment.

cd [environment]
source bin/activate

Clone this repository into the environment.

git clone https://github.com/Vardominator/hdbscan-suite.git

Install the required packages. Order is important: numpy must be installed before the remaining requirements.

pip install numpy
pip install -r requirements.txt

Usage

Save config.json.template as config.json.

cp config.json.template config.json

Using the table below, set the parameters in config.json as desired.

| Parameter | Description | Example |
| --- | --- | --- |
| runs | Number of times to run HDBSCAN with the set of parameters | 50 |
| data | Path to the dataset to be clustered | "data/luteo-1796-1798.txt" |
| partition:column | Column used to partition the data | 3 |
| partition:start | Starting value for the partition (everything below it is left out) | 6000 |
| sample | Sample size taken from the dataset. Set to 0 to use the entire dataset after partitioning | 12000 |
| norm:method | Normalization method. Available methods: standard_score, feature_scale | "feature_scale" |
| norm:columns | Columns to be normalized (e.g. [4,5,10]) | [4,5] |
| range | Columns to include in the clustering, counting from 0 | [4,12] (columns 4 through 12 will be used in clustering) |
| parameters:range | Whether the desired parameter values form a range. Set to false to run with individual values | true |
| parameters:option | Cluster criterion: minimum cluster size or minimum sample size. See below for more information | "min_cluster_size" |
| parameters:min | Range or set of values to be used for each run. If range is set to true, [2,10,1] uses the values from 2 to 10 in steps of 1. If it is set to false, [2,5,10,30] uses the values 2, 5, 10, and 30 | [2,10,1] |
| threads | Number of threads to use within the HDBSCAN algorithm | 4 |
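
As an illustration of how two of the settings above might be interpreted, here is a minimal sketch. The function names `expand_parameters` and `normalize` are hypothetical and not taken from the suite. It assumes that a range `[start, stop, step]` includes its end value, that `standard_score` means the z-score (using the population standard deviation), and that `feature_scale` means min-max scaling to [0, 1]. The suite's own behaviour may differ.

```python
from statistics import mean, pstdev


def expand_parameters(values, is_range):
    """Return the list of parameter values to run.

    Assumption: a range is [start, stop, step] with `stop` included.
    """
    if not is_range:
        return list(values)
    if len(values) != 3:
        raise ValueError("a range needs exactly [start, stop, step]")
    start, stop, step = values
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, stop + 1, step))


def normalize(column, method):
    """Normalize one column of numbers.

    Assumptions: standard_score is the z-score and feature_scale is
    min-max scaling. A constant column would divide by zero and is not
    handled in this sketch.
    """
    if method == "standard_score":
        mu, sigma = mean(column), pstdev(column)
        return [(x - mu) / sigma for x in column]
    if method == "feature_scale":
        lo, hi = min(column), max(column)
        return [(x - lo) / (hi - lo) for x in column]
    raise ValueError("unknown normalization method: {}".format(method))


if __name__ == "__main__":
    print(expand_parameters([2, 10, 1], True))       # [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(expand_parameters([2, 5, 10, 30], False))  # [2, 5, 10, 30]
    print(normalize([0, 5, 10], "feature_scale"))    # [0.0, 0.5, 1.0]
```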

Make sure the environment is active.

source [environment]/bin/activate

Run the suite.

python hdbscan_suite.py

Results will be stored in RESULTS/[starting time and date]/*

Log will be stored in LOGS/[starting time and date].log

Resources

Open-source HDBSCAN extension to Python's scikit-learn machine learning library