ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain
This repository contains code for ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain.
To get the submodules:
git pull
git submodule update --init
Each folder contains an implementation that we tested.
parchain
Our ParChain algorithms for complete linkage with Euclidean distance, Ward's linkage with Euclidean distance, average linkage with Euclidean distance and average linkage with squared Euclidean distance. This program is adapted from the problem based benchmark suite (PBBS).
multi-threaded-NN-chain
Jeon et al.'s parallel nearest-neighbor chain implementation. It supports average linkage with Euclidean distance.
fast_protein_cluster
fast_protein_cluster parallel HAC implementation that works for complete and average linkage with Euclidean distance metric.
The original code works for root-mean-square deviation (RMSD) and template modeling score (TM-score) measures.
fastcluster
fastcluster is a sequential HAC implementation in C++ with python interface.
scipy
and sklearn
can be installed from python.
althaus
Althaus et al.'s is a parallel HAC algorithm for complete linkage. We implemented it in C++.
Compiler:
- g++ = 7.5.0 with support for Cilk Plus
- python3 >= 3.6.9
- make
Python libraries:
- scipy >= 1.5.4
- sklearn >= 0.24.2
- pandas
- numpy
You can download our datasets here
For running time tests, we use numactl
. It can be installed using apt install numactl
.
For the memory tests, valgrind >= 3.17.0
is required. You can download here.
To run our scripts, the datasets need to be in the "datasets" folder under this folder.
The runtime/memory experiment result of our algorithm runs the following. The result files will be in outputs/outputs_{method}
and outputs/memo_outputs
.
You can modify datasets
and dims
in pc_exp.sh for the datasets to run.
./pc_exp.sh complete
./pc_exp.sh ward
./pc_exp.sh avg 32
./pc_exp.sh avgsq
./pc_exp_memo.sh
The runtime/memory experiment result of our algorithm runs the following. The result files will be in outputs/outputs_matrix
and outputs/memo_outputs
.
You can modify datasets
and dims
in pc_exp.sh for the datasets to run.
./pc_matrix_exp.sh complete
./pc_matrix_exp.sh ward
./pc_matrix_exp.sh avg
./pc_matrix_exp.sh avgsq
./pc_matrix_exp_memo.sh
The runtime/memory experiment result of our algorithm runs the following. The result files will be in outputs/outputs_matrix_range
and outputs/memo_outputs
.
You can modify datasets
and dims
in pc_exp.sh for the datasets to run.
./pc_matrix_range_exp.sh complete
./pc_matrix_range_exp.sh ward
./pc_matrix_range_exp.sh avg
./pc_matrix_range_exp.sh avgsq
./pc_matrix_exp_range_memo.sh
The runtime/memory experiment result of Jeon runs the following. The result files will be in outputs/outputs_jeon
and outputs/memo_outputs
.
./jeon_exp.sh
./jeon_exp_memo.sh
The runtime/memory experiment result of fastprotein runs the following. The result files will be in outputs/outputs_fastprotein
and outputs/memo_outputs
.
./fastprotein_exp.sh
./fastprotein_exp_memo.sh
The runtime/memory experiment result of scipy and sklearn runs the following. The result files will be in outputs/outputs_scipy
, outputs/outputs_sklearn
, outputs/outputs_fastcluster
and outputs/memo_outputs
.
You need to install Scipy, Sklearn, and fastcluster first.
Scipy and Sklearn and be installed using pip
. The fastcluster submodule in this repo includes some additional methods (linear memory nearest neighbor chain for Ward's and average linkage with squared Euclidean distance metric). You can refer to fastcluster/
for installation method.
./python_libs_exp.sh
./python_libs_exp_memo.sh
The runtime/memory experiment result of fastprotein runs the following. The result files will be in outputs/outputs_clink
and outputs/memo_outputs
.
./clink_exp.sh