/data-embedding-and-visualization

Visualization and embedding of large datasets using various Dimensionality Reduction (DR) techniques such as t-SNE, UMAP, PaCMAP and IVHD. Implementation of custom metrics to assess DR quality, with a complete explanation and workflow.

Primary language: Jupyter Notebook. License: MIT.

1. Introduction

Visualization makes it easier to notice and understand dependencies in high-dimensional data that are otherwise hard to capture. It is an integral part of data analysis and initial data exploration, and also a standalone tool and an active area of machine learning. Visualization lets us check whether similar observations form clusters and helps build intuition about the data. For high-dimensional data, the number of dimensions must first be reduced to at most three. The relationships in the data are often non-linear, so linear methods such as PCA frequently give poor separation quality. Manifold learning techniques are therefore needed to discover the surface (manifold) on which the data is distributed and to project it into a space of the desired dimensionality.

This project analyzes and visualizes the MNIST, 20 Newsgroups, and RCV Reuters datasets using t-SNE, UMAP, ISOMAP, PaCMAP and IVHD. The motivation is to demonstrate the concept of high-dimensional data visualization, assess multiple data embedding techniques, and highlight potential criteria for comparing data separation quality.

2. VISKIT

1. Configuration and Setup

Viskit Repository and README

git clone https://gitlab.com/bminch/viskit.git
docker build -t viskit -f Dockerfile .
docker run -it viskit /bin/bash

2. Graphs

Graphs are required by VisKit. For this project, they can be downloaded either manually or automatically.

source /utils/download_graphs.sh

Graphs location on Google Drive:
mnist_cosine.bin
mnist_euclidean.bin
reuters_cosine.bin
reuters_euclidean.bin
tng_cosine.bin
tng_euclidean.bin

3. Usage documentation

Provide the dataset without labels ({path_to_dataset_file}) and the labels ({path_to_labels_file}) as separate CSV files, together with the graph file ({path_to_graph_file}). The visualization is saved as a text file at the specified path ({path_to_visualization}).

cd /opt/viskit/viskit_offline
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization} 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization}
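The requirement to pass the dataset and the labels as separate CSV files can be met with a small helper like the one below. This is only a sketch: the function names are hypothetical, and it assumes header-less, comma-separated files with the label stored in one column of each row. The exact layout that viskit_offline expects is not specified here, so check the Viskit README before relying on it.

```python
import csv


def split_features_and_labels(rows, label_index=-1):
    """Split each row into its feature values and its label.

    label_index is the position of the label column (assumed: last column).
    """
    features, labels = [], []
    for row in rows:
        row = list(row)  # copy so the caller's data is not modified
        labels.append(row.pop(label_index))
        features.append(row)
    return features, labels


def write_csv(path, rows):
    """Write rows to a comma-separated file without a header line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def export_dataset(rows, dataset_path, labels_path, label_index=-1):
    """Write features and labels to two separate CSV files (hypothetical helper)."""
    features, labels = split_features_and_labels(rows, label_index)
    write_csv(dataset_path, features)
    write_csv(labels_path, [[label] for label in labels])
    return len(features)
```

The two resulting paths would then be passed as {path_to_dataset_file} and {path_to_labels_file}.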

4. Usage examples

cd /opt/viskit/viskit_offline
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt

3. Metrics

Metrics are used to assess and compare the quality of dimensionality reduction techniques. Two major aspects are worth considering during assessment: the local and the global quality of separation.
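To make the local versus global distinction concrete, here is a minimal pure-Python sketch. It is not the implementation from this project's notebooks; the function names are illustrative. The global score is assumed to be the Spearman correlation between pairwise distances before and after embedding, and the local score is assumed to be the average fraction of each point's k nearest neighbours that the embedding preserves.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances for all unordered pairs, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def global_score(high, low):
    """Spearman correlation between original and embedded pairwise distances."""
    return pearson(ranks(pairwise_distances(high)),
                   ranks(pairwise_distances(low)))


def knn(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def local_score(high, low, k):
    """Mean fraction of each point's k nearest neighbours kept by the embedding."""
    n = len(high)
    kept = sum(len(knn(high, i, k) & knn(low, i, k)) for i in range(n))
    return kept / (n * k)
```

An embedding can do well on one score and poorly on the other, which is why both aspects are assessed.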

Implemented Metrics:

  1. Distance matrix-based metric
  2. Distance matrix-based metric with KMeans optimization
  3. KMeans extension of distance matrix based metric
  4. Trustworthiness-based metric
  5. Spearman correlation-based metric
  6. KNN Gain & DR Quality
  7. Shepard Diagram
  8. Co-ranking matrix-based metric
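As an illustration of the rank-based metrics in this list, the sketch below builds a co-ranking matrix and computes trustworthiness in plain Python. It follows the standard textbook definitions rather than this project's notebooks, and the function names are illustrative. It assumes Euclidean distances, no exact distance ties, and k < n/2 so that the normalization term stays positive.

```python
import math


def rank_matrix(points):
    """ranks[i][j] is the position of j among i's neighbours (1 = nearest).

    The diagonal stays 0 because a point is not its own neighbour.
    """
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for position, j in enumerate(order, start=1):
            ranks[i][j] = position
    return ranks


def coranking_matrix(high, low):
    """Q[k-1][l-1] counts pairs (i, j) whose rank is k in the original
    space and l in the embedding. A perfect embedding gives a diagonal Q."""
    n = len(high)
    r_high, r_low = rank_matrix(high), rank_matrix(low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[r_high[i][j] - 1][r_low[i][j] - 1] += 1
    return q


def trustworthiness(high, low, k):
    """Penalize points that enter a k-neighbourhood in the embedding
    although they were not among the k nearest in the original space."""
    n = len(high)
    r_high, r_low = rank_matrix(high), rank_matrix(low)
    penalty = 0
    for i in range(n):
        for j in range(n):
            if i != j and r_low[i][j] <= k and r_high[i][j] > k:
                penalty += r_high[i][j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

Mass above the diagonal of the co-ranking matrix corresponds to points pulled closer by the embedding (intrusions), and mass below it to points pushed apart (extrusions). Both loops are quadratic in the number of points, so on large datasets this is only practical on a subsample.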

4. [Appendix] Introduction to Dimensionality Reduction

Jupyter notebooks that cover basic and advanced topics in the visualization of large datasets and dimensionality reduction:

  1. Principal Component Analysis
  2. Swiss roll projections using t-SNE and MDS
  3. f-MNIST and MNIST visualizations using t-SNE, UMAP and LargeVis
  4. Neural Networks hidden layers activations embedding

5. Authors

Mateusz Smendowski & Michał Grela