This repository contains a header-only C++ library implementing the Barnes-Hut t-distributed stochastic neighbor embedding (t-SNE) algorithm (van der Maaten and Hinton, 2008; van der Maaten, 2014). t-SNE is a non-linear dimensionality reduction technique that enables visualization of complex datasets by placing each observation on a low-dimensional (usually 2D) map based on its neighbors. The general idea is that neighboring observations are placed next to each other, thus preserving the local structure of the dataset from its original high-dimensional form. The non-linear nature of the algorithm gives it the flexibility to accommodate complicated structure that is not possible in linear techniques like PCA.
The code here is derived from the C++ code in the Rtsne R package, itself taken from the 2014 paper. It has been updated to use more modern C++, along with some additional options to sacrifice accuracy for speed - see below for details.
Using this library is as simple as including the header file in your source code:
#include "qdtsne/qdtsne.hpp"
// Assuming `data` contains high-dimensional data in column-major format,
// i.e., each column is a observation and each row is a dimension. We
// initialize the t-SNE algorithm for a 2-dimensional embedding.
qdtsne::Options opt;
auto status = qdtsne::initialize<2>(nrow, ncol, data.data(), knncolle::VptreeBuilder(), opt);
// Starting the t-SNE algorithm from random 2-dimensional coordinates.
auto Y = qdtsne::initialize_random<2>(ncol);
status.run(Y.data());
You can change the parameters with the relevant setters:
opt.perplexity = 10;
opt.mom_switch_iter = 100;
You can also stop and start the algorithm:
auto status2 = qdtsne::initialize(nrow, ncol, data.data(), knncolle::VptreeBuilder(), opt);
status2.run(Y.data(), 200); // run up to 200 iterations
status2.run(Y.data(), 500); // run up to 500 iterations
See the reference documentation for more details.
van der Maaten (2014) proposed the use of the Barnes-Hut approximation for the repulsive force calculations in t-SNE.
The algorithm consolidates a group of distant points into a single center of mass, avoiding the need to calculate forces between individual points.
The definition of "distant" is determined by the theta
parameter, where larger values sacrifice accuracy for speed.
In qdtsne, we introduce an extra max_depth
parameter that bounds the depth of the tree used for the Barnes-Hut force calculations.
Setting a maximum depth of
- Define the bounding box/hypercube for our dataset and partition it along each dimension into
$2^m$ intervals, forming a high-dimensional grid. - In each grid cell, move all data points in that cell to the cell's center of mass.
- Construct a standard Barnes-Hut tree on this modified dataset.
- Use the tree to compute repulsive forces for each point
$i$ using its original coordinates.
The approximation is based on ignoring the distribution within each grid cell, which should be acceptable at large
We can go even further by using the center of mass for leaf_approximation
parameter.
Note that this only has an effect in max_depth
-bounded trees where multiple points are assigned to a leaf node.
Some testing indicates that both approximations can significantly speed up calculation of the embeddings.
Timings are shown below in seconds, based on a mock dataset containing 50,000 points (see tests/R/examples/basic.R
for details).
Strategy | Serial | Parallel (OpenMP, n = 4) |
---|---|---|
Default | 136 | 42 |
max_depth = 7 |
85 | 26 |
max_depth = 7 , leaf_approximation = true |
46 | 16 |
If you're already using CMake, you can add something like this to your CMakeLists.txt
:
include(FetchContent)
FetchContent_Declare(
qdtsne
GIT_REPOSITORY https://github.com/libscran/qdtsne
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(qdtsne)
And then:
# For executables:
target_link_libraries(myexe qdtsne)
# For libaries
target_link_libraries(mylib INTERFACE qdtsne)
find_package(libscran_qdtsne CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::qdtsne)
To install the library, use:
mkdir build && cd build
cmake .. -DQDTSNE_TESTS=OFF
cmake --build . --target install
By default, this will use FetchContent
to fetch all external dependencies.
If you want to install them manually, use -DQDTSNE_FETCH_EXTERN=OFF
.
See extern/CMakeLists.txt
to find compatible versions of each dependency.
If you're not using CMake, you can just copy the header files in include/
into some location that is visible to your compiler.
This requires the manual inclusion of the dependencies listed in extern/CMakeLists.txt
.
Most of the code in this repository is MIT-licensed (see the LICENSE
).
The exception is the code that was derived from the Rtsne R package, which in turn was taken from the van der Maaten (2014) paper.
Readers are referred to the relevant source files (SPTree.hpp
and Status.hpp
) for their licensing conditions.
van der Maaten, L.J.P. and Hinton, G.E. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
van der Maaten, L.J.P. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15, 3221-3245.