This repository contains software for net-SNE, a neural netork-based visualization tool developed for single cell RNA sequencing datasets.
"Generalizable and Scalable Visualization of Single-Cell Data Using Neural Networks"
Hyunghoon Cho, Bonnie Berger, and Jian Peng
Cell Systems, 2018
-
Armadillo for matrix/vector operations (developed with version 8.2).
- (Optional) To speed up matrix multiplication via multi-threading, we recommend installing OpenBLAS before installing Armadillo (see Armadillo's README for more information). Many platforms come with pre-built OpenBLAS library. Armadillo's CMake installer will try to automatically detect whether OpenBLAS is available and link to it if it exists. You may need to explicitly provide the path if it is installed in a non-standard location.
-
Boost C++ Library for command-line parser and file system operations (developed with version 1.66.0). We require only the FileSystem, System, and ProgramOptions libraries. These can be selectively installed by calling
bootstrap.sh
with the argument--with-libraries=program_options,filesystem,system
before building Boost.
-
Update platform-specific settings in Makefile. Parameters to update/confirm are:
CXX
: C++ compilerBOOSTROOT
,BOOSTLIB
: Boost library paths (base directory for the package and library directory containing linkable .a/.so files, respectively)ARMAINC
,ARMALIB
: Armadillo library paths (directory containing header files and directory with linkable .a/.so files, respectively); only needed if Armadillo is installed in a non-standard location, in which caseARMALIB
should also be included in the system's search path for shared libraries (e.g.,LD_LIBRARY_PATH
on Linux) before running net-SNE.
-
Inside the net-SNE directory, run
make
. This will create the executablesRunNetsne
,RunBhtsne
, andComputeP
in a subdirectory namedbin
.
We provide three example MATLAB scripts for different use cases of net-SNE: example_run_basic.m
, example_run_crossdataset.m
, and example_run_millioncell.m
. See the comments in each script for more information. The following is a step-by-step description of example_run_basic.m
, which serves a good starting point:
-
Prepare dataset. In
example_data
directory we provide an example single-cell RNA-seq dataset from Pollen et al., Nature Biotechnology, 2014. The data filepollen.txt
contains a matrix in space-delimited text format, where each row is a gene (or a unique molecule) and each column is a sample (i.e., a cell). Each element represents a measure of expression (e.g., normalized read counts). Known subtypes of the cells are provided inpollen_labels.txt
. -
Preprocess. We provide a MATLAB script
prepare_input.m
for (optionally) performing dimensionality reduction via principal component analysis and saving the data in binary format for subsequent steps. In MATLAB, run:
X = dlmread('example_data/pollen.txt');
prepare_input(X', 'example_data/pollen_X.dat', 50, 1, 'example_data/pollen_pca.mat')
This will save the processed data with respect to top 50 principal components toexample_data/pollen_X.dat
. See the comments inprepare_input.m
for more information about the arguments. -
Compute input similarities. Like t-SNE, net-SNE learns the embedding by placing cells closer to each other in the visualization if they are more similar in the input data. Using
bin/ComputeP
, which invokes a subroutine from the original t-SNE implementation, we can construct a k-Nearest Neighbor (k-NN) graph approximation of the input similarity matrix (denoted P in t-SNE). In the command-line terminal, run:
bin/ComputeP --input-file example_data/pollen_X.dat --output-file example_data/pollen_P.dat
This saves the input similarity matrix toexample_data/pollen_P.dat
. To see the list of supported arguments, runbin/ComputeP --help
. -
Run net-SNE (or t-SNE). In the command-line terminal, run:
bin/RunNetsne --input-P example_data/pollen_P.dat --input-X example_data/pollen_X.dat --out-dir example_data/netsne --no-sgd
This outputs the final model and the embedding in theexample_data/netsne
directory. The embedding fileY_final.txt
is in the Armadillo text file format, which contains two lines of header followed by a matrix. Note the flag--no-sgd
makes net-SNE use batch gradient descent as the example dataset is relatively small. To see the list of supported parameters, runbin/RunNetsne --help
. To run t-SNE for comparison, run the following:
bin/RunBhtsne --input-P example_data/pollen_P.dat --out-dir example_data/bhtsne
-
Plot visualization. The following MATLAB code visualizes the final embeddings and color the points according to the known cluster labels in
pollen_labels.txt
:
Y = dlmread('example_data/netsne/Y_final.txt', '', 2, 0);
labels = dlmread('example_data/pollen_labels.txt');
scatter(Y(:,1), Y(:,2), 10, labels, 'filled')
Our software is based on the implementation of Barnes-Hut t-SNE developed by Laurens van der Maaten at Delft University of Technology, available at: https://github.com/lvdmaaten/bhtsne.
We are working on releasing wrappers in other programming languages (e.g. Python). If you would like an update when these become available, feel free to drop us a note at the email address below.
[2018/8/7] R wrapper is available at https://github.com/schwikowskilab/rNetSNE. Developed by Pierre Bost and Florian Specque.
Hoon Cho, hhcho@mit.edu