enhance: A Python repository from flo-compbio

ENHANCE: Accurate denoising of single-cell RNA-Seq data

This repository contains a Python implementation of the ENHANCE algorithm for denoising single-cell RNA-Seq data (Wagner et al., 2019).

The R implementation can be found in a separate repository.

Running ENHANCE from the command-line

Follow these instructions to run the Python implementation of ENHANCE from the command-line.

Install dependencies

Make sure you have Python 3 and the Python packages scikit-learn, pandas, and click installed. The easiest way to install Python 3 as well as these packages is to download and install Anaconda (select "Python 3.7 version").
Download the GitHub repository

Download ENHANCE, and extract the contents into a folder.

Test running the script

To run the script, change into the folder where you extracted the files, and run (on Linux/Mac):

python3 enhance.py --help

You should see the following output:

Usage: enhance.py [OPTIONS]

Options:
-f, --fpath TEXT            The input UMI-count matrix.
-o, --saveto TEXT           The output matrix.
--transcript-count INTEGER  The target median transcript count for
                            determining thenumber of neighbors to use for
                            aggregation.(Ignored if "--num-neighbors" is
                            specified.)
--max-neighbor-frac FLOAT   The maximum number of neighbors to use for
                            aggregation, relative to the total number of
                            cells in the dataset. (Ignored if "--num-
                            neighbors" is specified.)
--pc-var-fold-thresh FLOAT  The fold difference in variance required for
                            relevant PCs, relative to the variance of the
                            first PC of a simulated dataset containing only
                            noise.
--max-components INTEGER    The maximum number of principal components to
                            use.
--num-neighbors INTEGER     The number of neighbors to use for aggregation.
--sep TEXT                  Separator used in input file. The output file
                            will use this separator as well.  [default: \t]
--use-double-precision      Whether to use double-precision floating point
                            format. (This doubles the amount of memory
                            required.)
--seed INTEGER              Seed for pseudo-random number generator.
                            [default: 0]
--test                      Test if results for test data are correct.
--help                      Show this message and exit.

Make sure your expression matrix file is formatted correctly

By default, the script expects your expression matrix to be stored as a tab-separated plain-text file, with gene labels contained in the first column, and cell labels contained in the first row (the top-left "cell" in the matrix can either be empty or contain the first cell label). You can compress the text file using gzip. A properly formatted example dataset (data/pbmc-4k_expression.tsv.gz) is included in this repository.

If your file uses a separator other than the tab character, you must specify it by passing the --sep argument to the script. For example, if you're using comma-separated values (csv), pass --sep ,. This will also affect the separator used in the output file.
Run ENHANCE!

Let's say your (tab-separated) expression matrix file is called expression.tsv.gz, and you saved it in the same directory as the enhance.py script. Then, to ENHANCE, you would use:
```
python3 enhance.py -f expression.tsv.gz -o denoised_expression.tsv
```
This will produce a denoised matrix called denoised_expression.tsv.

Example

Running ENHANCE from the command-line, on the test dataset included in this repository (data/pbmc-4k_expression.tsv.gz):

$ python3 enhance.py -f data/pbmc-4k_expression.tsv.gz -o denoised_pbmc-4k_expression.tsv

Output:

  [2019-06-04 18:51:16] INFO: Reading the expression matrix (6.4 MB) from "data/pbmc-4k_expression.tsv.gz"...
  [2019-06-04 18:51:26] INFO: The expression matrix contains 14854 genes and 4334 cells.
  [2019-06-04 18:51:26] INFO: Applying ENHANCE...
  [2019-06-04 18:51:26] INFO: Input matrix hash: 199bcb6f4b2fbd7e254bafb272df07e6
  [2019-06-04 18:51:26] INFO: The median transcript count of the matrix is 3478.5.
  [2019-06-04 18:51:26] INFO: Will perform denoising with k=58 (value was determined automatically based on a target transcript count of 200000).
  [2019-06-04 18:51:26] INFO: Determining the number of significant PCs...
  [2019-06-04 18:51:32] INFO: The number of significant PCs is 16.
  [2019-06-04 18:51:32] INFO: Aggregating cells...
  [2019-06-04 18:51:49] INFO: Removing noise using PCA...
  [2019-06-04 18:51:54] INFO: ENHANCE took 28.0 s.
  [2019-06-04 18:51:55] INFO: Denoised matrix hash: 9517e97621e500e357d9ecea9a36bb63
  [2019-06-04 18:51:55] INFO: Writing the denoised expression matrix to "denoised_pbmc-4k_expression.tsv"...
  [2019-06-04 18:53:21] INFO: File size: 711.7 MB.

The results are shown using UMAP below. (The UMAP result is included in this repository, under data/umap_result.tsv).

Changelog

We will note all changes to the code here.

6/3/2019 - Python implementation of ENHANCE released

This is the first release of ENHANCE (algorithm version "0.1"), which is the version we used to generate the results presented in our bioRxiv paper, including all benchmark analyses.

flo-compbio/enhance

ENHANCE: Accurate denoising of single-cell RNA-Seq data

Running ENHANCE from the command-line

Example

Changelog

6/3/2019 - Python implementation of ENHANCE released