A tool for removing doublets from single-cell RNA-seq data

DoubletDecon

Deconvoluting doublets from single-cell RNA-sequencing data

See our Cell Reports paper for more information on DoubletDecon, and our bioRxiv preprint for an older description of the algorithm.

Updates - Version 1.1.3 : November 6th, 2019

  • NEW! Integrated ICGS2_to_ICGS1() is now available to support input files from ICGS version 2. You should not have to make any changes in your DoubletDecon workflow to use ICGS version 2 instead of ICGS version 1.
  • Improved_Seurat_Pre_Process() now loads dplyr at the beginning of the function (thanks chansigit for the feedback!)
  • Fixed bugs in Remove_Cell_Cycle() that were preventing it from running on certain datasets

Updates - Version 1.1.2 : September 5th, 2019

  • NEW! Improved_Seurat_Pre_Process() is now available to replace Seurat_Pre_Process() for those who would prefer to work directly with a Seurat object as input instead of individual files saved from a Seurat workflow. Workflows following the protocol found at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html, the provided script (seurat-3.0.R), or similar will be sufficient for this new function.
  • Resolved compatibility issues with Seurat version 3

Updates - Version 1.1.1 : May 29th, 2019

  • Change default for only50 to FALSE (from TRUE) to reflect best practices in running DoubletDecon

Updates - Version 1.1.0 : March 26th, 2019

  • NEW! DoubletDecon UI available in the GitHub repository (requires R3.5.0 or later and RStudio with 'shiny' package installed)
  • NEW! Improved Rescue step, Pseudo_Marker_Finder, now uses parallel processing and data chunking to improve speed and memory efficiency. Results remain the same, except that p-values are no longer saved (planned for a future release)
  • Remove downsample and sample_num
  • Upped num_doubs default value to 100 (from 30)
  • Require 'tidyr', 'R.utils', 'foreach', 'doParallel', 'stringr'. No longer require 'hopach'
  • Changed log_name_file to the value for filename, for compatibility with Windows operating systems
  • Automatically write processed data and groups files from Clean_Up_Data (even if write=FALSE) for use with new Rescue step (Pseudo_Marker_Finder)
  • Finalize switch to more granular doublet calls in the case of no Rescue step
  • Change name of cluster merging plot to "Cluster Merge" (from "Blacklist")
  • Fixed bug in Remove step when number of clusters equals 2 (Euclidean distance is used in the place of Pearson correlation)

Updates - Version 1.0.2 : January 9th, 2019

  • General bug fixes affecting final groups file and final expression file output.
  • Added user option to specify minimum number of unique genes to Rescue a putative doublet cluster (previously set at 4).
  • Speed up run time for users who do not use the Rescue step (PMF=FALSE).
  • Remove requirement for 'as.color' function.

Updates - Version 1.0.1 : December 26th, 2018

  • Additional "Remove" step option to create synthetic doublet centroids with 30%/70% and 70%/30% parent cell contribution instead of simply 50%/50% (only50=FALSE).
  • Heatmap generation corrected for large datasets (>5000 cells).
  • "Rescue" step modification from t-tests for all clusters to ANOVA with Tukey post-hoc test in only putative doublet clusters. Minimum of 4 unique genes as hardcoded default.
  • "Rescue" step now allows for sampling of clusters evenly or proportional to cluster size when using the full expression matrix.
  • Hopach removed as a "Recluster" option; does not work with improved "Rescue" step. Subsequently removed the DeconCalledFreq table as written and returned output.
  • Log file is generated with a unique ID for each run of DoubletDecon.
  • Catch error in mcl() function and quits DoubletDecon with warning to choose a different rhop value.
  • Synthetic doublet deconvolution values output for quality control (Synth_doublet_info)

Installation

Run the following code to install the package using devtools:

if(!require(devtools)){
  install.packages("devtools") # If not already installed
}
devtools::install_github('EDePasquale/DoubletDecon')

Dependencies

DoubletDecon requires the following R packages:

  • DeconRNASeq
  • gplots
  • dplyr
  • MCL
  • clusterProfiler
  • mygene
  • tidyr
  • R.utils
  • foreach
  • doParallel
  • stringr
  • Seurat (for Improved_Seurat_Pre_Process only)

These can be installed with:

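# biocLite() was replaced by BiocManager as of Bioconductor 3.8; BiocManager::install() handles Bioconductor and CRAN packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("DeconRNASeq", "clusterProfiler", "mygene", "gplots", "dplyr", "tidyr", "R.utils", "foreach", "doParallel", "stringr"))
install.packages("MCL")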

Additionally, the use of the cell cycle removal option requires an internet connection.
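
Before running DoubletDecon it can help to confirm that the packages listed above are actually available. The following is a minimal sketch in base R, not part of DoubletDecon itself; the helper name missing_packages() is hypothetical and introduced only for illustration.

```r
# Packages listed under "Dependencies". Seurat is checked separately because
# it is only needed for Improved_Seurat_Pre_Process().
required <- c("DeconRNASeq", "gplots", "dplyr", "MCL", "clusterProfiler",
              "mygene", "tidyr", "R.utils", "foreach", "doParallel", "stringr")
optional <- "Seurat"

# Hypothetical helper (not a DoubletDecon function): returns the subset of
# `pkgs` whose namespace cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_required <- missing_packages(required)
missing_optional <- missing_packages(optional)

if (length(missing_required) > 0) {
  message("Missing required packages: ",
          paste(missing_required, collapse = ", "))
} else {
  message("All required packages are installed.")
}
if (length(missing_optional) > 0) {
  message("Seurat is not installed; Improved_Seurat_Pre_Process() will be unavailable.")
}
```

Any names reported as missing can then be passed to the install commands shown above.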

Usage

Seurat data only:

Improved_Seurat_Pre_Process(seuratObject, num_genes=50, write_files=FALSE)

Arguments

  • seuratObject: Seurat object, for example one produced by the workflow at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
  • num_genes: Number of top marker genes to use. Set to 50 in the usage above.
  • write_files: Whether to write the output files. Set to FALSE in the usage above.

Value

  • newExpressionFile - Seurat expression file in ICGS format (ICGS genes)
  • newFullExpressionFile - Seurat expression file in ICGS format (all genes)
  • newGroupsFile - Groups file ICGS format

Seurat_Pre_Process(expressionFile, genesFile, clustersFile)

Arguments

  • expressionFile: Normalized expression matrix or counts file as a .txt file (expression from Seurat's NormalizeData() function)
  • genesFile: Top marker gene list as a .txt file from Seurat's top_n() function
  • clustersFile: Cluster identities as a .txt file from Seurat object @ident

Value

  • newExpressionFile - Seurat expression file in ICGS format (used as 'rawDataFile')
  • newGroupsFile - Groups file ICGS format (used as 'groupsFile')

Seurat and ICGS data:

Main_Doublet_Decon(rawDataFile, groupsFile, filename, location,
  fullDataFile = NULL, removeCC = FALSE, species = "mmu", rhop = 1,
  write = TRUE, PMF = TRUE, useFull = FALSE, heatmap = TRUE, centroids=FALSE, num_doubs=100, 
  only50=FALSE, min_uniq=4)

Arguments

  • rawDataFile: Name of file containing ICGS or Seurat expression data (gene by cell)
  • groupsFile: Name of file containing group assignments (3 column: cell, group(numeric), group(numeric or character))
  • filename: Unique filename to be incorporated into the names of outputs from the functions.
  • location: Directory where output should be stored
  • fullDataFile: Name of file containing full expression data (gene by cell). Default is NULL.
  • removeCC: Remove cell cycle gene cluster by KEGG enrichment. Default is FALSE.
  • species: Species as scientific species name, KEGG ID, three letter species abbreviation, or NCBI ID. Default is "mmu".
  • rhop: x in mean+x*SD to determine upper cutoff for correlation in the blacklist. Default is 1.
  • write: Write output files as .txt files. Default is TRUE.
  • recluster: Recluster deconvolution classified doublets and non-doublets separately using hopach or deconvolution classifications.
  • PMF: Use step 3 (unique gene expression) in doublet determination criteria. Default is TRUE.
  • useFull: Use full gene list for PMF analysis. Requires fullDataFile. Default is FALSE.
  • heatmap: Boolean value for whether to generate heatmaps. Default is TRUE. Can be slow for datasets larger than ~3000 cells.
  • centroids: Use centroids as references in deconvolution instead of the default medoids.
  • num_doubs: The user defined number of doublets to make for each pair of clusters. Default is 100.
  • only50: Use only synthetic doublets created with a 50%/50% mix of parent cells, as opposed to the extended option of 30%/70% and 70%/30%. Default is FALSE.
  • min_uniq: Minimum number of unique genes required for a cluster to be rescued. Default is 4.
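
To make the rhop, num_doubs, and only50 arguments more concrete, here is a rough base R sketch of the two ideas they describe: a correlation cutoff of the form mean + rhop*SD, and synthetic doublets built as weighted mixes of two parent profiles. This is an illustration under stated assumptions, not DoubletDecon's internal code. The function names, the toy data, the choice to use each cluster pair once (excluding the diagonal), and the assumption that only50=FALSE adds the 30%/70% and 70%/30% mixes on top of 50%/50% are all hypothetical.

```r
set.seed(1)

# Toy reference profiles: 5 genes x 4 clusters (hypothetical values standing
# in for cluster medoids or centroids).
refs <- matrix(runif(20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cluster", 1:4)))

# rhop: upper cutoff = mean + rhop * SD of the pairwise cluster correlations.
# Assumption: each cluster pair is counted once and self-correlations are
# excluded.
rhop_cutoff <- function(reference_profiles, rhop = 1) {
  cors <- cor(reference_profiles)      # cluster-by-cluster Pearson correlation
  pairwise <- cors[upper.tri(cors)]    # upper triangle, diagonal excluded
  mean(pairwise) + rhop * sd(pairwise)
}

# Synthetic doublet: weighted average of two parent expression profiles.
# weight = 0.5 is the 50%/50% mix; 0.3 and 0.7 give 30%/70% and 70%/30%.
mix_parents <- function(cell_a, cell_b, weight = 0.5) {
  weight * cell_a + (1 - weight) * cell_b
}

# Mixing weights implied by only50 (assumed behaviour, see lead-in).
doublet_weights <- function(only50 = FALSE) {
  if (only50) 0.5 else c(0.5, 0.3, 0.7)
}

cutoff <- rhop_cutoff(refs, rhop = 1.1)

# One synthetic doublet per mixing weight for the cluster1/cluster2 pair;
# the result is a genes x weights matrix.
synthetic <- sapply(doublet_weights(only50 = FALSE),
                    function(w) mix_parents(refs[, "cluster1"], refs[, "cluster2"], w))
```

Under this sketch, raising rhop raises the cutoff, so fewer cluster pairs exceed it. In the real function, num_doubs controls how many synthetic doublets are made for each pair of clusters, whereas the sketch makes only one per weight.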

Value

  • data_processed = new expression file (cleaned).
  • groups_processed = new groups file (cleaned).
  • PMF_results = pseudo marker finder t-test results (gene by cluster).
  • DRS_doublet_table = each cell and whether it is called a doublet by deconvolution analysis.
  • DRS_results = results of deconvolution analysis (cell by cluster) in percentages.
  • Decon_called_freq = percentage of doublets called in each cluster by deconvolution analysis.
  • Final_doublets_groups = new groups file containing only doublets.
  • Final_nondoublets_groups = new groups file containing only non doublets.
  • Synth_doublet_info = synthetic doublet deconvolution values output for quality control.

Example

Data for this example can be found in this GitHub repository. Examples are given for both Seurat_Pre_Process() and Improved_Seurat_Pre_Process(), though the latter is preferred if using Seurat 3.

location="/Users/xxx/xxx/" #Update as needed 

# Legacy alternative: Seurat_Pre_Process() (superseded by Improved_Seurat_Pre_Process())
# expressionFile=paste0(location, "counts.txt")
# genesFile=paste0(location, "Top50Genes.txt")
# clustersFile=paste0(location, "Cluster.txt")
# newFiles=Seurat_Pre_Process(expressionFile, genesFile, clustersFile)

#Improved_Seurat_Pre_Process()
seuratObject=readRDS("seurat.rds")
newFiles=Improved_Seurat_Pre_Process(seuratObject, num_genes=50, write_files=FALSE)

filename="PBMC_example"
write.table(newFiles$newExpressionFile, paste0(location, filename, "_expression"), sep="\t")
write.table(newFiles$newFullExpressionFile, paste0(location, filename, "_fullExpression"), sep="\t")
write.table(newFiles$newGroupsFile, paste0(location, filename, "_groups"), sep="\t", col.names = FALSE)

results=Main_Doublet_Decon(rawDataFile=newFiles$newExpressionFile, 
                           groupsFile=newFiles$newGroupsFile, 
                           filename=filename, 
                           location=location,
                           fullDataFile=NULL, 
                           removeCC=FALSE, 
                           species="hsa", 
                           rhop=1.1, 
                           write=TRUE, 
                           PMF=TRUE, 
                           useFull=FALSE, 
                           heatmap=FALSE,
                           centroids=TRUE,
                           num_doubs=100, 
                           only50=FALSE,
                           min_uniq=4)
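
Once the run finishes, a quick sanity check is to count how many cells ended up in each of the final groups tables. The sketch below uses a mock list in place of the real results object so that it runs on its own. It assumes that Final_doublets_groups and Final_nondoublets_groups are data frames with one row per cell, which follows from their description as groups files but is not confirmed here. The column names and the helper summarise_doublets() are hypothetical.

```r
# Mock stand-in for the list returned by Main_Doublet_Decon(). Only the two
# final groups tables are mimicked, with one row per cell (made-up values).
results_mock <- list(
  Final_doublets_groups = data.frame(
    group = c(1, 3), label = c("1", "3"),
    row.names = c("cellB", "cellE")
  ),
  Final_nondoublets_groups = data.frame(
    group = c(1, 2, 2), label = c("1", "2", "2"),
    row.names = c("cellA", "cellC", "cellD")
  )
)

# Hypothetical helper: counts cells in each table and reports the fraction
# of all cells that were called doublets.
summarise_doublets <- function(res) {
  n_doublets <- nrow(res$Final_doublets_groups)
  n_nondoublets <- nrow(res$Final_nondoublets_groups)
  c(doublets = n_doublets,
    nondoublets = n_nondoublets,
    doublet_fraction = n_doublets / (n_doublets + n_nondoublets))
}

summary_mock <- summarise_doublets(results_mock)
print(summary_mock)
```

With a real run, summarise_doublets(results) would be called on the object returned above, provided the two elements have the assumed shape.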