Preprocessing choices affect RNA velocity results for droplet scRNA-seq data

This repository contains the scripts used to generate the results in

C Soneson, A Srivastava, R Patro, MB Stadler: Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. bioRxiv (2020).

Repository structure

In this manuscript, five real scRNA-seq data sets generated by 10x Genomics technology were analyzed:

Dentate gyrus (Hochgerner et al. 2018) - GEO accession GSE95315
Pancreas (Bastidas-Ponce et al. 2019) - GEO accession GSE132188
OldBrain (Ximerakis et al. 2019) - GEO accession GSE129788 (sample accession GSM3722108)
PFC (Bhattacherjee et al. 2019) - GEO accession GSE124952 (sample accession GSM3559979)
Spermatogenesis (Hermann et al. 2018) - GEO accession GSE109033 (sample accession GSM2928341)

For each of these data sets, the corresponding subfolder of this repository contains the Makefile that was used to run the analysis. The R and Python scripts called in these Makefiles are provided in the scripts/ folder.

Data preprocessing

The downloaded data was preprocessed as follows before being used in the evaluation:

Dentate gyrus

FASTQ files for individual cells from the p12 and p35 time points were downloaded (accession date November 23, 2019) and merged using the GSE95315_download.sh script.
The sample annotation file (cells_kept_in_scvelo_exampledata.csv) was obtained via the scVelo Python package (v0.1.24):

import scvelo as scv
adata = scv.datasets.dentategyrus()
adata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")

Pancreas

FASTQ files for the E15.5 sample were downloaded from ENA (accession date November 24, 2019)
The FASTQ files were subsequently renamed, replacing _X with S1_L001_RX_001 (for X = 1,2)
The sample annotation file (cells_kept_in_scvelo_exampledata.csv) was obtained via the scVelo Python package (v0.1.24):

import scvelo as scv
aadata = scv.datasets.pancreatic_endocrinogenesis()
aadata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")

OldBrain

The BAM file with the reads was downloaded (accession date September 18, 2020) from SRA
FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:

bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 OX1X.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_I1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R1_001.fastq.gz

cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R2_001.fastq.gz

The sample annotation file (GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt) was downloaded from the GEO record. It was processed to retain only cells from sample 37 using the following R code, generating the final cells_in_sample37.csv file.

suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt", 
                  header = FALSE, as.is = TRUE, skip = 2)
colnames(csv) <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
                            header = FALSE, as.is = TRUE)[1, ]

## Only Sample 37
csv <- csv %>% dplyr::filter(grepl("Aging_mouse_brain_portal_data_37", NAME)) %>%
  dplyr::mutate(index = gsub("Aging_mouse_brain_portal_data_37_", "", NAME)) %>%
  dplyr::rename(clusters = cell_classes,
                clusters_coarse = cluster) %>%
  dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_sample37.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)

PFC

The BAM file with the reads was downloaded (accession date December 19, 2019) from SRA
FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:

bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 PFC_Sample2.bam FASTQtmp

## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_I1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_I1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R1_001.fastq.gz

cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R2_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R2_001.fastq.gz

The sample annotation file (GSE124952_meta_data.csv) was downloaded from the GEO record. It was processed to retain only cells from PFCsample2 using the following R code, generating the final cells_in_pfcsample2.csv file:

suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("GSE124952_meta_data.csv", header = TRUE, as.is = TRUE, sep = ",")

## Only Sample2
csv <- csv %>% dplyr::filter(Sample == "PFCSample2") %>%
  dplyr::mutate(Sample = gsub("Sample", "sample", Sample)) %>%
  dplyr::mutate(index = gsub("PFCSample2_", "", X)) %>%
  dplyr::rename(clusters = CellType,
                clusters_coarse = L2_clusters) %>%
  dplyr::select(index, clusters_coarse, clusters, everything())

write.table(csv, file = "cells_in_pfcsample2.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)

Spermatogenesis

The BAM file from the AdultMouse_Rep3 sample was downloaded (accession date January 23, 2020) from SRA.
FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:

bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp
mkdir -p FASTQ
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R2_001.fastq.gz

The sample annotation file (spermatogenesis_loupe_celltypes_repl3.csv) was obtained by downloading the loupe file Mouse Unselected Spermatogenic cells.cloupe from here, loading it into the 10x Genomics Loupe browser and exporting the "Cell Type" labels. The resulting file was subset to only include replicate 3 and cell types with at least 5 cells, using the following R code:

suppressPackageStartupMessages({
  library(dplyr)
})

csv <- read.delim("Spermatogenesis_Loupe_CellTypes.csv", 
                  header = TRUE, as.is = TRUE, sep = ",")

## Only Replicate 3
csv <- csv %>% dplyr::filter(grepl("-3", Barcode)) %>%
  dplyr::mutate(index = gsub("-3", "", Barcode)) %>%
  dplyr::mutate(clusters = Cell.Types,
                clusters_coarse = Cell.Types) %>%
  dplyr::select(index, clusters_coarse, clusters)

## Remove rare cell types (<5 cells)
tbl <- table(csv$clusters)
kp <- names(tbl)[tbl >= 5]
csv <- csv %>% dplyr::filter(clusters %in% kp)

write.table(csv, file = "spermatogenesis_loupe_celltypes_repl3.csv",
            row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)

Reference generation

All reference files were obtained from GENCODE, mouse release M21. In addition to the reference files and indices generated in the respective Makefiles, the following indices were prepared:

CellRanger index

The CellRanger index was generated as follows (with CellRanger v3.0.2):

cellranger mkref --nthreads 64 \
--genome=cellranger3_0_2_GRCm38.primary_assembly_gencodeM21_spliced \
--fasta=GRCm38.primary_assembly.genome.fa \
--genes=gencode.vM21.annotation.gtf \
--ref-version=GRCm38.primary_assembly_gencodeM21_spliced

STAR index

The STAR index used for quantification was generated using STAR v 2.7.3a:

mkdir star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150
STAR \
  --runThreadN 48 \
  --runMode genomeGenerate \
  --genomeDir ./star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150 \
  --genomeFastaFiles GRCm38.primary_assembly.genome.fa \
  --genomeSAindexNbases 14 \
  --genomeChrBinNbits 18 \
  --sjdbGTFfile gencode.vM21.annotation.gtf \
  --sjdbOverhang 150

Transcript-to-gene mapping

Finally, the transcript-to-gene mapping files (in .rds and .txt format) were generated with R v3.6/Bioconductor release 3.9 as follows:

suppressPackageStartupMessages({
  library(rtracklayer)
  library(dplyr)
})

entrez <- read.delim("gencode.vM21.metadata.EntrezGene", header = FALSE, as.is = TRUE) %>%
  setNames(c("transcript_id","entrez_id"))
gtf <- rtracklayer::import("gencode.vM21.annotation.gtf")
gtftx <- subset(gtf, type == "transcript")
gtfex <- subset(gtf, type == "exon")

df <- data.frame(gtftx, stringsAsFactors = FALSE) %>%
  dplyr::select(transcript_id, seqnames, start, end, strand, source, 
                gene_id, gene_type, gene_name, level, havana_gene, transcript_type,
                transcript_name, transcript_support_level, tag, havana_transcript) %>%
  dplyr::left_join(data.frame(gtfex, stringsAsFactors = FALSE) %>%
                     dplyr::group_by(transcript_id) %>% 
                     dplyr::summarize(transcript_length = sum(width)),
                   by = "transcript_id") %>%
  dplyr::left_join(entrez, by = "transcript_id")

dfsub <- df %>% dplyr::select(transcript_id, gene_id)

write.table(dfsub, file = "gencode.vM21.annotation.tx2gene.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE, 
            col.names = FALSE)
saveRDS(dfsub, file = "gencode.vM21.annotation.tx2gene.rds")
saveRDS(df, file = "gencode.vM21.annotation.tx2gene_full.rds")

csoneson/rna_velocity_quant