This repository contains the scripts used to generate the results in
- C Soneson, A Srivastava, R Patro, MB Stadler: Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. bioRxiv (2020).
In this manuscript, five real scRNA-seq data sets generated by 10x Genomics technology were analyzed:
Dentate gyrus
(Hochgerner et al. 2018) - GEO accession GSE95315Pancreas
(Bastidas-Ponce et al. 2019) - GEO accession GSE132188OldBrain
(Ximerakis et al. 2019) - GEO accession GSE129788 (sample accession GSM3722108)PFC
(Bhattacherjee et al. 2019) - GEO accession GSE124952 (sample accession GSM3559979)Spermatogenesis
(Hermann et al. 2018) - GEO accession GSE109033 (sample accession GSM2928341)
For each of these data sets, the corresponding subfolder of this repository contains the Makefile that was used to run the analysis. The R and Python scripts called in these Makefiles are provided in the scripts/
folder.
The downloaded data was preprocessed as follows before being used in the evaluation:
- FASTQ files for individual cells from the p12 and p35 time points were downloaded (accession date November 23, 2019) and merged using the GSE95315_download.sh script.
- The sample annotation file (
cells_kept_in_scvelo_exampledata.csv
) was obtained via the scVelo Python package (v0.1.24):
import scvelo as scv
adata = scv.datasets.dentategyrus()
adata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
- FASTQ files for the E15.5 sample were downloaded from ENA (accession date November 24, 2019)
- The FASTQ files were subsequently renamed, replacing
_X
withS1_L001_RX_001
(forX
= 1,2) - The sample annotation file (
cells_kept_in_scvelo_exampledata.csv
) was obtained via the scVelo Python package (v0.1.24):
import scvelo as scv
aadata = scv.datasets.pancreatic_endocrinogenesis()
aadata.obs.to_csv("cells_kept_in_scvelo_exampledata.csv")
- The BAM file with the reads was downloaded (accession date September 18, 2020) from SRA
- FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 OX1X.bam FASTQtmp
## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_I1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_I1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_I1_001.fastq.gz
cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R1_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R1_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R1_001.fastq.gz
cat FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HNGW3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HVN2NBGX2/bamtofastq_S1_L004_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L003_R2_001.fastq.gz \
FASTQtmp/s9_MissingLibrary_1_HW2M3BGX2/bamtofastq_S1_L004_R2_001.fastq.gz > \
FASTQ/OX1X_S1_L001_R2_001.fastq.gz
- The sample annotation file (
GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt
) was downloaded from the GEO record. It was processed to retain only cells from sample 37 using the following R code, generating the finalcells_in_sample37.csv
file.
suppressPackageStartupMessages({
library(dplyr)
})
csv <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
header = FALSE, as.is = TRUE, skip = 2)
colnames(csv) <- read.delim("GSE129788_Supplementary_meta_data_Cell_Types_Etc.txt",
header = FALSE, as.is = TRUE)[1, ]
## Only Sample 37
csv <- csv %>% dplyr::filter(grepl("Aging_mouse_brain_portal_data_37", NAME)) %>%
dplyr::mutate(index = gsub("Aging_mouse_brain_portal_data_37_", "", NAME)) %>%
dplyr::rename(clusters = cell_classes,
clusters_coarse = cluster) %>%
dplyr::select(index, clusters_coarse, clusters, everything())
write.table(csv, file = "cells_in_sample37.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)
- The BAM file with the reads was downloaded (accession date December 19, 2019) from SRA
- FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 PFC_Sample2.bam FASTQtmp
## Concatenate FASTQ files from all flowcells and lanes
mkdir -p FASTQ
cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_I1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_I1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_I1_001.fastq.gz
cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R1_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R1_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R1_001.fastq.gz
cat FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HJW35BCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNNFBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HMNTLBCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
FASTQtmp/PFC_sample2_MissingLibrary_1_HNGGMBCXY/bamtofastq_S1_L002_R2_001.fastq.gz > \
FASTQ/PFCsample2_S1_L001_R2_001.fastq.gz
- The sample annotation file (
GSE124952_meta_data.csv
) was downloaded from the GEO record. It was processed to retain only cells from PFCsample2 using the following R code, generating the finalcells_in_pfcsample2.csv
file:
suppressPackageStartupMessages({
library(dplyr)
})
csv <- read.delim("GSE124952_meta_data.csv", header = TRUE, as.is = TRUE, sep = ",")
## Only Sample2
csv <- csv %>% dplyr::filter(Sample == "PFCSample2") %>%
dplyr::mutate(Sample = gsub("Sample", "sample", Sample)) %>%
dplyr::mutate(index = gsub("PFCSample2_", "", X)) %>%
dplyr::rename(clusters = CellType,
clusters_coarse = L2_clusters) %>%
dplyr::select(index, clusters_coarse, clusters, everything())
write.table(csv, file = "cells_in_pfcsample2.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)
- The BAM file from the AdultMouse_Rep3 sample was downloaded (accession date January 23, 2020) from SRA.
- FASTQ files were extracted from the BAM file using bamtofastq v1.1.2:
bamtofastq_1.1.2/bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp
mkdir -p FASTQ
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz FASTQ/AdultMouseRep3_S1_L001_R2_001.fastq.gz
- The sample annotation file (
spermatogenesis_loupe_celltypes_repl3.csv
) was obtained by downloading the loupe fileMouse Unselected Spermatogenic cells.cloupe
from here, loading it into the 10x Genomics Loupe browser and exporting the "Cell Type" labels. The resulting file was subset to only include replicate 3 and cell types with at least 5 cells, using the following R code:
suppressPackageStartupMessages({
library(dplyr)
})
csv <- read.delim("Spermatogenesis_Loupe_CellTypes.csv",
header = TRUE, as.is = TRUE, sep = ",")
## Only Replicate 3
csv <- csv %>% dplyr::filter(grepl("-3", Barcode)) %>%
dplyr::mutate(index = gsub("-3", "", Barcode)) %>%
dplyr::mutate(clusters = Cell.Types,
clusters_coarse = Cell.Types) %>%
dplyr::select(index, clusters_coarse, clusters)
## Remove rare cell types (<5 cells)
tbl <- table(csv$clusters)
kp <- names(tbl)[tbl >= 5]
csv <- csv %>% dplyr::filter(clusters %in% kp)
write.table(csv, file = "spermatogenesis_loupe_celltypes_repl3.csv",
row.names = FALSE, col.names = TRUE,
sep = ",", quote = FALSE)
All reference files were obtained from GENCODE, mouse release M21. In addition to the reference files and indices generated in the respective Makefiles, the following indices were prepared:
The CellRanger
index was generated as follows (with CellRanger
v3.0.2):
cellranger mkref --nthreads 64 \
--genome=cellranger3_0_2_GRCm38.primary_assembly_gencodeM21_spliced \
--fasta=GRCm38.primary_assembly.genome.fa \
--genes=gencode.vM21.annotation.gtf \
--ref-version=GRCm38.primary_assembly_gencodeM21_spliced
The STAR
index used for quantification was generated using STAR
v 2.7.3a:
mkdir star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150
STAR \
--runThreadN 48 \
--runMode genomeGenerate \
--genomeDir ./star2_7_3a_GRCm38.primary_assembly_gencodeM21_spliced_sjdb150 \
--genomeFastaFiles GRCm38.primary_assembly.genome.fa \
--genomeSAindexNbases 14 \
--genomeChrBinNbits 18 \
--sjdbGTFfile gencode.vM21.annotation.gtf \
--sjdbOverhang 150
Finally, the transcript-to-gene mapping files (in .rds
and .txt
format) were generated with R v3.6/Bioconductor release 3.9 as follows:
suppressPackageStartupMessages({
library(rtracklayer)
library(dplyr)
})
entrez <- read.delim("gencode.vM21.metadata.EntrezGene", header = FALSE, as.is = TRUE) %>%
setNames(c("transcript_id","entrez_id"))
gtf <- rtracklayer::import("gencode.vM21.annotation.gtf")
gtftx <- subset(gtf, type == "transcript")
gtfex <- subset(gtf, type == "exon")
df <- data.frame(gtftx, stringsAsFactors = FALSE) %>%
dplyr::select(transcript_id, seqnames, start, end, strand, source,
gene_id, gene_type, gene_name, level, havana_gene, transcript_type,
transcript_name, transcript_support_level, tag, havana_transcript) %>%
dplyr::left_join(data.frame(gtfex, stringsAsFactors = FALSE) %>%
dplyr::group_by(transcript_id) %>%
dplyr::summarize(transcript_length = sum(width)),
by = "transcript_id") %>%
dplyr::left_join(entrez, by = "transcript_id")
dfsub <- df %>% dplyr::select(transcript_id, gene_id)
write.table(dfsub, file = "gencode.vM21.annotation.tx2gene.txt",
sep = "\t", quote = FALSE, row.names = FALSE,
col.names = FALSE)
saveRDS(dfsub, file = "gencode.vM21.annotation.tx2gene.rds")
saveRDS(df, file = "gencode.vM21.annotation.tx2gene_full.rds")