broadinstitute/Drop-seq

DigitalExpression: Running digital expression without somehow selecting a set of barcodes to process no longer supported

yz3n18 opened this issue · 2 comments

Affected tool(s)

DigitalExpression

Affected version(s)

  • Drop-seq_tools-2.4.1
  • picard.jar-2.26.1

Description

Hi, I've recently re-analysed public data from GSE141259 on University HPC.
Following the McCarroll lab protocol from 2018, everything went well until I was using DigitalExpression.

Steps to reproduce

Take SRR10574069 as an example.
Step1
prefetch SRR10574069
Step2
fastq-dump --gzip --split-files -A SRR10574069 SRR10574069.sra

Step3-STAR index was downloaded from GSE63269_hg19_mm10_transgenes_reference_metadata.tar.gz
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir reference/star/ --genomeFastaFiles STAR_reference/hg19_mm10_transgenes.fasta --sjdbGTFfile STAR_reference/hg19_mm10_transgenes.gtf --sjdbOverhang 99

Step4-ReduceGtf
Drop-seq_tools-2.4.1/ReduceGtf SD= STAR_reference/hg19_mm10_transgenes.dict GTF=STAR_reference/hg19_mm10_transgenes.gtf OUTPUT=STAR_reference/hg19_mm10_transgenes.reduced.gtf

Step5-CreateIntervalsFiles
Drop-seq_tools-2.4.1/CreateIntervalsFiles SD=STAR_reference/hg19_mm10_transgenes.dict REDUCED_GTF=STAR_reference/hg19_mm10_transgenes.reduced.gtf PREFIX=hg19_mm10_transgenes OUTPUT=STAR_reference

Step6-FastqToSam
java -jar picard.jar FastqToSam F1=../Raw/SRR10574069_1.fastq.gz F2=../Raw/SRR10574069_2.fastq.gz O=SRR10574069_unaligned_read_pairs.bam SM=SRR10574069 SORT_ORDER=queryname

Step7-Cellular tag
Drop-seq_tools-2.4.1/TagBamWithReadSequenceExtended SUMMARY=Picard/SRR10574069_Unaligned_BAM_rawdata_Cellular.bam_summary.txt BASE_RANGE=1-12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 OUTPUT=Picard/SRR10574069_unaligned_read_Cellular.bam INPUT=Picard/SRR10574069_unaligned_read_pairs.bam

Step8-Molecular tag
Drop-seq_tools-2.4.1/TagBamWithReadSequenceExtended SUMMARY=Picard/SRR10574069_Unaligned_BAM_rawdata_Molecular.bam_summary.txt BASE_RANGE=13-20 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=True TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1 OUTPUT=Picard/SRR10574069_unaligned_read_Molecular.bam INPUT=Picard/SRR10574069_unaligned_read_Cellular.bam

Step9-FilterBam
Drop-seq_tools-2.4.1/FilterBam TAG_REJECT=XQ INPUT=Picard/SRR10574069_unaligned_read_Molecular.bam OUTPUT=Picard/SRR10574069_unaligned_read_Molecular_filtered.bam

Step10-TrimStarting
Drop-seq_tools-2.4.1/TrimStartingSequence INPUT=Picard/SRR10574069_unaligned_read_Molecular_filtered.bam OUTPUT=Picard/SRR10574069_unaligned_read_Molecular_trimmed.bam OUTPUT_SUMMARY=Picard/SRR10574069_unaligned_read_Molecular_trimming_report.txt SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG MISMATCHES=0 NUM_BASES=5

Step11-PolyATrimmer
Drop-seq_tools-2.4.1/PolyATrimmer INPUT=Picard/SRR10574069_unaligned_read_Molecular_trimmed.bam OUTPUT=Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_filtered.bam OUTPUT_SUMMARY=Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_trimming_report.txt MISMATCHES=0 NUM_BASES=6

Step12-SamToFastq
java -Xmx4g -jar picard.jar SamToFastq INPUT=Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_filtered.bam FASTQ=Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_filtered.fastq

Step13-Alignment
STAR --runThreadN 16 --genomeDir reference/star --readFilesIn Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_filtered.fastq --outFileNamePrefix STAR_SRR10574069

Step14-SortSam
java -Xmx4g -jar picard.jar SortSam I=Picard/STAR_SRR10574069Aligned.out.sam O=Picard/STAR_SRR10574069Aligned.querysorted.bam SO=queryname

Step15-Merge
java -Xmx4g -jar picard.jar MergeBamAlignment REFERENCE_SEQUENCE=STAR_reference/hg19_mm10_transgenes.fasta UNMAPPED_BAM=Picard/SRR10574069_unaligned_read_Molecular_tagged_polyA_filtered.bam ALIGNED_BAM=Picard/STAR_SRR10574069Aligned.querysorted.bam OUTPUT=Picard/SRR10574069_merged.bam INCLUDE_SECONDARY_ALIGNMENTS=false PAIRED_RUN=false

Step16-WithInterval
Drop-seq_tools-2.4.1/TagReadWithInterval I=Picard/SRR10574069_merged.bam O=Picard/SRR10574069_merged_gene_tagged.bam INTERVALS=STAR_reference/hg19_mm10_transgenes.genes.intervals TAG=XG

Step17-TagRead
Drop-seq_tools-2.4.1/TagReadWithGeneExonFunction I=/2019_Strunz/Picard/SRR10574069_merged_gene_tagged.bam O=/2019_Strunz/Picard/SRR10574069_merged_function_tagged.bam ANNOTATIONS_FILE=/reference/star/2019_Strunz_STAR_reference/hg19_mm10_transgenes.refFlat TAG=GE

Step18-Digital
Drop-seq_tools-2.4.1/DigitalExpression I=Picard/SRR10574069_merged_function_tagged_clean.bam O=Picard/SRR10574069_star_gene_exon_tagged.dge.txt.gz SUMMARY=Picard/SRR10574069_star_gene_exon_tagged.dge.summary.txt MIN_NUM_GENES_PER_CELL=200 TMP_DIR=Picard/tmp_dir

Actual behavior

INFO 2021-09-06 18:31:26 BamTagOfTagCounts Processed 70,000,000 records. Elapsed time: 00:03:01s. Time for last 1,000,000: 2s. Last read position: hs37d5:27,461,543. Last read name: SRR10574069.13495697
ERROR 2021-09-06 18:31:26 DigitalExpression Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Mon Sep 06 18:31:26 BST 2021] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 12.15 minutes.
Runtime.totalMemory()=3356491776

Final bam file look

samtools view SRR10574069_merged_function_tagged_clean.bam | head -n5
SRR10574069.12722766 0 HUMAN_1 243123 1 15M110177N11M * 0 0 ACACCAAACTGCGTGCACCTCAAAAC --<--<--77-FJ--7---7A-77-- XC:Z:GATTAGCCATCT MD:Z:13G0A11 XF:Z:INTRONIC PG:Z:STAR RG:Z:A XG:Z:AP006222.2,RP4-669L17.1NH:i:3 NM:i:2 XM:Z:GTTTCCAG ZP:i:27 UQ:i:24 AS:i:18
SRR10574069.43563193 16 HUMAN_1 717336 1 19S55M * 0 0 ATTATCTGCTTTTTAAACTGTTGACATTAATCAACTCATTATTCCTCACAGATTTGTATTCTCCAGAGTATCCA JJAJJJJFJJJJAJJJAFFJJJJJJJJFJ<FJFFJJJJJJJJJJFFJAJJFJJJJJJJJFJFJA<AFFJFF<<A XC:Z:CGCAGTTACTGA MD:Z:55 XF:Z:INTERGENIC PG:Z:STAR RG:Z:A XG:Z:RP11-206L10.9 NH:i:3 NM:i:0 XM:Z:GAGATCAT ZP:i:75 UQ:i:0 AS:i:54
SRR10574069.1455304 16 HUMAN_1 722044 255 19M1S * 0 0 GGTTTCAGATTTTTCATCCT A7--AA77-A7----77<-AXC:Z:TCCTAACAAAAT MD:Z:19 XF:Z:INTERGENIC PG:Z:STAR RG:Z:A XG:Z:RP11-206L10.9 NH:i:1 NM:i:0 XM:Z:TTACAGAA ZP:i:21 UQ:i:0 AS:i:18
SRR10574069.26503679 0 HUMAN_1 743340 1 17M3S * 0 0 GCAGGGAAAGAGTTCATCTT A---7-77--7-7-----7-XC:Z:TACCTCGATGTG MD:Z:17 XF:Z:INTRONIC PG:Z:STAR RG:Z:A XG:Z:RP11-206L10.9 NH:i:4 NM:i:0 XM:Z:TCAAGGCT ZP:i:21 UQ:i:0 AS:i:16
SRR10574069.36157857 16 HUMAN_1 753348 255 15M182700N10M * 0 0 GAGTGAGTTCCGCGGCACGGAGGGC ----77---7-----77--77-<-- XC:Z:GAGACCAGGCTG MD:Z:13A11 XF:Z:INTERGENIC PG:Z:STAR RG:Z:A XG:Z:FAM87B NH:i:1 NM:i:1 XM:Z:CAATAAAA ZP:i:26 UQ:i:12 AS:i:19

I appreciate anyone that takes the time to read. Thanks !

Kind regards,
Lu

alecw commented

Hi Lu,

Your second-to-last step used an old tool that is obsolete. The result is that the tags that DigitalExpression looks for by default are not found, so it doesn't find any reads that map to genes, so MIN_NUM_GENES_PER_CELL threshold is not met, and no cells are found to produce DGE for.

Try using TagReadWithGeneFunction instead of TagReadWithGeneExonFunction. Note that you also don't need TagReadWithInterval.

BTW, you can run TagReadWithGeneFunction on your already-tagged BAM. The obsolete tags don't do any harm, although it would be cleaner to tag the output of MergeBamAlignment, to avoid confusion.

Regards, Alec

Hi Alec,

It worked well.
Thank you so much for all of your help with this.

Kind regards,
Lu