COMBINE-lab/salmon

Identifying sampletags in R2 sequence from Rhapsody

Starahoush opened this issue · 2 comments

Hello there!

I am analysing BD Rhapsody single-cell data and am wondering how to proceed with Salmon Alevin. After alignment, I expected to be able to create a Seurat object that includes the sample tag information obtained from R2, but I could not achieve this.

Background

This is mouse data consisting of two reads: R1 contains the CB (27 bp, divided into three sections of 9 bp) and the UMI, while R2 contains the sample tag information as well as the transcript sequence. I need to distinguish not only the CBs but also the sample tags present in R2, since there are 5 different samples per cartridge.
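To make the R1 layout concrete, here is a minimal shell sketch that pulls the CB and UMI out of a single read with `cut`. The position ranges are the ones passed to `--bc-geometry` and `--umi-geometry` in the alevin command further down; the read itself is made up, and its linker stretches are placeholders rather than real BD linker sequences.

```shell
# Assumed layout (1-based): CB sections at 1-9, 22-30, 44-52; UMI at 53-60.
extract_cb() {
  # Concatenate the three 9 bp CB sections into one 27 bp barcode.
  printf '%s\n' "$1" | cut -c1-9,22-30,44-52
}

extract_umi() {
  printf '%s\n' "$1" | cut -c53-60
}

# Made-up 60 bp read (placeholder linkers, not real BD sequences).
cls1=AAAAAAAAA        # positions 1-9
linker1=CCCCCCCCCCCC  # positions 10-21
cls2=GGGGGGGGG        # positions 22-30
linker2=TTTTTTTTTTTTT # positions 31-43
cls3=ACGTACGTA        # positions 44-52
umi=TTGGCCAA          # positions 53-60
read1="${cls1}${linker1}${cls2}${linker2}${cls3}${umi}"

cb=$(extract_cb "$read1")
umi_seq=$(extract_umi "$read1")
echo "CB=$cb UMI=$umi_seq"
```

Running a few real R1 reads through something like this is a quick way to confirm that the geometry strings match the library layout before launching a full alevin run.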

As stated on page 30 of the BD Library Preparation Protocol, each sample tag is 70 bp long (sampletag + abseq).

I have followed the BD Single-Cell Multiomics Bioinformatics Handbook, page 20 of which states:

To account for every Sample Tag, each Sample Tag sequence in the kit is considered during pipeline analysis, whether the Sample Tags are used in the experiment or specified with a sample name.
The pipeline automatically adds the Sample Tag sequences to the FASTA reference file. Reads that align to a Sample Tag sequence and associate with a putative cell are used to identify the sample for that cell.
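The quoted passage does not spell out how the pipeline makes the call, so the following is only an illustrative sketch, not BD's algorithm. Assuming a hypothetical three-column table of cell barcode, sample tag and read count, it assigns each cell the sample tag with the most reads. A real pipeline would also have to deal with noise and multiplets.

```shell
# Input (stdin): tab-separated cell barcode, sample tag, read count.
# Output: one line per cell with its highest-count sample tag.
assign_sample_tags() {
  awk -F'\t' '
    {
      n = $3 + 0   # force numeric comparison
      if (!($1 in best) || n > max[$1]) { best[$1] = $2; max[$1] = n }
    }
    END { for (cb in best) printf "%s\t%s\t%d\n", cb, best[cb], max[cb] }
  ' | sort
}

# Toy counts; the barcode names are hypothetical.
calls=$(printf '%s\t%s\t%s\n' \
  CELL_A sampletag1G 120 \
  CELL_A sampletag2G 3 \
  CELL_B sampletag2G 5 \
  CELL_B sampletag12G 87 | assign_sample_tags)
printf '%s\n' "$calls"
```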

What I have tried so far

  1. Added the sample tag sequences to the end of the gentrome file
[user@remote]$ tail -n 80 m.mus_gentrome.fa.gz
>JH584295.1 dna_sm:scaffold scaffold:GRCm39:JH584295.1:1:1976:1 REF
GGCTGAGCGGTGACATCATGGGCGGCGGGGTCCCAGACAGGAAGTGGGCGTGGCCTCCCA
CACTCACCCTGGCCCGCGGCGTCTGCCAGGTCGCTGTCCGAGATGCCGCCTGTggggggg
[...]
>sampletag11G
GTTGTCAAGATGCTACCGTTCAGAGGGTTGGCTCAGAGGCCCCAGGCTGCGGACGTCGTCGGACTCGCGT
>sampletag12G
GTTGTCAAGATGCTACCGTTCAGAGCTGGGTGCCTGGTCGGGTTACGTCGGCCCTCGGGTCGCGAAGGTC
  2. Added the sample tag names to the end of the decoy file
[user@remote]$  tail -n 15 m.musculus_decoys.txt
GL456368.1
MU069434.1
JH584295.1
sampletag1G
sampletag2G
[...]
sampletag12G
  3. Created the index
salmon index -t m.mus_gentrome.fa.gz -d m.musculus_decoys.txt -p 12 -i m.mus_salmon_index --gencode
  4. Aligned the 4 FASTQ files from the first cartridge (2 lanes, each with one R1 and one R2)
salmon alevin -l ISR -1  Library1_WTA_S1_L00*_R1_001.fastq -2 Library1_WTA_S1_L00*_R2_001.fastq -i reference_genome/m.musculus/m.mus_salmon_index -p 10 --whitelist reference_genome/m.musculus/bd_rhapsody_barcode.txt -o alevin_output --umi-geometry '1[53-60]' --bc-geometry '1[1-9,22-30,44-52]' --read-geometry '2[1-end]' --tgMap reference_genome/m.musculus/txp2gene_2.tsv

In this command:
whitelist is a file containing the possible CBs (I have not added anything related to sample tags to it).
txp2gene is a file that I did not change either.

  5. Loaded the output in R with ReadAlevin from SeuratWrappers
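Regarding the txp2gene file used in the alignment step: if the sample tags were indexed as ordinary targets rather than decoys, they would presumably also need rows in the file passed to `--tgMap`. That is an assumption, not something confirmed in this thread. A minimal sketch that derives such rows from FASTA headers, mapping each sample tag to itself as its own "gene" (the FASTA below is a toy file with shortened sequences and a placeholder transcript name):

```shell
# Print "name<TAB>name" for every FASTA header that starts with ">sampletag".
sampletag_tgmap_rows() {
  awk '/^>sampletag/ { name = substr($1, 2); printf "%s\t%s\n", name, name }' "$1"
}

workdir=$(mktemp -d)
cat > "$workdir/tags.fa" <<'EOF'
>ENSMUST_placeholder some description
ACGTACGT
>sampletag11G
GTTGTCAAGATGCTACCGTTCAGAG
>sampletag12G
GTTGTCAAGATGCTACCGTTCAGAG
EOF

rows=$(sampletag_tgmap_rows "$workdir/tags.fa")
printf '%s\n' "$rows"
rm -rf "$workdir"
```

The resulting rows could then be appended to a copy of the existing map, leaving the original file untouched.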

My issue

I expected to have, for each cell (CB), information on which sample tag was present in its reads (see page 20, Figure 13 of the Handbook).

However, this information is present neither in the Seurat object nor in the alevin_output directory (grep -r "sample" returns nothing). Can you help me out with this? I am not sure why I don't have any reads mapping to the sample tags, and I have tried everything I could think of.

Kind regards,
Igor

The problem was that I had added the sample tag sequences as decoys only; I had to add them twice.
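For anyone hitting the same symptom: salmon does not quantify decoy sequences, so sample tags named in the decoy file will not appear in the output. A small sketch for spotting that situation, assuming the tags are named with a `sampletag` prefix as above; the file here is a toy stand-in for m.musculus_decoys.txt.

```shell
# List every entry in a decoy file whose name starts with "sampletag".
tags_listed_as_decoys() {
  grep '^sampletag' "$1" || true   # no match is not an error here
}

workdir=$(mktemp -d)
printf '%s\n' GL456368.1 MU069434.1 JH584295.1 sampletag1G sampletag2G \
  > "$workdir/decoys.txt"

flagged=$(tags_listed_as_decoys "$workdir/decoys.txt")
if [ -n "$flagged" ]; then
  echo "Sample tags listed as decoys (these will not be quantified):"
  printf '%s\n' "$flagged"
fi
rm -rf "$workdir"
```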

Hello @Starahoush, greetings from Brazil! I'm an undergraduate student in Biomedical Informatics at the Federal University of Paraná, currently involved in a scientific initiation project in an immunology lab. I'm working extensively with FASTQ files from samples sequenced on the BD Rhapsody V1 platform and I've been facing a challenge: a significant portion of the reads are being discarded due to "noisy cellular barcodes", with around 50% of the reads affected. Could you please share if you've encountered a similar situation in your experiment or provide some guidance on how to address this issue? I appreciate your attention and assistance!