Teichlab/tracer

very low overall alignment rate with bowtie2

noorisotoudeh opened this issue · 9 comments

Thanks for your nice package for single-cell TCR analysis. I have a question about alignment. Surprisingly, all of my FASTQ files show a very low mapping rate. I checked the FastQC reports and they look good, so I don't know why the rate is so low. I can run the test data and get a 55.51% overall alignment rate for TCR_A and 44.19% for TCR_B.
I use the following command, and this is the alignment output:

$ tracer assemble -p 8 -c tracer.conf -s Hsap fastq1.gz fastq2.gz name output_dir
##Finding recombinant-derived reads##
Attempting new assembly for ['TCR_A', 'TCR_B']

##TCR_A##
801683 reads; of these:
801683 (100.00%) were paired; of these:
801376 (99.96%) aligned concordantly 0 times
307 (0.04%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
801376 pairs aligned concordantly 0 times; of these:
6 (0.00%) aligned discordantly 1 time
----
801370 pairs aligned 0 times concordantly or discordantly; of these:
1602740 mates make up the pairs; of these:
1602555 (99.99%) aligned 0 times
185 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.05% overall alignment rate
##TCR_B##
801683 reads; of these:
801683 (100.00%) were paired; of these:
801548 (99.98%) aligned concordantly 0 times
135 (0.02%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
801548 pairs aligned concordantly 0 times; of these:
1 (0.00%) aligned discordantly 1 time
----
801547 pairs aligned 0 times concordantly or discordantly; of these:
1603094 mates make up the pairs; of these:
1602962 (99.99%) aligned 0 times
130 (0.01%) aligned exactly 1 time
2 (0.00%) aligned >1 times
0.03% overall alignment rate

I have also edited the headers of the FASTQ files by removing the sequence index etc., but it didn't change the result. Here is the original header of my FASTQ file:

@NB501311:706:HNFTMBGXH:1:11101:6390:1116 1:N:0:CGGAGCCT+ATAGAGAG
AGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCT
+
AAAAAEEEEEEEEEEEEEEE/EEEEEEEEAEEAEAEEE
@NB501311:706:HNFTMBGXH:1:11101:3328:1118 1:N:0:CGGAGCCT+ATAGAGAG
GTGGAGATACCTCCTGTGTCTCCAGGATGGGTGGAGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501311:706:HNFTMBGXH:1:11101:14759:1145 1:N:0:CGGAGCCT+ATAGAGAG
CCCTAGGCTTCCCCCGTGTGCCTGGACGAGTGCTGGTG

That said, tracer did produce a name_TCRseqs.fa file here, although for many other files it is empty.
Do you think there is something wrong with my data or with how I am running it?

Thanks,

Noori

Hi Mike,
Thanks for your quick response. These are full-length single-cell RNA-seq libraries prepared using the SMART-seq2 protocol. The cDNA was fragmented using Illumina and amplified with indexed Nextera PCR primers.

Yes, they are.

Great, thanks.

Looking at this in a bit more detail, I think your mapping rates are roughly what one might expect in data from a genuine cell. The test data rates (~50%) are so high because those test input files were selected to be enriched for TCR sequences.
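
For anyone who wants to sanity-check the percentages in the bowtie2 output above, here is a minimal sketch of the arithmetic. It assumes the usual bowtie2 convention for paired-end data, where the overall rate counts aligned mates (two per aligned pair, plus mates that aligned on their own) out of all mates. The function name and argument names are made up for illustration and are not part of tracer or bowtie2.

```python
def overall_alignment_rate(total_pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Overall alignment rate (%) from a paired-end bowtie2 summary.

    Assumption: each concordantly or discordantly aligned pair contributes
    two aligned mates, and the remaining mates are counted individually.
    """
    aligned_mates = 2 * (conc_once + conc_multi + disc_once) + mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * total_pairs)


# Counts copied from the TCR_A and TCR_B summaries posted in this issue.
tcr_a = overall_alignment_rate(801683, 307, 0, 6, 185, 0)
tcr_b = overall_alignment_rate(801683, 135, 0, 1, 130, 2)

print(f"TCR_A: {tcr_a:.2f}% overall alignment rate")
print(f"TCR_B: {tcr_b:.2f}% overall alignment rate")
```

With the posted counts this gives 811 aligned mates for TCR_A and 404 for TCR_B out of 1,603,366 mates, which rounds to the reported 0.05% and 0.03%.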

For what proportion of your cells do you get reconstructed TCR sequences? If you run tracer summarise for all of your cells, what does TCR_summary.txt say?

TCR_A reconstruction: 35 / 109 (32.1%)
TCR_B reconstruction: 13 / 109 (11.9%)

AB productive reconstruction: 7 / 109 (6.4%)

+--------+----------------+---------------+----------------+
| | 0 recombinants | 1 recombinant | 2 recombinants |
+--------+----------------+---------------+----------------+
| all A | 71 | 34 (89%) | 4 (11%) |
| all B | 96 | 13 (100%) | 0 (0%) |
| prod A | 74 | 33 (94%) | 2 (6%) |
| prod B | 96 | 13 (100%) | 0 (0%) |
+--------+----------------+---------------+----------------+

#Clonotype groups#
This is a text representation of the groups shown in clonotype_network_with_identifiers.svg.
It does not exclude cells that only share beta and not alpha.

Thanks!

So, it looks like reconstruction is working but at a fairly low rate.
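
To make "fairly low" concrete, the rates in the summary are just the number of cells with a reconstructed chain divided by the 109 cells analysed. A small sketch that recomputes them from the counts posted above (the variable names are only for illustration):

```python
# Counts copied from the TCR_summary.txt excerpt posted in this issue.
n_cells = 109
reconstructed = {"TCR_A": 35, "TCR_B": 13, "AB productive": 7}

# Percentage of cells with a reconstruction for each category.
rates = {name: 100.0 * n / n_cells for name, n in reconstructed.items()}

for name, rate in rates.items():
    print(f"{name} reconstruction: {reconstructed[name]} / {n_cells} ({rate:.1f}%)")
```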

Without knowing more about your experiment I'd guess that this is due to some property of the cells or the sequencing.

  • Are the cells naive or memory/activated? Naive T cells have lower TCR expression, so it's harder to reconstruct their TCRs.
  • Are the cells sequenced deeply? Shallow per-cell sequencing can leave too few reads for reconstruction.
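
If it helps with the second point, here is a rough sketch for checking per-cell depth by counting reads in each FASTQ file. It assumes standard four-line FASTQ records and is not part of tracer; the function name is hypothetical, and the file paths are whatever you pass on the command line.

```python
import gzip
import sys


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record.

    Files ending in .gz are opened with gzip; anything else as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


if __name__ == "__main__":
    # Usage: python count_reads.py cell1_R1.fastq.gz cell2_R1.fastq.gz ...
    for fastq_path in sys.argv[1:]:
        print(f"{fastq_path}\t{count_fastq_reads(fastq_path)}")
```

Comparing these counts between cells with and without a reconstructed TCR may show whether depth is the limiting factor.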

Sorry I can't be more help.

M

Thank you so much for your help. Yes, I think most of them are naive, so that's probably why so few reads map.
Thanks,
Noori

No problem. Good luck with the experiments!