milaboratory/mixcr

Artificial gaps in reported amino acid sequences

Closed this issue · 2 comments

When looking at the output of exportClones from bulk IgG sequences, I have quite a few sequences which are productive (i.e. no stop, no frame shift, etc) and complete/non-gapped nt sequences, but the amino acid translations have _ reported. I am using v4.6.0, a custom reference, and assemble contigs by VDJRegion.

This is the code I ran:
mixcr exportClones --filter-out-of-frames --filter-stops -nMutations "{FR1Begin:FR3End}" -nMutationsRate "{FR1Begin:FR3End}" -aaMutations "{FR1Begin:FR3End}" -aaMutationsRate "{FR1Begin:FR3End}" 5RACEbulk.TermSpleen.allelic.clns 5RACEbulk.TermSpleen.allelic.tsv

An example of a concerning output is:

Target_sequence
GAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCCCTCACCTGCACTGTGTCCGGGGGCTCAGTCAGCAACATTTACTATTGGCACTGGATCCGCCAGCCCCCGGGGAGAGGGTTGCAGTGGATGGGGTATTGGACAGGCAGCACAAACTACAACCCAGCCTTCCAGGGCCGCATCTCCATCTCTCCTGATACGGCCAGAACCCAGTTGTCCCTGCAACTGAATTCCGTGACCGCCGAGGACACGGCCTTCTATTACTGTGCAAGAGGGTTGTGGAGTAGCGGGGGCTGGTTCGACTCCTGGGGCCAGGGCACTCTGGTCACCGTCTCCTCCG

Specifically concerning are these feature translations:

aaSeqFR1
EVQLQESGPGLV_ALADPLPHLHCV
aaSeqCDR1
RGLSQ_NIYY
aaSeqFR2
WHWIRQPP_ERVAVDGV
aaSeqCDR2
LDR_ST
aaSeqFR3
NYNPAFQGRISISPDTARTQLSLQLNSVTAEDTAFYY
aaSeqCDR3
CARGLWSSGGWFDSW
aaSeqFR4
GQGTLVTVSS_

When I translate the target sequence, I get the following:

rf 1 Target_sequence
EVQLQESGPGLVKPSQTLSLTCTVSGGSVSNIYYWHWIRQPPGRGLQWMGYWTGSTNYNP
AFQGRISISPDTARTQLSLQLNSVTAEDTAFYYCARGLWSSGGWFDSWGQGTLVTVSS

And when I put the nucleotide sequence into IgBLAST this is the output I get which is what I would expect from MiXCR:

aaSeqFR1
VQLQESGPGLVKPSQTLSLTCTVS
aaSeqCDR1
GGSVSNIYY
aaSeqFR2
WHWIRQPPGRGLQWMG
aaSeqCDR2
YWTGST
aaSeqFR3
NYNPAFQGRISISPDTARTQLSLQLNSVTAEDTAFYYC
aaSeqCDR3
ARGLWSSGGWFDS
aaSeqFR4
WGQGTLVTVSS

Hi, it's hard to tell without looking at the alignment of the specific clone, but I think I know what is going on. So this is a tricky thing. My guess is there are some nucleotide deletions in the sequence. We know that all FRs should end with a complete codon thus should have a strict AA border. If it doesn't, it is not clear where to assign the last AA during the translation. Thats why we translate each FR and CDR from both ends moving towards the center and in a normal scenario this returns an accurate translation. If one nucleotide was deleted or inserted there will be a wrong AA sequence for this region, when exported separately. So this is done in order to correctly define AA borders of the segments ending with a complete codon. The workaround here is to add an export column with the VRegion or the VDJRegion for example. These regions are translated from one end, eliminating the need to distinguish segment borders, and return the correct AA receptor sequence. Use -aaFeature VRegion or -aaFeature VDJRegion for this purpose.

Ok, thank you!