mikessh/migec

How to determine the exact barcode sequence

jessicaluckygirl opened this issue · 3 comments

Hello,
I have recently run into some trouble while using MIGEC to analyze my data.

First, I don't know exactly which barcode sequences should be put into the barcode file for de-multiplexing. I use the sequence ((N)2–4CAGTGGTATCAACGCAGAG) as a nested PCR primer on the 5′ end of the libraries during the second PCR to increase diversity. I don't add a sample barcode here; samples can be separated by the index added during library construction.

I have tried cagtggtatcaacgcagagtNNNNtNNNNtNNNNtct as the barcode sequence in the barcode file, but the output is strange: the count for each clonotype is small. I don't know whether I made a mistake during the analysis.

In addition, I don't know whether the (N)2-4 at the head of cagtggtatcaacgcagagtNNNNtNNNNtNNNNtct would influence the final results, so I have tried adding 2n/3n/4n in front of cagtggtatcaacgcagagtNNNNtNNNNtNNNNtct, respectively. Running the pipeline gives different results, which can be seen in the attachments below. What's more, the error "UMI size do not match sample size" occurs when running with 3n or 4n in front of cagtggtatcaacgcagagtNNNNtNNNNtNNNNtct.

Could you give some advice on these questions? What exact sequence should we put into the barcode file?
Thanks!

cagtggtatcaacgcagagtNNNNtNNNNtNNNNtct.txt
nncagtggtatcaacgcagagtNNNNtNNNNtNNNNtct.txt

Dear Jessica,

  1. MIGEC allows mismatches in the lowercase bases; however, the entire barcode sequence must fit within the read in order to be matched: if your read contains gctactacg.... and the barcode is aagctactacg, no hit will be called. I would suggest trimming off any bases that are not included in the sequence (you may need to inspect your data manually for this), so the N2-4 should definitely be trimmed. Moreover, the first couple of bases in a read are sometimes called as "N" or "A" with low quality. In this case one can also trim the first couple of barcode bases, e.g. gtggtatcaaCGCAGAgtNNNNtNNNNtNNNNtct

  2. MIGEC requires an exact match of at least several bases to call a hit (it performs an exact search for these bases). So one should make some of the barcode bases uppercase, e.g.:
    gtggtatcaaCGCAGAgtNNNNtNNNNtNNNNtct or gtggtatCAACGcagagtNNNNtNNNNtNNNNtct

Best regards,
Mike
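
The matching rules Mike describes above can be illustrated with a small Python sketch. This is not MIGEC's actual code: the function names, the `max_mismatches` parameter, and its default of 2 are assumptions made for illustration only. It assumes uppercase bases must match exactly, lowercase bases tolerate a limited number of mismatches, `N` positions are collected as the UMI, and the whole barcode must fit inside the read.

```python
def _fit_window(window, pattern, max_mismatches):
    """Compare one read window against the barcode pattern.

    Returns the UMI string on success, or None if the window does not fit.
    """
    umi = []
    mismatches = 0
    for read_base, pat_base in zip(window, pattern):
        if pat_base == "N":
            # UMI position: any base is accepted and recorded.
            umi.append(read_base)
        elif pat_base.isupper():
            # Uppercase "seed" base: must match exactly.
            if read_base != pat_base:
                return None
        elif read_base != pat_base.upper():
            # Lowercase base: a mismatch is tolerated up to the limit.
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    return "".join(umi)


def match_barcode(read, pattern, max_mismatches=2):
    """Return (offset, umi) for the first position where the whole
    pattern fits inside the read, or None if there is no hit.

    max_mismatches is a hypothetical parameter, not a MIGEC option.
    """
    read = read.upper()
    for offset in range(len(read) - len(pattern) + 1):
        umi = _fit_window(read[offset:offset + len(pattern)],
                          pattern, max_mismatches)
        if umi is not None:
            return offset, umi
    return None


# Barcode suggested in the reply above.
BARCODE = "gtggtatcaaCGCAGAgtNNNNtNNNNtNNNNtct"

# Made-up read: 2 leading random bases, primer, three 4-base UMI blocks.
READ = "AC" + "GTGGTATCAACGCAGAGT" + "ACGT" + "T" + "TTGA" + "T" + "CCAG" + "TCT" + "GGGAAA"

print(match_barcode(READ, BARCODE))  # (2, 'ACGTTTGACCAG')
```

With this toy model, a barcode that extends past the start of the read (such as the aagctactacg example above, with some seed bases made uppercase) finds no window in which it fits entirely, which is why trimming the leading bases helps.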

Dear Mike,
Sorry to trouble you again, and thanks for your earlier suggestions.
First, what is an appropriate amount of data for TCR analysis (at least 2G of data, or more, from HiSeq X)?
Second, how should data sizes be normalized across many different samples (by random selection)?
Third, I have read your paper "Dynamics of Individual T Cell Repertoires: From Cord Blood to Centenarians", which used MIGEC and MiTCR to analyze the data with a threshold of two reads per molecular identifier group. I have now tried two methods: one is MIGEC alone (five reads per molecular identifier group), the other is MIGEC (two reads per molecular identifier group) together with MiXCR. As expected, the second method generates more clonotypes than the first, so I am unsure which one is the right choice for the analysis.
I look forward to your reply!
Best,
Jessica
@mikessh