mikessh/migec

Wanting to extract 400nt long reads after demuxing using checkout but not sure what is happening to the sequences

DanBolland opened this issue · 2 comments

Hi Mikhail,

We have done 400+100 asymmetric sequencing and now want to extract just the long 400 nt V-end reads after demultiplexing with Checkout, so that we can look at V gene usage (after deduplicating using the UMI). However, we have found that Checkout seems to copy the reverse complement of read 1 into read 2 (and puts the UMI and its quality scores into the header, see below), and the original read 2 goes missing. Is this what should happen? What is the purpose of this? Is there a way to obtain just the long V read with the UMI?

Original FastQ R1 (400bp):
@HWI-M02293:263:000000000-B3NFW:1:1101:20720:1131 1:N:0:
GGGTGTGACCAATTGGGCAGCCCTGATTGGGGGAAGACATTTGGGAAGGACTGACTCTCTGCAGAGACTGTGAGAGTGGTGCCTTGGCCCCAGTAGTCGTAGCTACTACCGTAGTACTAAAGGGACTCTCTTGCACAGTAATACATGGCTGTGTCCTCAGACTTAAGATGGCTCATTTGCAGGTACAGGTTGTTCTTGGCATTGTCTCTGGAGATGGTGAATCTGGCCTTTACATTGTCCGGATAGTAGGTGTAAATACCACACTAACTAATGGTTGCCACCATCCCCAGCCTTTTTTCACGGTTCTGGTGACACCAATACTGGGCATAGCGACTGAACAGGAACTCAGGGGTTTCACAGGACAGAGTACGAGACACTACAAGTGTCACAATAACAACCCA
+
CCCCCGGGFGGGGGCGG@CEGG;FCEFG9FGDGEGGGGCFFGGGGGGGDFGFFGGCEFFGGFGGGGGGGGGGDFG@FCFFGG<FGC@FGGGGGG9FGFDFD,CFFGGCG8FFC7BE,EFGGAAFCCFFAFFA9EE95B5E<@,?F5CFFECFEEF,BDC<<8E5,8,CFGE,,5CEDDAAFFFFF93DCFFCBD9=FGC,8>>@d,@;,,<=3,8F@,,3@,,,7,,3,<>@FGF,9D,,:C7C>2,4?9,4?9C6@################################################################################################################################################
Original FastQ R2 (100bp):
@HWI-M02293:263:000000000-B3NFW:1:1101:20720:1131 3:N:0:
NATATGACCACAGTGGTATCAACGCAGGGTTATCTTGTATCATTTCTTGGGGGGAGCTCTGACAGAGGAGGCCGGTACTGGNTTCTAGTTCNTCNCATTCA
+
#8ACCFCFG<FFGGGGGGFGGGFEEBF+@FCF@<FC<CC<EFFEFFGGFGGDGGGGGCCF9?F9FG8+4?:FB7F=+CBFC#::B+F<F############

Checkout R1:
@HWI-M02293:263:000000000-B3NFW:1:1101:20720:1131 3:N:0: R2 UMI:TATCTGTACATT:CF@<C<CCEFFE
GGGGGAAGACATTTGGGAAGGACTGACTCTCTGCAGAGACTGTGAGAGTGGTGCCTTGGCCCCAGTAGTCGTAGCTACTACCGTAGTACTAAAGGGACTCTCTTGCACAGTAATACATGGCTGTGTCCTCAGACTTAAGATGGCTCATTTGCAGGTACAGGTTGTTCTTGGCATTGTCTCTGGAGATGGTGAATCTGGCCTTTACATTGTCCGGATAGTAGGTGTAAATACCACACTAACTAATGGTTGCCACCATCCCCAGCCTTTTTTCACGGTTCTGGTGACACCAATACTGGGCATAGCGACTGAACAGGAACTCAGGGGTTTCACAGGACAGAGTACGAGACACTACAAGTGTCACAATAACAA
+
9FGDGEGGGGCFFGGGGGGGDFGFFGGCEFFGGFGGGGGGGGGGDFG@FCFFGG<FGC@FGGGGGG9FGFDFD,CFFGGCG8FFC7BE,EFGGAAFCCFFAFFA9EE95B5E<@,?F5CFFECFEEF,BDC<<8E5,8,CFGE,,5CEDDAAFFFFF93DCFFCBD9=FGC,8>>@d,@;,,<=3,8F@,,3@,,,7,,3,<>@FGF,9D,,:C7C>2,4?9,4?9C6@############################################################################################################################################
Checkout R2:
@HWI-M02293:263:000000000-B3NFW:1:1101:20720:1131 1:N:0: R1 UMI:TATCTGTACATT:CF@<C<CCEFFE
TGGGTTGTTATTGTGACACTTGTAGTGTCTCGTACTCTGTCCTGTGAAACCCCTGAGTTCCTGTTCAGTCGCTATGCCCAGTATTGGTGTCACCAGAACCGTGAAAAAAGGCTGGGGATGGTGGCAACCATTAGTTAGTGTGGTATTTACACCTACTATCCGGACAATGTAAAGGCCAGATTCACCATCTCCAGAGACAATGCCAAGAACAACCTGTACCTGCAAATGAGCCATCTTAAGTCTGAGGACACAGCCATGTATTACTGTGCAAGAGAGTCCCTTTAGTACTACGGTAGTAGCTACGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCTGCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCC
+
################################################################################################################################################@6c9?4,9?4,2>C7C:,,D9,FGF@><,3,,7,,,@3,,@f8,3=<,,;@,D@>>8,CGF=9DBCFFCD39FFFFFAADDEC5,,EGFC,8,5E8<<CDB,FEEFCEFFC5F?,@<E5B59EE9AFFAFFCCFAAGGFE,EB7CFF8GCGGFFC,DFDFGF9GGGGGGF@CGF<GGFFCF@GFDGGGGGGGGGGFGGFFECGGFFGFDGGGGGGGFFCGGGGEGDGF9

The Checkout R2 sequence is the exact reverse complement of the Checkout R1 sequence, apart from an extra TGGG at the start.
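
The observation can be checked on the ends of the records pasted above. The following is a minimal Python sketch, not part of MIGEC; the helper name `reverse_complement` and the variable names are made up for illustration, and the short sequences are copied from the end of the original R1, the end of the Checkout R1 and the start of the Checkout R2. It also suggests that the extra TGGG is the reverse complement of the CCCA that ends the original R1 but is missing from the Checkout R1.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(complement)[::-1]


# Sequence fragments copied from the FASTQ records in this post.
original_r1_tail = "CACAATAACAACCCA"   # last 15 bases of the original R1
checkout_r1_tail = "CACAATAACAA"       # last 11 bases of the Checkout R1
checkout_r2_head = "TGGGTTGTTATTGTG"   # first 15 bases of the Checkout R2

# Skipping the leading TGGG, Checkout R2 mirrors the end of Checkout R1.
matches_checkout_r1 = reverse_complement(checkout_r2_head[4:]) == checkout_r1_tail

# Including the TGGG, Checkout R2 mirrors the end of the untrimmed original R1.
matches_original_r1 = reverse_complement(checkout_r2_head) == original_r1_tail

print(matches_checkout_r1, matches_original_r1)
```

Both comparisons hold for these fragments, so the Checkout R2 record appears to be derived from the original R1 rather than from the original R2.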

Dear Dan,

Can you please post the command you used and the barcode sequences?

Running migec Checkout -cute barcodes.txt R1.fastq R2.fastq ch/ on your two original reads produced the correct result for me.

Hi mikessh,
Allow me to chip in here. We have used the following command:

java -jar migec-1.2.6/migec-1.2.6.jar Checkout -cute barcodes.txt lane1_merged_L001_R1.fastq.gz lane1_merged_L001_R2.fastq.gz ./checkout

And here are the barcodes:

BC_IgM	CGATGTcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct	CGATGTattgggcagccctgatt
DHL	TGACCAcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct	TGACCAattgggcagccctgatt
EHL	ACAGTGcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct	ACAGTGattgggcagccctgatt
FHL	GCCAATcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct	GCCAATattgggcagccctgatt
FO1HL	CAGATCcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct	CAGATCattgggcagccctgatt

I believe the example read above came from the DHL sample. Many thanks! Felix
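
The DHL assignment can be checked by hand against the example reads. The sketch below is an illustration only and is not how MIGEC implements Checkout. It assumes the barcode convention that uppercase letters must match exactly, lowercase letters may mismatch, and N marks UMI positions. The function name `match_barcode` and the mismatch limit of 2 are hypothetical choices made for this example. The read fragments are the first bases of the original R2 and R1 posted above.

```python
def match_barcode(pattern, read, max_mismatches=2):
    """Slide `pattern` along `read`; return (start, umi, mismatches) or None.

    Assumed convention: uppercase = exact match required,
    lowercase = mismatch tolerated, N = UMI base to extract.
    """
    for start in range(len(read) - len(pattern) + 1):
        umi, mismatches, ok = [], 0, True
        for p, base in zip(pattern, read[start:start + len(pattern)]):
            if p == "N":                 # UMI position: collect the base
                umi.append(base)
            elif p.isupper():            # seed position: must match exactly
                if p != base:
                    ok = False
                    break
            elif p.upper() != base:      # fuzzy position: count the mismatch
                mismatches += 1
                if mismatches > max_mismatches:
                    ok = False
                    break
        if ok:
            return start, "".join(umi), mismatches
    return None


# DHL barcodes from the table above.
DHL_MASTER = "TGACCAcagtggtatcaacgcagagtNNNNtNNNNtNNNNtct"
DHL_SLAVE = "TGACCAattgggcagccctgatt"

# Leading bases of the original reads posted in this issue.
r2_prefix = "NATATGACCACAGTGGTATCAACGCAGGGTTATCTTGTATCATTTCTTGGGG"
r1_prefix = "GGGTGTGACCAATTGGGCAGCCCTGATTGGGGGAAGACATT"

master_hit = match_barcode(DHL_MASTER, r2_prefix)
slave_hit = match_barcode(DHL_SLAVE, r1_prefix)
print(master_hit)
print(slave_hit)
```

Under these assumptions the master barcode is found in the original R2 with one mismatch, and the extracted UMI is TATCTGTACATT, the same UMI that Checkout wrote into the headers. The slave barcode is found in the original R1 at offset 5, and the bases that follow it (GGGGGAAGACATT...) are where the Checkout R1 sequence begins.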