Insert size estimated from alignment for "mp" is short
simoncchu opened this issue · 5 comments
Hi,
I am using NxTrim to trim a human mate-pair dataset (NA12878), downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490. I run nxtrim with the command "NxTrim/nxtrim -1 1.fastq -2 2.fastq -O test" and get four .gz files. Then I align "test.mp.fastq.gz" with bwa (after first extracting and splitting the reads into left and right files). The estimated insert size is around 300bp, but the insert size for mate-pair data should be around 2000bp. Is there anything I did wrong?
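For anyone reproducing this, the "extract and split" step can be sketched roughly as below. This is a minimal Python sketch, not NxTrim's or bwa's own tooling; the filenames and the 4-line-per-record FASTQ layout are assumptions:

```python
import gzip

def deinterleave(interleaved_path, r1_path, r2_path):
    """Split an interleaved FASTQ (R1/R2 records alternating) into two files.

    Assumes each record is exactly 4 lines, as in standard FASTQ.
    """
    with gzip.open(interleaved_path, "rt") as src, \
         open(r1_path, "w") as r1, open(r2_path, "w") as r2:
        out = (r1, r2)
        record = []
        for i, line in enumerate(src):
            record.append(line)
            if len(record) == 4:
                # even-numbered records go to R1, odd-numbered to R2
                out[(i // 4) % 2].writelines(record)
                record = []

# hypothetical usage:
# deinterleave("test.mp.fastq.gz", "test.mp_1.fastq", "test.mp_2.fastq")
```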
I also aligned "test.unknown.fastq.gz" with bwa, and the estimated insert size is around 2000bp.
The same holds for "test.pe.fastq.gz": the estimated insert size is around 2000bp.
So why is the insert size for "test.mp.fastq.gz" only 300bp?
Thank you.
Best,
Chong
I can recreate this issue on that particular data set, so it is definitely a bug.
I cannot recreate this behaviour on smaller bacterial data sets, so I can't immediately say what is causing it; I will investigate.
Unfortunately I think this data set has some strange artifacts. I should not have used this as an example on the wiki.
Take this read pair for example:
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/1
TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGGAAGTTTCTGGACTGTCTCTTATACACATCTAGATGTGTA
+
>5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJIJIGHGIIHF?JJJIEIIIJGJIIHGJHGHHHFFFFFCB@
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/2
ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA
+
CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@
The adapter starts at position 73 in R1, so after trimming we get:
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/1
TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGGAAGTTTCTGGA
+
>5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJIJIGHGIIHF?J
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/2
ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA
+
CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@
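The trim shown above is just a cut of R1 at the reported adapter position (treating 73 as a 0-based offset into the read); a minimal sketch:

```python
def trim_at_adapter(seq, qual, adapter_start):
    """Keep only the bases (and qualities) before the adapter start.

    adapter_start is treated as a 0-based offset into the read.
    """
    return seq[:adapter_start], qual[:adapter_start]

# R1 sequence and qualities from the read pair above
r1_seq = ("TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGG"
          "AAGTTTCTGGACTGTCTCTTATACACATCTAGATGTGTA")
r1_qual = (">5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJI"
           "JIGHGIIHF?JJJIEIIIJGJIIHGJHGHHHFFFFFCB@")

trimmed_seq, trimmed_qual = trim_at_adapter(r1_seq, r1_qual, 73)
# trimmed_seq is the 73bp read ending in ...GGAAGTTTCTGGA shown above
```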
So this should definitely be a MP given how the Nextera kit works (disclaimer: I am not a molecular biologist).
But then when we align:
ERR262996.9358 97 12 106069025 60 73M = 106069048 124 TCCAGAAACTTCCTAAGAAGATGATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAA J?FHIIGHGIJIJJGHIGIJIJHFGHGGIGFHFGGIIJJIIHGIEGHHIIHDHFHHFHF?BDFEDECE>CA5> NM:i:0 MD:Z:73 AS:i:73 XS:i:19
ERR262996.9358 145 12 106069048 60 101M = 106069025 -124 ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@ NM:i:0 MD:Z:101 AS:i:101 XS:i:19
The fragment length is only 124, which would imply the reads overlap and hence that R2 should contain some adapter sequence, but it doesn't. NxTrim (and you) appear to be doing the correct thing, but there is something wrong with the data.
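For reference, the 124bp fragment length and the expected overlap follow directly from the two SAM records above (template length = end of the rightmost read minus start of the leftmost read); a quick sketch:

```python
# Positions and aligned lengths taken from the SAM records above.
r1_pos, r1_len = 106069025, 73   # POS and aligned length of R1 (CIGAR 73M)
r2_pos, r2_len = 106069048, 101  # POS and aligned length of R2 (CIGAR 101M)

# Template length: end of the rightmost read minus start of the leftmost read.
tlen = (r2_pos + r2_len) - r1_pos
assert tlen == 124

# If the fragment is shorter than the combined read lengths, the reads
# must overlap, so adapter read-through into R2 would be expected.
reads_should_overlap = tlen < r1_len + r2_len
assert reads_should_overlap
```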
I think the solution to this "bug" is that I upload a better quality data set to ENA and change the link on the wiki!
I am in the process of uploading a higher quality MP data set, and will let you know when it is done.
Thank you for all the help. I will re-analyze the data after you finish uploading. BTW, can you upload data from the same individual if possible, especially NA12878?
I will upload a recent NA12878 run. Unfortunately I don't think we have the whole trio.
This data set is really quite old and not representative of the current standard. We probably should look at getting it taken off ENA.