Insert size estimated from alignment for "mp" is short
simoncchu opened this issue · 5 comments
Hi,
I am using NxTrim to trim a human mate-pair dataset (NA12878), downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490. I run nxtrim with the command "NxTrim/nxtrim -1 1.fastq -2 2.fastq -O test" and get four .gz files. Then I align "test.mp.fastq.gz" with bwa (after first extracting and splitting the reads into left and right files). The estimated insert size is around 300bp, but the insert size for mate-pair data should be around 2000bp. Is there anything I did wrong?
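For anyone reproducing this, the "extract and split" step can be sketched roughly as below. This is a minimal Python sketch, not NxTrim's or bwa's own tooling; the filenames and the 4-line-per-record FASTQ layout are assumptions:

```python
import gzip

def deinterleave(interleaved_path, r1_path, r2_path):
    """Split an interleaved FASTQ (R1/R2 records alternating) into two files.

    Assumes each record is exactly 4 lines, as in standard FASTQ.
    """
    with gzip.open(interleaved_path, "rt") as src, \
         open(r1_path, "w") as r1, open(r2_path, "w") as r2:
        out = (r1, r2)
        record = []
        for i, line in enumerate(src):
            record.append(line)
            if len(record) == 4:
                # even-numbered records go to R1, odd-numbered to R2
                out[(i // 4) % 2].writelines(record)
                record = []

# hypothetical usage:
# deinterleave("test.mp.fastq.gz", "test.mp_1.fastq", "test.mp_2.fastq")
```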
I also aligned "test.unknown.fastq.gz" with bwa, and the estimated insert size is around 2000bp.
The same holds for "test.pe.fastq.gz": the estimated insert size is around 2000bp.
So why is the insert size for "test.mp.fastq.gz" only 300bp?
Thank you.
Best,
Chong
I can recreate this issue on that particular data set, so it is definitely a bug.
I cannot recreate this behaviour on smaller bacterial data sets, so I can't immediately say what is causing it; I will investigate.
Unfortunately I think this data set has some strange artifacts. I should not have used this as an example on the wiki.
Take this read pair for example:
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/1
TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGGAAGTTTCTGGACTGTCTCTTATACACATCTAGATGTGTA
+
>5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJIJIGHGIIHF?JJJIEIIIJGJIIHGJHGHHHFFFFFCB@
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/2
ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA
+
CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@
The adapter starts at position 73 in R1, so after trimming we get:
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/1
TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGGAAGTTTCTGGA
+
>5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJIJIGHGIIHF?J
@ERR262996.9358 HSQ1008:208:C0VH6ACXX:7:1101:1172:75929/2
ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA
+
CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@
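The trim shown above is just a cut of R1 at the reported adapter position (treating 73 as a 0-based offset into the read); a minimal sketch:

```python
def trim_at_adapter(seq, qual, adapter_start):
    """Keep only the bases (and qualities) before the adapter start.

    adapter_start is treated as a 0-based offset into the read.
    """
    return seq[:adapter_start], qual[:adapter_start]

# R1 sequence and qualities from the read pair above
r1_seq = ("TTTGAGTCCAAATTCCTGAGGGCACCTTTTAGTGACTAACATTTAAACATCATCTTCTTAGG"
          "AAGTTTCTGGACTGTCTCTTATACACATCTAGATGTGTA")
r1_qual = (">5AC>ECEDEFDB?FHFHHFHDHIIHHGEIGHIIJJIIGGFHFGIGGHGFHJIJIGIHGJJI"
           "JIGHGIIHF?JJJIEIIIJGJIIHGJHGHHHFFFFFCB@")

trimmed_seq, trimmed_qual = trim_at_adapter(r1_seq, r1_qual, 73)
# trimmed_seq is the 73bp read ending in ...GGAAGTTTCTGGA shown above
```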
So this should definitely be a MP given how the Nextera kit works (disclaimer: I am not a molecular biologist).
But then when we align:
ERR262996.9358 97 12 106069025 60 73M = 106069048 124 TCCAGAAACTTCCTAAGAAGATGATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAA J?FHIIGHGIJIJJGHIGIJIJHFGHGGIGFHFGGIIJJIIHGIEGHHIIHDHFHHFHF?BDFEDECE>CA5> NM:i:0 MD:Z:73 AS:i:73 XS:i:19
ERR262996.9358 145 12 106069048 60 101M = 106069025 -124 ATGTTTAAATGTTAGTCACTAAAAGGTGCCCTCAGGAATTTGGACTCAAAATATACATCCACTGCATGTAGCTTGATCTCCTGGAAGAATAAACTTTGTAA CCCCCEFDCEECCCC@DDDDEGHHEHC;JJIGIHHCIJJIHGGIIEIJIGFIJIHIHFJJHJGHJIIHEFBIIGHHEIIIIHGEJIHJHHHHHDDFFFBB@ NM:i:0 MD:Z:101 AS:i:101 XS:i:19
The fragment length is only 124, which would imply the reads overlap and hence that R2 should contain some adapter sequence, but it doesn't. NxTrim (and you) appear to be doing the correct thing, but there is something wrong with the data.
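For reference, the 124bp fragment length and the expected overlap follow directly from the two SAM records above (template length = end of the rightmost read minus start of the leftmost read); a quick sketch:

```python
# Positions and aligned lengths taken from the SAM records above.
r1_pos, r1_len = 106069025, 73   # POS and aligned length of R1 (CIGAR 73M)
r2_pos, r2_len = 106069048, 101  # POS and aligned length of R2 (CIGAR 101M)

# Template length: end of the rightmost read minus start of the leftmost read.
tlen = (r2_pos + r2_len) - r1_pos
assert tlen == 124

# If the fragment is shorter than the combined read lengths, the reads
# must overlap, so adapter read-through into R2 would be expected.
reads_should_overlap = tlen < r1_len + r2_len
assert reads_should_overlap
```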
I think the solution to this "bug" is that I upload a better quality data set to ENA and change the link on the wiki!
I am in the process of uploading a higher quality MP data set, and will let you know when it is done.
Thank you for all the help. I will re-analyze the data after you finish uploading. BTW, can you upload data from the same individual if possible, especially NA12878?
I will upload a recent NA12878 run. Unfortunately I don't think we have the whole trio.
This data set is really quite old and not representative of the current standard. We probably should look at getting it taken off ENA.