jiantao/Tangram

Tangram_scan segmentation fault using BWA aligned reads

Opened this issue · 2 comments

Hi,

I would like to use Tangram to identify the locations of transposable elements in Drosophila, but when I run tangram_scan I get a segmentation fault. I suspect that tangram_bam is not working, as it looks like the ZA tags are mostly empty (I think?). I know that my strains should be heterozygous for a number of different transposable elements, and there are already estimates of their locations. I'd rather not run Mosaik, so if there is a way to get tangram_bam to work, that would be nice.

I've put below info about what I'm doing. Thanks!
Zoe

As a positive control, I know from previous work that there should be at least 78 heterozygous copies of INE_1 in my strain. For example, I have this data:
te             presence  chr  Upstream_estimate  Downstream_estimate
INE-1:TIR:DNA  yes       2R   2496555            2497124
INE-1:TIR:DNA  yes       4    286454             287918
INE-1:TIR:DNA  yes       3L   17862241           17862700

I can get a copy of the INE_1 sequence from FlyBase (transposon_sequence_set.embl.txt), so I make my moblist file, which contains only:

moblist_INE-1 GA(transposon_sequence_set.embl.txt) SN(Drosophila melanogaster)
tatacccgttactagattcgttgaaatgaatgtaacaggcagaaggaagcgtcttagaccatatatagtatatacatacatgtatattcttgatcaggatcaatagccgagtcgatcttgccatatccgtctgtccgtatgaacgtcgagatctcaggaactataaaagctagaaggtttagattcagcatacagagacaaagacgcaagtagccatgcccactctaacgtccacaaacagcgcaaaactatcacgcccacacttttgaaaaatgtgttgttcttttcacattctgattagtcttttacatttctatcgatttccaaaaaaaaactttttgccaacgccctaaaaccgcccaaaactccgacacccacatttgtaaaaaattgttgggaatttttttcataaatttattagtttattatttattataaatttaagtttatatcgatttgccgacaacatattttaattttttttctcattttatcttttatctatcgatatcccagaaaaattgtgcaatttcgcattcacactagctgagtaacgggtatctgatagtcgggaaactcgactatagcattctctctttttgaaattgcgg
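
Before passing a mobile-element file to tangram_bam, it can be worth checking that it is well-formed FASTA, since a malformed reference is one possible (unconfirmed) source of trouble downstream. The following is a minimal sketch, not part of Tangram: `check_fasta` is a hypothetical helper, and the allowed alphabet (A, C, G, T, N in either case) is an assumption.

```python
def check_fasta(text, alphabet="ACGTNacgtn"):
    """Return a list of human-readable problems found in FASTA-formatted text."""
    problems = []
    seen_header = False    # True once any '>' header line has been read
    has_sequence = False   # True once the current record has sequence lines
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header and not has_sequence:
                problems.append(f"line {lineno}: previous record has no sequence")
            seen_header = True
            has_sequence = False
        elif not seen_header:
            problems.append(f"line {lineno}: sequence or text before any '>' header")
        else:
            # Report characters outside the assumed nucleotide alphabet.
            bad = sorted(set(line) - set(alphabet))
            if bad:
                problems.append(f"line {lineno}: unexpected characters {''.join(bad)!r}")
            has_sequence = True
    if seen_header and not has_sequence:
        problems.append("last record has no sequence")
    if not seen_header:
        problems.append("no '>' header found")
    return problems
```

Calling `check_fasta(open("myDataPath/moblist_ine_only.fasta").read())` returns an empty list when every record has a `>` header and only the assumed nucleotide characters.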

I generate my BAM file with bwa, using the -a option to keep reads where only one mate of the pair maps to the genome (since this appears to be necessary for Tangram?). These are the command-line options I use:
bwa mem -M -a -R

Then I remove duplicates, sort, and index using Picard Tools. I also merge several BAMs together, because I have a single sample that was used to generate several libraries. Then I run tangram_bam on the merged BAM:
mySoftwarePath/Tangram/bin/tangram_bam -i myDataPath/MA_6.merged.dedup.bam -r myDataPath/moblist_ine_only.fasta -o myDataPath/MA_6.merged.dedup.tangram.bam

Then I sort the resulting BAM:
mySoftwarePath/java -Xmx2g -jar mySoftwarePath/picard-tools-1.105/SortSam.jar INPUT=myDataPath/MA_6.merged.dedup.tangram.bam OUTPUT=myDataPath/MA_6.merged.dedup.sorted.tangram.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

Next I generate my file list, tangramBamList.txt, which contains only:
myDataPath/MA_6.merged.dedup.sorted.tangram.bam

Then I run tangram_scan:
mySoftwarePath/Tangram/bin/tangram_scan -in myDataPath/tangramBamList.txt -dir myDataPath/tangramOut

And I get the error:
Segmentation fault (core dumped)
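
When the pipeline is scripted, it helps to record exactly which step died and by which signal, not only the shell's message. The sketch below is a generic wrapper and not part of Tangram: `run_step` is a hypothetical helper that relies on the documented behaviour of Python's `subprocess` on POSIX, where a negative return code means the child was terminated by that signal.

```python
import signal
import subprocess


def run_step(cmd):
    """Run one pipeline command (a list of arguments) and return (ok, message)."""
    result = subprocess.run(cmd)
    if result.returncode == 0:
        return True, "ok"
    if result.returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        try:
            name = signal.Signals(-result.returncode).name
        except ValueError:
            name = f"signal {-result.returncode}"
        return False, f"killed by {name}"
    return False, f"exit status {result.returncode}"
```

For the command above, `run_step(["mySoftwarePath/Tangram/bin/tangram_scan", "-in", "myDataPath/tangramBamList.txt", "-dir", "myDataPath/tangramOut"])` would report `killed by SIGSEGV` if the process segfaults.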

This is a sample of what my BAM file looks like:
D4LHBFN1:293:C3L3LACXX:2:2213:20193:18303 107 YHet 1 60 16S48M2S = 1 38 CTACGGTTGTCTCAGCAGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTGT BHGHIIIIG@HGGGDGIIGI:BDFHDFEGGG<FGHGIIIBHHFHCDHIIGHIFEHFHFFEDE?CCE PG:Z:MarkDuplicates RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2 NM:i:0 AS:i:48 XS:i:0 ZA:Z:<@;60;;;1;;><&;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2213:20193:18303 151 YHet 1 60 28S38M = 1 -38 ATATGGTGTTTCCTACGGTTGTCTCCGCAGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTT CDCDDDCADDDBDDDDFFHEHHB;-'GHGGDHDB2HBIGGHCGCEGIJJJJJJIJIJJJJJIHEBA PG:Z:MarkDuplicates RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2 NM:i:0 AS:i:38 XS:i:0 ZA:Z:<&;60;;;1;;><@;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2313:15215:43919 147 YHet 10 60 83M = 21 -72 TAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGAT DDDDDDDDDDDCCDDEDDDFFFFFFGHHHHHJJJJJJJJJJJIJJJJIJHJIIIIJJJJHHJIJIIJJJJJJJHHJJIJJJII PG:Z:MarkDuplicates.3 RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2.3 NM:i:3 AS:i:68 XS:i:20 ZA:Z:<&;60;;;1;;><@;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2314:3464:7166 99 YHet 17 60 82M = 56 122 GATCCAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACG AEDHGGIEFHHHHHIICAGGIIFE>DFDHHHGEHHIIIG@FGGGGIIIIIG@HIHIIIGHFHGEFFFF@@EECEEA;>CCCC PG:Z:MarkDuplicates.1 RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2.1 NM:i:3 AS:i:67 XS:i:20 ZA:Z:<@;60;;;1;;><&;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2313:15215:43919 99 YHet 21 60 81M = 10 72 CAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACGTTT JIJHH
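
To check systematically whether the ZA tags are mostly empty, the tag can be pulled out of each SAM text line and split into its `<...>` blocks and semicolon-separated fields. This is a minimal sketch that only reports which positions are empty. It does not assume what each position means, and `parse_za` and `empty_slots` are hypothetical helper names, not part of Tangram or Mosaik.

```python
import re

# One ZA block is the text between a '<' and the next '>'.
ZA_BLOCK = re.compile(r"<([^<>]*)>")


def parse_za(sam_line):
    """Return the ZA tag of one SAM text line as a list of field lists.

    Optional tags start at column 12 of a SAM record; an empty list is
    returned when the record has no ZA tag.
    """
    for field in sam_line.split()[11:]:
        if field.startswith("ZA:Z:"):
            value = field[len("ZA:Z:"):]
            return [block.split(";") for block in ZA_BLOCK.findall(value)]
    return []


def empty_slots(blocks):
    """For each block, list the 0-based positions of its empty fields."""
    return [[i for i, value in enumerate(block) if value == ""] for block in blocks]
```

Applied to the complete records above, both blocks of every ZA tag have empty fields at positions 2, 3, 5 and 6; only the marker (`@` or `&`), the value 60 and the value 1 are filled in.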

inti commented

Similar here. After running

gkno tangram-bam --in bams/93-968.bam --mobile-element-fasta repeats/test_me.fa --out 93-968.tangram.bam --region Chr19 

I get the segmentation fault error:

sh-4.2$ /home/shared/app/gkno_launcher/tools/Tangram/bin/tangram_scan -in /home/ipedroso/ANALYSES/MEI/Populus/file_list.text -dir tangram_out 
Violación de segmento

From the BAM file header:

@PG     ID:bwa  PN:bwa  VN:0.5.9-r16
@PG     ID:tangram_bam  CL:/home/shared/app/gkno_launcher/tools/Tangram/bin/tangram_bam --ref repeats/test_me.fa --input bams/93-968.bam --target-ref-name Chr19 --output /home/ipedroso/ANALYSES/MEI/Populus/93-968_ZA.bam

I have not tried re-aligning this data using MOSAIK.

I have also observed segfaults when running on bwa data and am not sure what the cause of the problem is. If you don't have massive amounts of data, I would recommend aligning with Mosaik, since this is what Tangram was designed to work with. If you need any assistance, please let me know (AlistairNWard@gmail.com) and I can help with getting Mosaik alignments and Tangram running. In particular, we have a pipeline system (gkno) that helps with running larger pipelines and also makes it possible to build your own pipelines for repeated or similar analyses.

In reply to Inti Pedroso's comment of Wed, Sep 16, 2015, 1:36 PM.