dphansti/mango

Step 3 fails with no bedpe file written; "Read names of PET ends do not match."

Closed this issue · 5 comments

Hello,

I am having an issue with step 3 of the pipeline. After the bowtie alignment, the pipeline writes neither a bedpe file nor a tagAlign file, because of this error:

"Error: read names of PET ends do not match"

The error comes from line 340 of mango/mangoC.cpp.

Thus, step 4 fails because the "tagAlign" file is empty.

From what I can tell, mango compares read names by going line by line through the "x_1.same.sam" and "x_2.same.sam" files. The unix "head" output for one of the "x_1.same.sam" files looks like this (there are no @SQ headers?):

SRR2029844.1 4 * 0 0 * * 0 0 ATATCTTATCTGACTAAGAAATCCCGCTTCCAACGAAGGCCTCNNNNAAGTCTGAATATCCACTTGCAGACTTTACAAACAGAGTGTTTCCCAACTGCTC B3A0@E>DGGFGGGG1EBFE11EFFGGGBG1>11=:CFGGGEG####=0==C1EF1CG1@GCG>FGGGGG1FC@CG1CFGG:E:E>BFGG0:0:B::FCF XM:i:0
SRR2029844.3 4 * 0 0 * * 0 0 AGTTAGTCAGATAAGATATCGCGTATTTTTTATGGCTGCATAGTNNNCCATGGTGTATATGTGCCACATTTAATTAATCCAGTCTATCATTCTTGGACAT BCCCCGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG###==EFEGGGGGGGGGGGGGGGGGGGGGGGGGGCGFGEGGGGGGGGFGGGGGGGF XM:i:0
SRR2029844.6 4 * 0 0 * * 0 0 GCATAGGCAGAGCTCACACAAGCTGGAAACTGTGTGTTCTTAANNNNACGCGATATCTTATCTGACTTCCCCATGTACATGTGCTGCTGTTCCCCATAAA CCBCBGGGGGFGGGGGEGGGGEFGGEGGGGGEDGCFCDGCGEG####===FGGGFGGGGGFGGFCGGGGGGGGEGEGGGGGGCGGGCGGGFCEGGEGGGG XM:i:0
SRR2029844.5 4 * 0 0 * * 0 0 GGTTTGAATCCAGTTTGTTCTATGACCAACTCAAATGATTTTTCNNNCTCCCATCACCTTTTATGTGAGATATGAAGAAAAAAGGTGAGAGAAAAAGAGA BBABB>GGGGGGCGCD1CGGGGGGDCGGEDC###===<0:E:@1f>GGGGFC@C@FCC@1CGGCGGG>0EG>CFGG>G0ECGGGEB0 XM:i:0
SRR2029844.7 4 * 0 0 * * 0 0 GCATAGGGTTATGGAAGCAAAGTCAAAAGACCCTACTCAGGCGNNNNCTTATCTGACGTCAGATAAGATATCGCGTCTTGATGACTTATCTTACGTCGTC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG####===FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@ XM:i:0
SRR2029844.8 4 * 0 0 * * 0 0 GATATCGCGGTCAGATAAGATATCGCGTCCTTATGGCTTATGGNNNNTGAACGTTTTATTCTTTTAACTTGGTTGCTCTTCTTACAGGTAACCAACTGTT BBCCCGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGG####=00=@FGGGGGGEGGGGGFGG>GCGGGGGGGGFGGGFDFG>G8FGGGGGGGCG XM:i:0
SRR2029844.2 4 * 0 0 * * 0 0 ACACCACCCTCTAGGTAAATCTGGTTCAAGCAGGTGAGGGTGGCNNNAGGTGAGCACACACCCCACACGGCTAGGATGCCAGTGAGTGGGAAGCAGATAG CCCCCGGGGGGGGGG1CGCCCDGGG1FFCF1E10CFGGGF=9<:###::=ECGFGGGGGGGGGGGGBGDAGGGBE>BFEGDGEFFE0>FGG0FEGF8FGG XM:i:0
SRR2029844.4 4 * 0 0 * * 0 0 CAGTAGAGACATTTACTTATTTAACCCAGAAAAAAACTGAAATGNNNTTCTCCTCATTCATTCTTACCATCTCATTTTAACTCTTTCTACATTTTACAAA CCCCBGGGGGGGGGGGGGGGG>FGGGGGGGGGGGGGGGGGDGGG###<=EFGGGGGGGGGGGGGGGGEFGGGGGGGGGGGBGGGGGGGGGGGGGGEGGGC XM:i:0
SRR2029844.9 4 * 0 0 * * 0 0 CAGCAGAGGCTGATATCCAGCGTGTGCTCCATAAATGTCGTATANNNATAGCTGCCTTGATCTGCTAGCTTTGGGGTAAATTTTACATATGTCATTTAAT BBBBBGGGEGGGGGGGGGGGGGGGGGGGGGGGFGGG>FGGGGGG###==EFGFGGGGGGGDFGGGGGGGGGGGGGGGGEG@>CGGGGGGGGGGGGFEGEG XM:i:0
SRR2029844.10 4 * 0 0 * * 0 0 AGATTCTTCAAAATTAATTTCACGCGATATCTTATCTGACTATGNNNCATTTAAAATGTGTCTTCCTCAGCCAAGAGCAACAAAGTGAGACCAGCCAAGA @BBBCGGGCGGGEGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGF###==EFGGGGEGGGGGGGGDGGGGGFFGGGGFGGGGGGGGGGGGGGGGFGGFGGG XM:i:0
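The line-by-line comparison described above can be reproduced outside mango to find where the two files first diverge. This is only a minimal Python sketch, not mango's own code (which lives in mangoC.cpp). The function names `read_name` and `first_name_mismatch` are hypothetical, and the sketch assumes the read name is the first whitespace-delimited field of each headerless SAM line.

```python
from itertools import zip_longest


def read_name(line):
    """Return the first whitespace-delimited field of a SAM line.

    Returns None when the line is missing (one file ended early)
    and "" when the line is blank.
    """
    if line is None:
        return None
    fields = line.split(None, 1)
    return fields[0] if fields else ""


def first_name_mismatch(sam1_lines, sam2_lines):
    """Walk two headerless SAM streams in lockstep.

    Returns (line_number, name1, name2) for the first pair of lines whose
    read names differ, or None if every line pairs up. Line numbers are
    1-based. Both arguments can be open file objects, so nothing is loaded
    into memory at once.
    """
    pairs = zip_longest(sam1_lines, sam2_lines)
    for line_number, (line1, line2) in enumerate(pairs, start=1):
        name1 = read_name(line1)
        name2 = read_name(line2)
        if name1 != name2:
            return line_number, name1, name2
    return None
```

Passing the open file objects for "x_1.same.sam" and "x_2.same.sam" to `first_name_mismatch` streams through both files and stops at the first disagreement, which should be practical even for very large files.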


I am running on CentOS 6.4, bowtie 1.1.2, bedtools2, macs2, R 3.2.1, Python 2.7. Any help would be appreciated. I launched the pipeline with this command:

#!/bin/bash
#$ -cwd
#$ -q UI
#$ -pe smp 8
#$ -m bea
#$ -e stderr
#$ -o stdout
#$ -N mango

# source module initialization
source /etc/profile.d/modules.sh

# load needed R modules
module load R/3.2.1
module load python/2.7

# tell R where to look for extra packages
export R_LIBS=~/mango/R_packages/

Rscript mango.R \
  --fastq1 /nfsscratch/Users/mchiment/mango/SRR2029844_1.fastq \
  --fastq2 /nfsscratch/Users/mchiment/mango/SRR2029844_2.fastq \
  --prefix SRR2029844 \
  --outdir /nfsscratch/Users/mchiment/mango \
  --chromexclude chrM,chrY \
  --stages 1:5 \
  --keepempty TRUE \
  --shortreads FALSE \
  --maxlength 1000 \
  --bowtieref ~/mango/bowtie-1.1.2/indexes/hg19 \
  --bedtoolsgenome ~/mango/bedtools2/genomes/human.hg19.genome \
  --bowtiepath ~/mango/bowtie-1.1.2/bowtie \
  --macs2path ~/mango/macs2

Yes. You are correct. Mango requires that the FASTQ files are perfectly matched: every record in fq1 needs to correspond to the same read, at the same position, in fq2. If that is not the case, Mango will produce this error. I have not witnessed this myself, but I have been told of paired-end FASTQ files that are not perfectly matched, and perhaps that is what is happening here. I would recommend extracting the read names from each of the files to see if they match perfectly. Please respond here to let us know whether or not it was a problem with the FASTQ files.
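One way to do that read-name check is sketched below in Python. It is only an illustration under stated assumptions: the function names `fastq_names` and `first_fastq_mismatch` are hypothetical, each FASTQ record is assumed to occupy exactly four lines (no wrapped sequences), and a trailing "/1" or "/2" on a read name is treated as a mate suffix and ignored.

```python
from itertools import islice, zip_longest


def fastq_names(lines):
    """Yield the read name from each header line of a 4-line-per-record FASTQ."""
    # Header lines are lines 1, 5, 9, ... of the file.
    for header in islice(lines, 0, None, 4):
        fields = header[1:].split()  # drop the leading '@', keep the first token
        name = fields[0] if fields else ""
        # Assumption: "/1" and "/2" are mate suffixes, not part of the name.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_fastq_mismatch(fq1_lines, fq2_lines):
    """Compare read names record by record.

    Returns (record_number, name1, name2) for the first record whose names
    differ (a name is None if that file ran out of records first), or None
    if the two files are perfectly matched.
    """
    pairs = zip_longest(fastq_names(fq1_lines), fastq_names(fq2_lines))
    for record_number, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return record_number, name1, name2
    return None
```

Both functions accept open file objects, so the two FASTQ files can be streamed rather than read into memory.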

Yes, the names match in the input fastq files. I believe the problem is that the alignment "SAM" files coming out of bowtie are not sorted. Thus, read "3" is not in the same position in both the _1 and _2 SAM files. How could I troubleshoot this?

I am aware of the samtools "sort" functionality. Unfortunately, the SAM files being produced are not canonical SAM files (i.e., they have no headers), so samtools fails to recognize them. They are also (in my case) enormous (~45G each).
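Short of sorting, the ordering itself can be checked in a single pass. The Python sketch below is a hypothetical helper (`order_breaks` is not part of mango or samtools) and assumes every read name ends in ".<integer>", as in the SRR2029844.N names shown above. It reports each record whose index is lower than that of the record before it.

```python
def order_breaks(names):
    """Yield (position, previous_name, name) wherever the numeric suffix drops.

    `names` is any iterable of read names ending in ".<integer>".
    Positions are 1-based.
    """
    previous_name = None
    previous_index = None
    for position, name in enumerate(names, start=1):
        # Assumption: the part after the last "." is the read's serial number.
        index = int(name.rsplit(".", 1)[1])
        if previous_index is not None and index < previous_index:
            yield position, previous_name, name
        previous_name, previous_index = name, index
```

Applied to the ten read names in the SAM head output above (1, 3, 6, 5, 7, 8, 2, 4, 9, 10), it flags the records at positions 4 and 7.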


Here are the 'heads' of the input fastq files:

@SRR2029844.1 WIGTC-HISEQ2:1:1212:2446:2225 length=100
ATATCTTATCTGACTAAGAAATCCCGCTTCCAACGAAGGCCTCNNNNAAGTCTGAATATCCACTTGCAGACTTTACAAACAGAGTGTTTCCCAACTGCTC
+SRR2029844.1 WIGTC-HISEQ2:1:1212:2446:2225 length=100
B3A0@E>DGGFGGGG1EBFE11EFFGGGBG1>11=:CFGGGEG####=0==C1EF1CG1@GCG>FGGGGG1FC@CG1CFGG:E:E>BFGG0:0:B::FCF
@SRR2029844.2 WIGTC-HISEQ2:1:1212:1722:2225 length=100
ACACCACCCTCTAGGTAAATCTGGTTCAAGCAGGTGAGGGTGGCNNNAGGTGAGCACACACCCCACACGGCTAGGATGCCAGTGAGTGGGAAGCAGATAG
+SRR2029844.2 WIGTC-HISEQ2:1:1212:1722:2225 length=100
CCCCCGGGGGGGGGG1CGCCCDGGG1FFCF1E10CFGGGF=9<:###::=ECGFGGGGGGGGGGGGBGDAGGGBE>BFEGDGEFFE0>FGG0FEGF8FGG
@SRR2029844.3 WIGTC-HISEQ2:1:1212:2069:2226 length=100
AGTTAGTCAGATAAGATATCGCGTATTTTTTATGGCTGCATAGTNNNCCATGGTGTATATGTGCCACATTTAATTAATCCAGTCTATCATTCTTGGACAT

@SRR2029844.1 WIGTC-HISEQ2:1:1212:2446:2225 length=100
GGTGAACGATCCTTTACACAGAGCAGACTTGAAACACTCNTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGTTCANNNNNNNNNNNNNNNN
+SRR2029844.1 WIGTC-HISEQ2:1:1212:2446:2225 length=100
A3:<A110EFGGGEGGF111@C:F11F11>::=F1C01>#0=<:C1>/1?F@1EFGDGGGG@dc1:FCFGGG/<F101E1?################
@SRR2029844.2 WIGTC-HISEQ2:1:1212:1722:2225 length=100
CTCCTGACCTCAAGTGATCCAAGTCAGATAAGATATCGCNGGAGATGTTAGTGGGCAGTTAACTTTTTATACATGTAAATAGAANNNNNNNNNNNNNNNN
+SRR2029844.2 WIGTC-HISEQ2:1:1212:1722:2225 length=100
3A3<A@=@BF1EGGGGGG>@=CFFBGGB1FB:1F1FF>D#=000EFFD:@ce=FDGBGGGCFGGGFGB11?FGGE1F1FC##################
@SRR2029844.3 WIGTC-HISEQ2:1:1212:2069:2226 length=100
ACAAAGATTTGGAACCAACCCAAATGTCCAAGAATGATANACTGGATTAATTAAATGTGGCACATATACACCATGGAATACTATNNNNNNNNNNNNNNNN


Hmmm. The sam files should still be in the same order as the fastq files. And yes, they are deliberately output without @ lines so that the lines match up perfectly. I have not seen this behavior before.

Can you post the 'heads' of your 'same.fastq' and '.sam' files as well?

Hi mchimenti,

I also got this error initially. I had downloaded the raw data from NCBI using the SRA Toolkit. However, I noticed later that the fastq files generated by the "fastq-dump" command cause problems with mango.

Later, I tried downloading the raw data from the European Nucleotide Archive (ENA) and then analyzing it using mango. It worked perfectly fine and that stage completed successfully.

I don't know if this can be of some help to you.

Best Regards,

Hello:
I tried downloading the same dataset from the ENA and the pipeline did work without issue. That was a very helpful suggestion!!