fulcrumgenomics/fgbio

ZipperBams failed. Error: processed all unmapped reads but there are mapped reads remaining to be read.

teryyoung opened this issue · 3 comments

hello! I'm doing with sequencing data with UMI, I followed #fgbio-best-practise-fastq---consensus-pipeline
When I run fgbio ZipperBams to merge my consensus.ubam and consensus.mapped.bam, there is an exception:
Exception in thread "main" java.lang.IllegalStateException: Error: processed all unmapped reads but there are mapped reads remaining to be read.

and I check my two bam files; they all are queryname sorted, have same number of reads, the difference between them is all of queryname of ubam have suffix "/A". like this:
image
image
I found it is caused by fgbio CallConsensusReads...

I wanna know whether this is the problem caused ZipperBams failed, or how to deal with this problem(ZipperBams)?
#my fgbio version is 2.2.2-3a74fd2-SNAPSHOT

nh13 commented

Somehow the /A is getting lost between the uBAM and the mapped BAM. I see your filename has picard in it, but we don't use picard in the best-practices pipeline. Can you provide the commands you used to go from the uBAM to the mapped BAM?

I used STAR to map the uBAM, so the code looks weird,
1)First I FastqtoBam the raw reads, and picard sorted it;
java -Xmx10g -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar FastqToBam \
--input $r1 $r2 \
--read-structures 5M2S+T 5M2S+T \
--sample ${sample} \
--library ${sample} \
--platform-unit BGIT7 \
--output ./0.1.uBAM/${sample}.uBAM
java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam I=0.1.uBAM/${sample}.uBAM \
O=0.1.uBAM/${sample}.picard.sorted.uBAM\
SORT_ORDER=queryname

2)then I extract fastq from uBAM, mapped them using STAR, forgive me that I didn't find ways to mapping one fastq file using STAR.
samtools fastq 0.1.uBAM/${sample}.picard.sorted.uBAM -1 tmp/${sample}.1.fq.bak -2 tmp/${sample}.2.fq.bak

awk 'NR%4==1 {print $0"/1"} NR%4 !=1 {print $0}' tmp/${sample}.1.fq.bak >tmp/${sample}.uBAM_1.fq
awk 'NR%4==1 {print $0"/2"} NR%4 !=1 {print $0}' tmp/${sample}.2.fq.bak >tmp/${sample}.uBAM_2.fq

ref_dir="~/reference_genome/GRC38_p14/hg38_ucsc/STAR_index/"
rm tmp/${sample}/ -rf
/share/apps/softwares/STAR-2.7.10b/bin/Linux_x86_64_static/STAR --twopassMode Basic\
--quantMode TranscriptomeSAM GeneCounts\
--runThreadN 6\
--genomeDir ${ref_dir}\
--alignIntronMin 20\
--alignIntronMax 50000\
--outSAMtype BAM Unsorted\
--outSAMattrRGline ID:"${sample}" SM:"${sample}" PL:BGI\
--outFilterMismatchNmax 2\
--outSJfilterReads Unique --outSAMmultNmax 1\
--outFileNamePrefix ./0.1.uBAM/${sample}/${sample}.\
--outSAMmapqUnique 60\
--outSAMunmapped Within KeepPairs\
--readFilesCommand cat\
--readFilesIn tmp/${sample}.uBAM_1.fq tmp/${sample}.uBAM_2.fq\
--outTmpDir ./tmp/${sample}
3)After first mapping, I sorted result bam file using picard, and ZipperBams;
java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam I=0.1.uBAM/${sample}/${sample}.Aligned.out.bam\
O=0.1.uBAM/${sample}/${sample}.mapped.picard.sorted.BAM\
SORT_ORDER=queryname
java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar ZipperBams \
--unmapped 0.1.uBAM/${sample}.picard.sorted.uBAM \
--input 0.1.uBAM/${sample}/${sample}.mapped.picard.sorted.BAM \
--ref ${ref}/../hg38.p14.fa \
--output 0.1.uBAM/${sample}.umi.mapped.bam

All commands above runs fine, but After I GroupReadsByUmi, CallMolecularConsensusReads, FilterConsensusReads, and mapping them sencond time, ZipperBams have problem;
4)java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar GroupReadsByUmi \
--input=${dir01}/1.umi.mapped.bam \
--output=${dir02}/1.umi.group.BAM \
--strategy=paired \
--min-map-q=10 \
--edits=1 \
--raw-tag=RX \ --family-size-histogram=${dir02}/1.umi.tag_family_histogram.txt`

java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar CallMolecularConsensusReads \
--min-reads=1 \
--min-input-base-quality=20 \
--input=${dir02}/1.umi.group.BAM \
--output=${dir03}/1.consensus.uBAM \
--threads=4 \
--read-group-id=false

java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar FilterConsensusReads \
--input=${dir03}/1.consensus.uBAM \
--output=${dir03}/1.consensus.filter.uBAM \
--ref="${ref}/../hg38.p14.fa" --min-reads=3 \
--max-read-error-rate=0.2 \
--max-base-error-rate=0.2 \
--min-base-quality=30 \
--max-no-call-fraction=0.20

java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam \
I=${dir03}/1.consensus.filter.uBAM\
O=${dir03}/1.consensus.filter.sorted.uBAM\
SORT_ORDER=queryname

samtools fastq ./${dir03}/1.consensus.filter.sorted.uBAM -1 tmp/1.consensus.uBAM_1.fq.bak -2 tmp/1.consensus.uBAM_2.fq.bak
awk 'NR%4 ==1 {print $0"/1"} NR%4 !=1 {print$0}' tmp/1.consensus.uBAM_1.fq.bak >tmp/1.consensus.uBAM_1.fq
awk 'NR%4==1 {print $0"/2"} NR%4!=1 {print$0}' tmp/1.consensus.uBAM_2.fq.bak >tmp/1.consensus.uBAM_2.fq

STAR alignment code is like it above, and then picard sorted;
then ZipperBams;
java -Xmx10G -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar ZipperBams \
--unmapped ./${dir03}/1.consensus.filter.sorted.uBAM \
--ref ${ref}/../hg38.p14.fa \
--input ${dir03}/1.consensus.mapped.picard.sorted.bam \
--output ${dir03}/1.consensus.mapped.zipper.bam

Tips: using picard sortBam is because I used to use picard mergeBam to merge unmapped and mapped reads; I know the sort_order need to be the same when using picard mergeBam; (why I didnt use samtools sort is that some BAM files didnt support samtools's sort somehow.)

Did you mean that '/A' is necessary for ZipperBams? And that should be retain whatever I run command on it?
now I think my fault is misuse the STAR alignment.