Mosquito short read alignment pipeline - validation
alimanfoo opened this issue · 13 comments
Issue for discussion of work to compare outputs of the short read alignment pipeline with expected outputs from previous pipeline implementation.
Surfacing info from @gbggrant:
$ samtools idxstats test.AN0131-C.bam | sort
* 0 0 113112
2L 49364325 8046719 37697
2R 61545105 9509957 43065
3L 41963435 6640126 37168
3R 53200684 8478808 46014
Mt 15363 306422 123
UNKN 42389979 7699610 26255
X 24393108 2701566 32356
Y_unplaced 237045 2605961 1746
$ samtools idxstats truth.AN0131C.bam | sort
* 0 0 4665880
2L 49364325 6869952 87343
2R 61545105 8149276 101376
3L 41963435 5590331 80789
3R 53200684 7027770 93925
UNKN 42389979 6796285 422736
X 24393108 2114527 69684
Y_unplaced 237045 2504227 70917
$ samtools flagstat test.AN0131-C.bam
46326705 + 0 in total (QC-passed reads + QC-failed reads)
1681687 + 0 secondary
0 + 0 supplementary
2442902 + 0 duplicates
45989169 + 0 mapped (99.27% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
41030516 + 0 properly paired (91.90% : N/A)
44083058 + 0 with itself and mate mapped
224424 + 0 singletons (0.50% : N/A)
1959244 + 0 with mate mapped to a different chr
890066 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat truth.AN0131C.bam
44645018 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
3274737 + 0 duplicates
39052368 + 0 mapped (87.47% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
37112858 + 0 properly paired (83.13% : N/A)
38125598 + 0 with itself and mate mapped
926770 + 0 singletons (2.08% : N/A)
515812 + 0 with mate mapped to a different chr
247322 + 0 with mate mapped to a different chr (mapQ>=5)
Some quick stats ^^
‘test’ is the output of the new pipeline. ‘truth’ is the expected output.
Interesting? that the new pipeline aligned to Mt where the old did not.
Interesting? that the new pipeline aligned to Mt where the old did not.
That's expected, we previously aligned to the AgamP3 reference which is exactly the same as AgamP4 expect it doesn't have the Mt.
Hm, stats are totally different. Not sure where to start investigating this. Even the total number of reads is different.
This could be down to differences in the input data, at least in part. Maybe I made a mistake somewhere in pointing to the lanelets to use from ENA. I'll double check that.
However, another very noticeable difference is the fraction of mapped reads, truth has 87% whereas test has 99%. I'm slightly surprised to get 99% of reads mapping in any sample.
@gbggrant could you post idxstats and flagstat comparisons for the other test sample? Curious to see if that also mismatches in a similar way.
@alimanfoo Here are the idxstats and flagstat comparisons for AB0252_C:
samtools idxstats test.AB0252-C.bam | sort
* 0 0 108628
2L 49364325 8584887 51411
2R 61545105 10375873 47355
3L 41963435 7155320 41603
3R 53200684 9046717 49025
Mt 15363 429562 165
UNKN 42389979 6957552 24774
X 24393108 4673483 47190
Y_unplaced 237045 39865 585
samtools idxstats truth.AB0252_C.bam | sort
* 0 0 6111882
2L 49364325 6469850 99498
2R 61545105 8852164 115212
3L 41963435 5972921 89995
3R 53200684 7474801 106419
UNKN 42389979 5962884 445034
X 24393108 3837120 89736
Y_unplaced 237045 15221 4887
samtools flagstat test.AB0252-C.bam
47633995 + 0 in total (QC-passed reads + QC-failed reads)
1986371 + 0 secondary
0 + 0 supplementary
322316 + 0 duplicates
47263259 + 0 mapped (99.22% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
41953716 + 0 properly paired (91.91% : N/A)
45014780 + 0 with itself and mate mapped
262108 + 0 singletons (0.57% : N/A)
2051784 + 0 with mate mapped to a different chr
932352 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat truth.AB0252_C.bam
45647624 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
864369 + 0 duplicates
38584961 + 0 mapped (84.53% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
36864882 + 0 properly paired (80.76% : N/A)
37634180 + 0 with itself and mate mapped
950781 + 0 singletons (2.08% : N/A)
422608 + 0 with mate mapped to a different chr
215717 + 0 with mate mapped to a different chr (mapQ>=5)
Thanks @gbggrant, I now realise that I have made a snafu here, my apologies. The "truth" BAMs I pointed you to in fact come from a previous version of the pipeline which used bwa aln/sampe
and not the more recent version which used bwa mem
and should match your implementation. That would explain the big different in the percentage of reads mapped, I believe bwa mem
is a much more sensitive aligner.
The correct truth BAM files for these two samples (AB0252-C, AN0131-C.bam) are on lustre at Sanger, I'll put them somewhere publicly accessible and update the fixture here on github.
Here's some stats for the correct truth BAMs (located here at Sanger: /lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/output/vo_agam_indelrealign/
) ...
$ samtools flagstat AB0252-C.bam
47633790 + 0 in total (QC-passed reads + QC-failed reads)
639447 + 0 duplicates
47263055 + 0 mapped (99.22%:-nan%)
47633790 + 0 paired in sequencing
23794474 + 0 read1
23839316 + 0 read2
42918707 + 0 properly paired (90.10%:-nan%)
46947793 + 0 with itself and mate mapped
315262 + 0 singletons (0.66%:-nan%)
3347291 + 0 with mate mapped to a different chr
1138823 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AB0252-C.bam
2R 61545105 10375637 47276
3R 53200684 9047045 48995
2L 49364325 8585966 51444
UNKN 42389979 6955138 24858
3L 41963435 7155615 41609
X 24393108 4674250 47189
Y_unplaced 237045 39829 567
Mt 15363 429575 169
* 0 0 108628
$ samtools flagstat AN0131-C.bam
46326540 + 0 in total (QC-passed reads + QC-failed reads)
3261375 + 0 duplicates
45989009 + 0 mapped (99.27%:-nan%)
46326540 + 0 paired in sequencing
23103905 + 0 read1
23222635 + 0 read2
41802625 + 0 properly paired (90.23%:-nan%)
45721722 + 0 with itself and mate mapped
267287 + 0 singletons (0.58%:-nan%)
3038022 + 0 with mate mapped to a different chr
1059942 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AN0131-C.bam
2R 61545105 9509422 43199
3R 53200684 8478536 45918
2L 49364325 8047104 37656
UNKN 42389979 7699440 26403
3L 41963435 6641715 37019
X 24393108 2700347 32358
Y_unplaced 237045 2606018 1742
Mt 15363 306427 124
* 0 0 113112
Statistics comparison now looking much better 😄
Thanks. It's too late now, but I should have posted the @pg record from the previously referenced 'truth' bam, which shows the usage of bwa aln and other differing tools:
@PG ID:bam_calculate_bq2 PN:samtools PP:gatk_indel_realigner_gatk22 VN:0.1.18-r572 CL:samtools calmd -Erb $bam_file $reference_fasta > $bq_bam_file
@PG ID:bam_fix_bam PN:picard PP:sam_to_fixed_bam VN:1.96 CL:java $jvm_args -jar CleanSam.jar INPUT=$bam_file OUTPUT=$fixed_bam_file VALIDATION_STRINGENCY=SILENT
@PG ID:bam_fix_mates PN:picard PP:bam_fix_bam VN:1.96 CL:java $jvm_args -jar FixMateInformation.jar INPUT=$bam_file OUTPUT=$fixmate_bam_file SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=0
@PG ID:bwa_aln_fastq PN:bwa PP:bwa_index VN:0.6.2-r126 CL:bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file
@PG ID:bwa_index PN:bwa VN:0.6.2-r126 CL:bwa index -a bwtsw $reference_fasta
@PG ID:bwa_sam PN:bwa PP:bwa_aln_fastq VN:0.6.2-r126 CL:bwa sampe -a 1500 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)
@PG ID:gatk_indel_realigner_gatk22 PN:GenomeAnalysisTK PP:gatk_realigner_target_creator_gatk22 VN:2.4-9-g532efad CL:java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -LOD 0.4 -model USE_SW -compress 0
@PG ID:gatk_left_align_indels_gatk2 PN:GenomeAnalysisTK PP:bam_fix_mates VN:2.4-9-g532efad CL:java $jvm_args -jar GenomeAnalysisTK.jar -T LeftAlignIndels -R $ref -I $bam_files -o $indels_left_aligned_bam_file
@PG ID:gatk_realigner_target_creator_gatk22 PN:GenomeAnalysisTK PP:gatk_left_align_indels_gatk2 VN:2.4-9-g532efad CL:java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -I $input_bam -o $intervals_file
@PG ID:sam_to_fixed_bam PN:samtools PP:bwa_sam VN:0.1.18-r572 CL:samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file
@PG ID:GATK IndelRealigner VN:2.4-9-g532efad CL:knownAlleles=[] targetIntervals=/lustre/scratch110/malaria/ag1000g/output/c/3/4/b/92899/1_gatk_realigner_target_creator_gatk2/pe.1.bam.intervals LODThresholdForCleaning=0.4 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null```
Belatedly pointing to the outputs of runs of AN0131-C and AB0252-c through the Short Read Alignment Pipeline:
AN0131-C has cromwell id: 99d5d6b4-4307-4eca-9ef5-119f55051692, it's workflow execution directory is: /lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692
AB0252-C has cromwell id: fa0f695b-f043-4d20-95d1-1ae6aa684882, it's workflow execution directory is:
/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882
Outputs from the pipeline for AN0131-C:
"outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-RealignerTargetCreator/cacheCopy/execution/AN0131-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-GatkCallableLoci/cacheCopy/execution/AN0131-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-SamtoolsStats/execution/AN0131-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ValidateSamFile/cacheCopy/execution/AN0131-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-0/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-1/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-2/cacheCopy/execution/AN0131-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-IndelRealigner/cacheCopy/execution/AN0131-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-0/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-1/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-2/cacheCopy/execution/AN0131-C.bam" ] },
Outputs from the pipeline for AB0252-C:
"outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-RealignerTargetCreator/execution/AB0252-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-GatkCallableLoci/execution/AB0252-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-SamtoolsStats/execution/AB0252-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ValidateSamFile/execution/AB0252-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-0/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-1/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-2/execution/AB0252-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-IndelRealigner/execution/AB0252-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-0/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-1/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-2/execution/AB0252-C.bam" ] },
Let me know if you'd like me to copy them elsewhere.
Closing as complete.