malariagen/pipelines

Mosquito short read alignment pipeline - validation

alimanfoo opened this issue · 13 comments

Issue for discussion of work to compare outputs of the short read alignment pipeline with expected outputs from previous pipeline implementation.

Surfacing info from @gbggrant:


$ samtools idxstats test.AN0131-C.bam | sort
*	0	0	113112
2L	49364325	8046719	37697
2R	61545105	9509957	43065
3L	41963435	6640126	37168
3R	53200684	8478808	46014
Mt	15363	306422	123
UNKN	42389979	7699610	26255
X	24393108	2701566	32356
Y_unplaced	237045	2605961	1746
$ samtools idxstats truth.AN0131C.bam | sort
*	0	0	4665880
2L	49364325	6869952	87343
2R	61545105	8149276	101376
3L	41963435	5590331	80789
3R	53200684	7027770	93925
UNKN	42389979	6796285	422736
X	24393108	2114527	69684
Y_unplaced	237045	2504227	70917
$ samtools flagstat test.AN0131-C.bam 
46326705 + 0 in total (QC-passed reads + QC-failed reads)
1681687 + 0 secondary
0 + 0 supplementary
2442902 + 0 duplicates
45989169 + 0 mapped (99.27% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
41030516 + 0 properly paired (91.90% : N/A)
44083058 + 0 with itself and mate mapped
224424 + 0 singletons (0.50% : N/A)
1959244 + 0 with mate mapped to a different chr
890066 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat truth.AN0131C.bam 
44645018 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
3274737 + 0 duplicates
39052368 + 0 mapped (87.47% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
37112858 + 0 properly paired (83.13% : N/A)
38125598 + 0 with itself and mate mapped
926770 + 0 singletons (2.08% : N/A)
515812 + 0 with mate mapped to a different chr
247322 + 0 with mate mapped to a different chr (mapQ>=5)

Some quick stats ^^

‘test’ is the output of the new pipeline. ‘truth’ is the expected output.

Interesting? that the new pipeline aligned to Mt where the old did not.

Interesting? that the new pipeline aligned to Mt where the old did not.

That's expected, we previously aligned to the AgamP3 reference which is exactly the same as AgamP4 expect it doesn't have the Mt.

Hm, stats are totally different. Not sure where to start investigating this. Even the total number of reads is different.

This could be down to differences in the input data, at least in part. Maybe I made a mistake somewhere in pointing to the lanelets to use from ENA. I'll double check that.

However, another very noticeable difference is the fraction of mapped reads, truth has 87% whereas test has 99%. I'm slightly surprised to get 99% of reads mapping in any sample.

@gbggrant could you post idxstats and flagstat comparisons for the other test sample? Curious to see if that also mismatches in a similar way.

@alimanfoo Here are the idxstats and flagstat comparisons for AB0252_C:

samtools idxstats test.AB0252-C.bam | sort
*	0	0	108628
2L	49364325	8584887	51411
2R	61545105	10375873	47355
3L	41963435	7155320	41603
3R	53200684	9046717	49025
Mt	15363	429562	165
UNKN	42389979	6957552	24774
X	24393108	4673483	47190
Y_unplaced	237045	39865	585
samtools idxstats truth.AB0252_C.bam | sort
*	0	0	6111882
2L	49364325	6469850	99498
2R	61545105	8852164	115212
3L	41963435	5972921	89995
3R	53200684	7474801	106419
UNKN	42389979	5962884	445034
X	24393108	3837120	89736
Y_unplaced	237045	15221	4887
samtools flagstat test.AB0252-C.bam
47633995 + 0 in total (QC-passed reads + QC-failed reads)
1986371 + 0 secondary
0 + 0 supplementary
322316 + 0 duplicates
47263259 + 0 mapped (99.22% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
41953716 + 0 properly paired (91.91% : N/A)
45014780 + 0 with itself and mate mapped
262108 + 0 singletons (0.57% : N/A)
2051784 + 0 with mate mapped to a different chr
932352 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat truth.AB0252_C.bam
45647624 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
864369 + 0 duplicates
38584961 + 0 mapped (84.53% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
36864882 + 0 properly paired (80.76% : N/A)
37634180 + 0 with itself and mate mapped
950781 + 0 singletons (2.08% : N/A)
422608 + 0 with mate mapped to a different chr
215717 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks @gbggrant, I now realise that I have made a snafu here, my apologies. The "truth" BAMs I pointed you to in fact come from a previous version of the pipeline which used bwa aln/sampe and not the more recent version which used bwa mem and should match your implementation. That would explain the big different in the percentage of reads mapped, I believe bwa mem is a much more sensitive aligner.

The correct truth BAM files for these two samples (AB0252-C, AN0131-C.bam) are on lustre at Sanger, I'll put them somewhere publicly accessible and update the fixture here on github.

Here's some stats for the correct truth BAMs (located here at Sanger: /lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/output/vo_agam_indelrealign/) ...

$ samtools flagstat AB0252-C.bam
47633790 + 0 in total (QC-passed reads + QC-failed reads)
639447 + 0 duplicates
47263055 + 0 mapped (99.22%:-nan%)
47633790 + 0 paired in sequencing
23794474 + 0 read1
23839316 + 0 read2
42918707 + 0 properly paired (90.10%:-nan%)
46947793 + 0 with itself and mate mapped
315262 + 0 singletons (0.66%:-nan%)
3347291 + 0 with mate mapped to a different chr
1138823 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AB0252-C.bam
2R	61545105	10375637	47276
3R	53200684	9047045	48995
2L	49364325	8585966	51444
UNKN	42389979	6955138	24858
3L	41963435	7155615	41609
X	24393108	4674250	47189
Y_unplaced	237045	39829	567
Mt	15363	429575	169
*	0	0	108628
$ samtools flagstat AN0131-C.bam
46326540 + 0 in total (QC-passed reads + QC-failed reads)
3261375 + 0 duplicates
45989009 + 0 mapped (99.27%:-nan%)
46326540 + 0 paired in sequencing
23103905 + 0 read1
23222635 + 0 read2
41802625 + 0 properly paired (90.23%:-nan%)
45721722 + 0 with itself and mate mapped
267287 + 0 singletons (0.58%:-nan%)
3038022 + 0 with mate mapped to a different chr
1059942 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AN0131-C.bam
2R	61545105	9509422	43199
3R	53200684	8478536	45918
2L	49364325	8047104	37656
UNKN	42389979	7699440	26403
3L	41963435	6641715	37019
X	24393108	2700347	32358
Y_unplaced	237045	2606018	1742
Mt	15363	306427	124
*	0	0	113112

Also submitted PR #23 to correct the fixture data in this repo.

Statistics comparison now looking much better 😄

Thanks. It's too late now, but I should have posted the @pg record from the previously referenced 'truth' bam, which shows the usage of bwa aln and other differing tools:

@PG	ID:bam_calculate_bq2	PN:samtools	PP:gatk_indel_realigner_gatk22	VN:0.1.18-r572	CL:samtools calmd -Erb $bam_file $reference_fasta > $bq_bam_file
@PG	ID:bam_fix_bam	PN:picard	PP:sam_to_fixed_bam	VN:1.96	CL:java $jvm_args -jar CleanSam.jar INPUT=$bam_file OUTPUT=$fixed_bam_file VALIDATION_STRINGENCY=SILENT
@PG	ID:bam_fix_mates	PN:picard	PP:bam_fix_bam	VN:1.96	CL:java $jvm_args -jar FixMateInformation.jar INPUT=$bam_file OUTPUT=$fixmate_bam_file SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=0
@PG	ID:bwa_aln_fastq	PN:bwa	PP:bwa_index	VN:0.6.2-r126	CL:bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file
@PG	ID:bwa_index	PN:bwa	VN:0.6.2-r126	CL:bwa index -a bwtsw $reference_fasta
@PG	ID:bwa_sam	PN:bwa	PP:bwa_aln_fastq	VN:0.6.2-r126	CL:bwa sampe -a 1500 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)
@PG	ID:gatk_indel_realigner_gatk22	PN:GenomeAnalysisTK	PP:gatk_realigner_target_creator_gatk22	VN:2.4-9-g532efad	CL:java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -LOD 0.4 -model USE_SW -compress 0
@PG	ID:gatk_left_align_indels_gatk2	PN:GenomeAnalysisTK	PP:bam_fix_mates	VN:2.4-9-g532efad	CL:java $jvm_args -jar GenomeAnalysisTK.jar -T LeftAlignIndels -R $ref -I $bam_files -o $indels_left_aligned_bam_file 
@PG	ID:gatk_realigner_target_creator_gatk22	PN:GenomeAnalysisTK	PP:gatk_left_align_indels_gatk2	VN:2.4-9-g532efad	CL:java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -I $input_bam -o $intervals_file 
@PG	ID:sam_to_fixed_bam	PN:samtools	PP:bwa_sam	VN:0.1.18-r572	CL:samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file
@PG	ID:GATK IndelRealigner	VN:2.4-9-g532efad	CL:knownAlleles=[] targetIntervals=/lustre/scratch110/malaria/ag1000g/output/c/3/4/b/92899/1_gatk_realigner_target_creator_gatk2/pe.1.bam.intervals LODThresholdForCleaning=0.4 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null```

Belatedly pointing to the outputs of runs of AN0131-C and AB0252-c through the Short Read Alignment Pipeline:
AN0131-C has cromwell id: 99d5d6b4-4307-4eca-9ef5-119f55051692, it's workflow execution directory is: /lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692

AB0252-C has cromwell id: fa0f695b-f043-4d20-95d1-1ae6aa684882, it's workflow execution directory is:
/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882

Outputs from the pipeline for AN0131-C:
"outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-RealignerTargetCreator/cacheCopy/execution/AN0131-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-GatkCallableLoci/cacheCopy/execution/AN0131-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-SamtoolsStats/execution/AN0131-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ValidateSamFile/cacheCopy/execution/AN0131-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-0/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-1/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-2/cacheCopy/execution/AN0131-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-IndelRealigner/cacheCopy/execution/AN0131-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-0/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-1/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-2/cacheCopy/execution/AN0131-C.bam" ] },

Outputs from the pipeline for AB0252-C:
"outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-RealignerTargetCreator/execution/AB0252-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-GatkCallableLoci/execution/AB0252-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-SamtoolsStats/execution/AB0252-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ValidateSamFile/execution/AB0252-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-0/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-1/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-2/execution/AB0252-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-IndelRealigner/execution/AB0252-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-0/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-1/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-2/execution/AB0252-C.bam" ] },
Let me know if you'd like me to copy them elsewhere.

Closing as complete.