CEGRcode/scriptmanager

Upgrade PE Statistics with new features

Closed this issue · 2 comments

owlang commented

An earlier iteration of BAM Statistics - Paired End Statistics in Galaxy's Tool Shed (https://toolshed.g2.bx.psu.edu/repository?repository_id=e903b62725d63ab1) outputs histogram information in a slightly different format from the current paired-end statistics tool. Update the pe-stat tool to mimic the Galaxy instance.

owlang commented

The current behavior prints a timestamp and the insert values. The Tool Shed instance prints the filename, some chromosome-specific statistics, some BAM-header info, some statistics about the insert-size histogram, and finally the insert values.

The new format will be very similar to the Tool Shed instance's, with a timestamp added at the top. For example:

# 2023-10-14 18:27:58.193
# 12141_Reb1.bam
# Chromosome_ID	Chromosome_Size	Aligned_Reads
# chr1	230218	30776.0
# chr2	813184	121272.0
# chr3	316620	55314.0
# chr4	1531933	233911.0
# chr5	576874	101752.0
# chr6	270161	45332.0
# chr7	1090940	182287.0
# chr8	562643	96292.0
# chr9	439888	73105.0
# chr10	745751	117156.0
# chr11	666816	103885.0
# chr12	1078177	506539.0
# chr13	924431	163162.0
# chr14	784333	137223.0
# chr15	1091291	189904.0
# chr16	948066	160366.0
# chrM	85779	1594.0
# 2-micron	6318	22752.0
# Total Genome Size: 1.2163423E7	Total Aligned Tags: 2342622.0
# bwa	# 0.7.17-r1188
# bwa mem -t 4 -v 1 -T 30 -h 5 -M /gpfs/group/bfp2/pughlab/galaxy/tool-data/sacCer3_cegr/bwa_mem_index/sacCer3_cegr/sacCer3_cegr.fa /gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/228/dataset_228399.dat /gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/228/dataset_228398.dat
# MarkDuplicates	# 2.7.1-SNAPSHOT
# picard.sam.markduplicates.MarkDuplicates INPUT=[/gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/322/dataset_322781.dat] OUTPUT=/gpfs/group/bfp2/pughlab/galaxy/job_working_directory/000/229/229786/galaxy_dataset_326988.dat METRICS_FILE=/gpfs/group/bfp2/pughlab/galaxy/job_working_directory/000/229/229786/galaxy_dataset_326987.dat REMOVE_DUPLICATES=false ASSUME_SORTED=true DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. OPTICAL_DUPLICATE_PIXEL_DISTANCE=100    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
# Average Insert Size: 140.24363707854494
# Median Insert Size: 126.0
# Std deviation of Insert Size: 75.77434459841997
# Number of ReadPairs: 1063073.0
# Histogram
# Size (bp)	Frequency
0	0.0
1	0.0
2	15.0
3	10.0
4	15.0
5	11.0
6	22.0
7	25.0
8	22.0
...
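
For anyone consuming this output downstream, here is a minimal sketch of reading it back in. It assumes that every header line starts with `#` and that histogram rows are tab-separated `size<TAB>frequency` pairs, as in the example above. The function name `parse_pe_stats` is hypothetical and is not part of scriptmanager.

```python
from typing import Dict, List, Tuple


def parse_pe_stats(text: str) -> Tuple[List[str], Dict[int, float]]:
    """Split pe-stat style output into header lines and a size -> frequency map.

    Assumption: header lines begin with '#', data rows are 'size<TAB>frequency'.
    """
    header: List[str] = []
    histogram: Dict[int, float] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line[1:].strip())  # drop the leading '#'
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip anything that is not a two-column data row
        histogram[int(fields[0])] = float(fields[1])
    return header, histogram


# A shortened copy of the example output, used only for illustration.
sample = "\n".join([
    "# 2023-10-14 18:27:58.193",
    "# 12141_Reb1.bam",
    "# Histogram",
    "# Size (bp)\tFrequency",
    "0\t0.0",
    "1\t0.0",
    "2\t15.0",
    "3\t10.0",
])
header, histogram = parse_pe_stats(sample)
```

Keeping the metadata behind `#` means a reader that only wants the insert values can skip every comment line and still get the same two-column table the current tool prints.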
owlang commented

Additional features to add:

  • add mode statistic
  • add checkbox to limit average insert size calculation to insert sizes within range of the rendered histogram
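
A rough sketch of how the two additions could be computed from the insert-size histogram, written in Python for illustration only. The function names, the `min_size`/`max_size` parameters, and the tie-breaking rule for the mode (smallest size wins) are all assumptions; the issue does not specify them.

```python
from typing import Dict, Optional


def insert_size_mode(histogram: Dict[int, float]) -> Optional[int]:
    """Return the insert size with the highest frequency.

    Assumption: ties are broken in favour of the smallest insert size.
    Returns None when the histogram has no counts at all.
    """
    if not histogram or max(histogram.values()) <= 0:
        return None
    return max(histogram, key=lambda size: (histogram[size], -size))


def mean_insert_size(
    histogram: Dict[int, float],
    min_size: Optional[int] = None,
    max_size: Optional[int] = None,
) -> Optional[float]:
    """Frequency-weighted mean insert size, optionally limited to a range.

    Passing the bounds of the rendered histogram as min_size/max_size
    (inclusive) mimics the proposed checkbox; leaving them as None uses
    every insert size.
    """
    total = 0.0
    weighted = 0.0
    for size, freq in histogram.items():
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        total += freq
        weighted += size * freq
    return weighted / total if total > 0 else None


# Made-up histogram with one large outlier, for illustration only.
example = {0: 0.0, 2: 15.0, 3: 10.0, 4: 15.0, 1000: 5.0}
mode = insert_size_mode(example)
mean_all = mean_insert_size(example)
mean_in_range = mean_insert_size(example, min_size=0, max_size=4)
```

The outlier at 1000 bp pulls the unrestricted mean far above the bulk of the distribution, which is the case the range-limited option is meant to handle.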