Upgrade PE Statistics with new features
Closed this issue · 2 comments
owlang commented
An earlier iteration of BAM Statistics - Paired End Statistics in Galaxy's Tool Shed (https://toolshed.g2.bx.psu.edu/repository?repository_id=e903b62725d63ab1) outputs histogram information in a slightly different format from the current paired-end statistics tool. Update the pe-stat tool to mimic the Galaxy instance.
owlang commented
The current behavior prints a timestamp and the insert values. The Tool Shed instance prints the filename, some chromosome-specific statistics, some BAM-header info, some statistics about the insert-size histogram, and finally the insert values.
The new format will be very similar to the Tool Shed instance's output, with the addition of a timestamp at the top. For example:
# 2023-10-14 18:27:58.193
# 12141_Reb1.bam
# Chromosome_ID Chromosome_Size Aligned_Reads
# chr1 230218 30776.0
# chr2 813184 121272.0
# chr3 316620 55314.0
# chr4 1531933 233911.0
# chr5 576874 101752.0
# chr6 270161 45332.0
# chr7 1090940 182287.0
# chr8 562643 96292.0
# chr9 439888 73105.0
# chr10 745751 117156.0
# chr11 666816 103885.0
# chr12 1078177 506539.0
# chr13 924431 163162.0
# chr14 784333 137223.0
# chr15 1091291 189904.0
# chr16 948066 160366.0
# chrM 85779 1594.0
# 2-micron 6318 22752.0
# Total Genome Size: 1.2163423E7 Total Aligned Tags: 2342622.0
# bwa # 0.7.17-r1188
# bwa mem -t 4 -v 1 -T 30 -h 5 -M /gpfs/group/bfp2/pughlab/galaxy/tool-data/sacCer3_cegr/bwa_mem_index/sacCer3_cegr/sacCer3_cegr.fa /gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/228/dataset_228399.dat /gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/228/dataset_228398.dat
# MarkDuplicates # 2.7.1-SNAPSHOT
# picard.sam.markduplicates.MarkDuplicates INPUT=[/gpfs/group/bfp2/pughlab/galaxy/files/datasets/000/322/dataset_322781.dat] OUTPUT=/gpfs/group/bfp2/pughlab/galaxy/job_working_directory/000/229/229786/galaxy_dataset_326988.dat METRICS_FILE=/gpfs/group/bfp2/pughlab/galaxy/job_working_directory/000/229/229786/galaxy_dataset_326987.dat REMOVE_DUPLICATES=false ASSUME_SORTED=true DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
# Average Insert Size: 140.24363707854494
# Median Insert Size: 126.0
# Std deviation of Insert Size: 75.77434459841997
# Number of ReadPairs: 1063073.0
# Histogram
# Size (bp) Frequency
0 0.0
1 0.0
2 15.0
3 10.0
4 15.0
5 11.0
6 22.0
7 25.0
8 22.0
...
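For anyone consuming this output downstream, here is a minimal Python sketch of reading the proposed format. It assumes only what the example shows: header lines start with `#`, and every other line is a whitespace-separated size/frequency pair. The function name `parse_pe_report` is hypothetical and not part of the tool, and the actual delimiter (tab vs. space) is not confirmed here.

```python
def parse_pe_report(text):
    """Split a report into header lines and a {size: frequency} histogram.

    Assumption: lines beginning with '#' are header/comment lines and all
    remaining non-empty lines are 'size frequency' pairs, as in the example.
    """
    header = []
    histogram = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and the elision marker from the example
        if line.startswith("#"):
            header.append(line[1:].strip())  # drop the leading '#'
            continue
        size, freq = line.split()  # split on any whitespace
        histogram[int(size)] = float(freq)
    return header, histogram


sample = """# 2023-10-14 18:27:58.193
# 12141_Reb1.bam
# Histogram
# Size (bp) Frequency
0 0.0
1 0.0
2 15.0
3 10.0
"""
header, histogram = parse_pe_report(sample)
print(header[0])     # timestamp line
print(histogram[2])  # frequency recorded for insert size 2
```

Because every non-histogram line is prefixed with `#`, a reader can ignore the header entirely and still recover the insert values, which is what the current (timestamp plus values) output provides.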
owlang commented
Additional features to add:
- add mode statistic
- add checkbox to limit average insert size calculation to insert sizes within range of the rendered histogram
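A rough Python sketch of how these two features could be computed from the histogram, not the tool's actual implementation. The function names, the inclusive range bounds, and the choice to break mode ties by taking the smallest insert size are all assumptions made for illustration.

```python
def insert_size_mode(histogram):
    """Return the insert size with the highest frequency.

    Assumption: ties are broken by choosing the smallest insert size.
    Returns None if the histogram is empty or has no positive counts.
    """
    if not histogram or max(histogram.values()) <= 0:
        return None
    # max() keeps the first maximum it sees, so iterating sizes in ascending
    # order makes the smallest size win a tie.
    return max(sorted(histogram), key=lambda size: histogram[size])


def mean_insert_size(histogram, min_size=None, max_size=None):
    """Frequency-weighted mean insert size, optionally limited to a range.

    Assumption: min_size and max_size are inclusive bounds that stand in for
    the range of the rendered histogram; None means no bound on that side.
    """
    total = 0.0
    weighted = 0.0
    for size, freq in histogram.items():
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        total += freq
        weighted += size * freq
    return weighted / total if total > 0 else None


# Frequencies taken from the first rows of the example output above.
hist = {0: 0.0, 1: 0.0, 2: 15.0, 3: 10.0, 4: 15.0,
        5: 11.0, 6: 22.0, 7: 25.0, 8: 22.0}
print(insert_size_mode(hist))        # mode over these rows
print(mean_insert_size(hist))        # mean over all rows
print(mean_insert_size(hist, 2, 4))  # mean limited to sizes 2..4
```

Leaving both bounds unset reproduces the unrestricted average, so the checkbox would only need to decide whether the histogram's rendered range is passed in.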