sagnikbanerjee15/Abridge

To Do List

sagnikbanerjee15 opened this issue · 0 comments

  • Write a Dockerfile that includes abridge, samtools, zpaq, and fclqc
  • Rewrite the abridge script to call the underlying software directly. No need to use docker and/or singularity.
  • Replace every occurrence of "informative CIGAR" with "integrated CIGAR"
  • Add an option to calculate the space the program saves for each SAM field, and report the result in the log file
  • Develop a modular approach to compression and decompression. This will be necessary for troubleshooting and also for incorporating enhancements in the future
  • #4
  • Create a single program to compress both single-end and paired-end data. Similarly, create one program to decompress both single-end and paired-end data
  • Store more information on the first line of the compressed file in addition to the flags, for example whether the data is single-end or paired-end
  • Add comments for each function
  • Add more functions and decide whether to make them inline
  • Write a function to convert numeric data to a string. Use type-casting when calling the function. Write separate functions for signed and unsigned numbers
  • Similarly, create functions for converting strings to numbers
  • Examine the code to read directly from BAM files
  • Optimize memory allocations
  • Read directly from a BAM file - https://www.biostars.org/p/44424/, https://stackoverflow.com/questions/52915853/how-to-build-a-simple-main-cpp-file-using-samtools-c-api, https://samtools.sourceforge.net/sam-exam.shtml
  • Incorporate Sambamba and BAM in the comparison. Also, compare across a range of compression levels
  • Perform tests with SAM/BAM files that contain CIGAR without mismatch indicators and also CIGAR with mismatch indicators
  • Compile the Rust code and check if it could be made faster with the C compiler
  • Consider removing the section where a multi-line FASTA file is generated. Instead, modify the code to read directly from a multi-line FASTA file
  • Prepare the CWL workflow for carrying out all comparisons. Write a single workflow for both RNA-Seq and DNA-Seq reads
  • Write a launcher for processing all the samples
  • Write CWL scripts for the following software:
    • Deez
    • Samcomp
    • CSAM
    • Samtools (BAM & CRAM)
    • Genozip2
  • Remove the adjustment done to quality scores, since this version never stores them with the iCIGAR
  • Adjust the MAPQ value. Store X in place of 255, but first check whether this achieves a substantial space reduction
  • While generating BAM and CRAM files for comparison, retain only the relevant tags - do not store everything
  • Add spring to the compressor list in place of zpaq
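
For the per-field space report item above, a minimal sketch of the accounting might look like the following. It is written in Python for brevity and is not the abridge implementation. The names field_sizes and report_savings are hypothetical, and it assumes the compressor can supply per-field byte counts for its output, which the current code may not track.

```python
from collections import Counter

# The 11 mandatory SAM columns, in the order given by the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def field_sizes(sam_lines):
    """Sum the bytes used by each SAM column over all alignment lines.

    Header lines (starting with '@') and malformed lines are skipped.
    Optional tags are pooled under the hypothetical key "TAGS".
    """
    sizes = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < len(SAM_FIELDS):
            continue
        for name, value in zip(SAM_FIELDS, cols):
            sizes[name] += len(value.encode("utf-8"))
        for tag in cols[len(SAM_FIELDS):]:
            sizes["TAGS"] += len(tag.encode("utf-8"))
    return sizes


def report_savings(before, after):
    """Format one tab-separated log line per field: name, bytes before,
    bytes after, and percentage saved.

    `after` is assumed to map field names to compressed byte counts;
    how those counts are obtained is left to the compressor.
    """
    lines = []
    for name in SAM_FIELDS + ["TAGS"]:
        b = before.get(name, 0)
        a = after.get(name, 0)
        pct = 100.0 * (b - a) / b if b else 0.0
        lines.append(f"{name}\t{b}\t{a}\t{pct:.1f}%")
    return lines
```

The lines returned by report_savings could be appended to the log file as they are. Fields missing from the compressed counts are reported as fully saved, so a real implementation would need to handle fields that are merged into the iCIGAR rather than dropped.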