gavieira/mitofree

To-do list

gavieira opened this issue · 0 comments

  • The CAP3 problem occurs because the script tries to run CAP3 before the NOVOPlasty assembly has finished. The print statement that signals the end of the NOVOPlasty assembly fires immediately after the program is launched, not when it completes. Fix this so that CAP3 runs only after the NOVOPlasty process is over.
  • Use all NOVOPlasty contigs as seeds for MITObim assemblies, one by one, and try to merge the results later. If they cannot be merged, annotate and store the mitogenome as 2 separate contigs. Problem: MITObim assemblies tend to get a little messy at the ends of their contigs.
  • Add a flag that switches the download method from 'ftp' to 'ascp' (instructions in Evernote).
  • Allow the user to start directly at the MITObim step with manually merged contigs.
  • Add to folder names: "Circular", "All_features" and "Partial", depending on the assembly results
  • Use CAP3 to merge NOVOPlasty contigs
  • Link script to Taxonomy database (Check by taxonomy ID how many datasets with no mitogenome could be assembled)
  • Use genechecker to annotate and check if the sequence has been circularized
  • Look for MITOS2 software
  • Option allowing the user to select the max number of parallel assemblies
  • Create function to find seed automatically. It could look for COI sequences for the species, and then go up the taxonomic classification (genus, then tribe...). Alternatively, we could use Norgal to generate a seed that will be used by NOVOPlasty.
  • Add an option to supply the seed separately.
  • Create function to circularize sequence. If length of sequence > 150000 (just an example), compare its extremities (sliding window? Which k-mer?) and if they match, cut the excess and obtain the circularized mitogenome.
  • Create function to merge contigs using sliding window at the edges of the sequences
  • Use this "merging contig" algorithm to merge results from assemblies using: i) multiple k-mers; and/or ii) multiple seeds
  • Is there a way to use the datasets to assemble nuclear sequences (such as the rRNA), compare them across all samples, and build a phylogenomic tree from the nuclear genes that are present in all species? For sequences that would be used only to build the tree, how would we generate them? Should we search for them automatically?
  • OPTIONAL FLAG: --trim. Automatic trimming (Using Trimmomatic?) of data before it is used as input for MITObim.
  • Allow script to accept any version of NOVOPlasty installed in $PATH
  • Improve error catching for NOVOPlasty
  • Improve error catching for the SRA dataset download (could run a bash command and check its exit status, '$?', to identify the problem)
  • Implement parallelism - Download next dataset while the previous is being assembled
  • (Not related) Create a script that adds aliases for simple bioinformatics bash scripts (such as counting the number of reads and converting seqin files) to .bashrc
  • (Not related) Create GitHub Pages sites for my projects.
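
For the CAP3 ordering item, one possible approach is to launch each external program with a blocking call and print the completion message only after the call returns. The sketch below is not the script's current code: `run_and_wait` and `assemble_then_merge` are hypothetical helper names, and the actual NOVOPlasty and CAP3 command lines are left to the caller.

```python
import subprocess
import sys


def run_and_wait(cmd, label):
    """Run cmd, block until it exits, and only then report completion."""
    # subprocess.run() returns only after the child process has exited,
    # unlike subprocess.Popen(), which returns immediately.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{label} failed with exit status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    print(f"{label} finished")
    return result.stdout


def assemble_then_merge(assembler_cmd, merger_cmd):
    """Run the merger only after the assembler has exited successfully."""
    # Both command lists are supplied by the caller (hypothetical here).
    run_and_wait(assembler_cmd, "assembly")
    return run_and_wait(merger_cmd, "merge")


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without any assembler installed.
    step = [sys.executable, "-c", "print('ok')"]
    assemble_then_merge(step, step)
```

Because a non-zero exit status raises an exception, the same helper could also serve the error-catching items for NOVOPlasty and the SRA download.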
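
The circularization item could start from something like the following minimal sketch. It assumes exact matching between the two ends and ignores sequencing errors, so a real version would probably need to tolerate mismatches. The function name `circularize` and the `min_overlap` parameter are illustrative, not part of the script.

```python
def circularize(seq, min_overlap=20):
    """Return (sequence, is_circular).

    If the start of seq is repeated at its end, the repeated end is
    trimmed and is_circular is True. Exact matches only (assumption).
    """
    # Try the longest candidate overlap first, down to min_overlap.
    max_len = len(seq) // 2
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:-k], True
    return seq, False


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTAACCGGTATC"
    # Simulate an assembly that ran past the origin by 8 bases.
    assembled = genome + genome[:8]
    trimmed, is_circular = circularize(assembled, min_overlap=5)
    print(trimmed == genome, is_circular)
```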
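
For the contig-merging item, a simple starting point is to look for the longest exact overlap between the end of one contig and the start of the other. This is only a sketch under stated assumptions: `merge_contigs` is a hypothetical name, matching is exact, and reverse-complement orientations are not checked.

```python
def merge_contigs(left, right, min_overlap=20):
    """Merge two contigs if a suffix of left equals a prefix of right.

    Returns the merged sequence, or None when no overlap of at least
    min_overlap bases is found. Exact matches only (assumption).
    """
    max_len = min(len(left), len(right))
    # Prefer the longest overlap so the merged contig is not inflated.
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTAACC"
    b = "CTTAACCGGTATCGGA"
    print(merge_contigs(a, b, min_overlap=5))
```

The same function could be applied pairwise to contigs from assemblies run with multiple k-mers or multiple seeds.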
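
The parallelism item (download the next dataset while the previous one is being assembled) can be sketched with a single background worker from the standard library. The `download` and `assemble` callables are placeholders supplied by the caller; `run_pipeline` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(accessions, download, assemble):
    """Assemble datasets in order while prefetching the next download."""
    results = []
    if not accessions:
        return results
    # One worker thread is enough: at most one download runs in the
    # background while the main thread assembles the previous dataset.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download, accessions[0])
        for i in range(len(accessions)):
            data = pending.result()  # wait for the current download
            if i + 1 < len(accessions):
                # Start fetching the next dataset before assembling.
                pending = pool.submit(download, accessions[i + 1])
            results.append(assemble(data))
    return results


if __name__ == "__main__":
    # Stand-in functions; real ones would call the downloader/assembler.
    print(run_pipeline(["A", "B"], lambda acc: acc.lower(), lambda d: d + "_asm"))
```

Raising `max_workers` and submitting assemblies to a pool as well would be one way to expose a user-selectable limit on parallel assemblies.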