To-do list

Question

gavieira opened this issue 6 years ago · 0 comments

The CAP3 problem occurs because the script tries to use CAP3 before the NOVOPlasty assembly has finished. The print statement that signals the end of the NOVOPlasty assembly fires right after the program is launched, not when it completes. Fix this so that CAP3 runs only after the NOVOPlasty process is over.
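One possible shape for this fix, sketched in Python under the assumption that the script launches its tools as child processes: `subprocess.run` blocks until the child exits, so a later step cannot start early. The function name `run_in_order` and the stand-in commands are hypothetical; the real NOVOPlasty and CAP3 invocations would take their place.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command to completion before starting the next one."""
    for cmd in commands:
        # subprocess.run blocks until the child process exits.
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps never start after a failed one.
        subprocess.run(cmd, check=True)
        # Only reached once the process has really finished.
        print(f"finished: {cmd[0]}")


if __name__ == "__main__":
    # Stand-in commands (assumption): replace with the assembler call
    # followed by the merging call.
    run_in_order([
        [sys.executable, "-c", "import time; time.sleep(0.2)"],
        [sys.executable, "-c", "print('second step')"],
    ])
```

If the script currently uses `subprocess.Popen`, calling `.wait()` on the returned object before moving on would have the same effect.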
Use all NOVOPlasty contigs for MITObim assemblies, one by one, and try to merge them later. If they cannot be merged, annotate and store the mitogenome as 2 separate contigs. Problem: MITObim assemblies tend to get a little messy at the ends of their contigs.
Add a flag that changes the download method from 'ftp' to 'ascp' (instructions in Evernote)
Allow the user to go straight to MITObim with manually merged contigs.
Add to folder names: "Circular", "All_features" and "Partial", depending on the assembly results
Use CAP3 to merge NOVOPlasty contigs
Link script to Taxonomy database (Check by taxonomy ID how many datasets with no mitogenome could be assembled)
Use genechecker to annotate and check if the sequence has been circularized
Look for MITOS2 software
Option allowing the user to select the max number of parallel assemblies
Create function to find seed automatically. It could look for COI sequences for the species, and then go up the taxonomic classification (genus, then tribe...). Alternatively, we could use Norgal to generate a seed that will be used by NOVOPlasty.
Add an option to supply the seed separately
Create function to circularize sequence. If length of sequence > 150000 (just an example), compare its extremities (sliding window? Which k-mer?) and if they match, cut the excess and obtain the circularized mitogenome.
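A minimal sketch of the circularization idea, assuming exact matching between the two ends is good enough as a first pass. The function name `circularize` and the `min_overlap` default are hypothetical, and a real implementation would probably need to tolerate mismatches (for example with the sliding window or k-mer comparison mentioned above).

```python
def circularize(seq, min_overlap=20):
    """Trim a duplicated end from a linear assembly of a circular genome.

    Looks for the longest suffix of `seq` that is identical to its prefix
    (exact matching only) and removes that duplicated suffix.
    Returns (sequence, overlap_length); overlap_length is 0 when no
    overlap of at least `min_overlap` bases is found.
    """
    # An overlap cannot be longer than half of the sequence.
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:-k], k
    return seq, 0


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
    # Simulate an assembly that ran past the start of the circle.
    assembled = genome + genome[:10]
    trimmed, overlap = circularize(assembled, min_overlap=5)
    print(len(assembled), len(trimmed), overlap)
```

A non-zero overlap could also serve as the signal for labelling the result as circular.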
Create function to merge contigs using sliding window at the edges of the sequences
Use this "merging contig" algorithm to merge results from assemblies using: i) multiple k-mers; and/or ii) multiple seeds
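The edge-based merge could start from something like the sketch below. It assumes both contigs are on the same strand and that the overlap is an exact match; `merge_contigs` and `min_overlap` are hypothetical names, and reverse-complement handling and mismatch tolerance are left out.

```python
def merge_contigs(left, right, min_overlap=20):
    """Join two contigs when the end of `left` matches the start of `right`.

    Tries the longest possible overlap first and shrinks it step by step.
    Returns the merged sequence, or None if no exact overlap of at least
    `min_overlap` bases exists.
    """
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            # Keep the shared region only once.
            return left + right[k:]
    return None


if __name__ == "__main__":
    a = "AAACCCGGGTTT"
    b = "GGGTTTACGTAC"
    print(merge_contigs(a, b, min_overlap=4))
```

When merging results from several k-mers or seeds, the same function could be applied pairwise in both orders (`left, right` and `right, left`), keeping the contigs separate whenever it returns `None`.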
Is there a way to use the datasets to assemble nuclear sequences (such as the rRNA), compare them across all samples, and build a phylogenomic tree that adds the nuclear genes present in all species? What about the sequences that would only come in at the tree-building stage: how do we generate them? Should we search for them automatically?
OPTIONAL FLAG: --trim. Automatic trimming (Using Trimmomatic?) of data before it is used as input for MITObim.
Allow script to accept any version of NOVOPlasty installed in $PATH
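One way to approach this, sketched under the assumption that the installed script's file name contains its version (something like `NOVOPlasty<version>.pl`): scan the directories in `$PATH` for file names matching a glob pattern. The helper name `find_in_path` and the pattern are illustrative only.

```python
import fnmatch
import os


def find_in_path(pattern, path=None):
    """Return full paths of files in PATH directories matching a glob pattern."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not os.path.isdir(directory):
            continue
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            # Unreadable directory: skip it instead of aborting the search.
            continue
        for name in names:
            full = os.path.join(directory, name)
            if fnmatch.fnmatch(name, pattern) and os.path.isfile(full):
                matches.append(full)
    return matches


if __name__ == "__main__":
    # Assumed naming scheme; adjust the pattern to the real file names.
    print(find_in_path("NOVOPlasty*.pl"))
```

An empty result would be a natural place to stop with a clear error message before any download starts.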
Improve error catching for NOVOPlasty
Improve error catching for the download of the SRA dataset (could use a bash command and catch the exit status, '$?', to identify the problem)
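If the download is launched from Python rather than bash, the exit status is available as `returncode`, which plays the role of `$?`. The sketch below uses a hypothetical helper, `run_and_report`, and a stand-in command instead of the real download call.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run a command and return (exit_status, stderr_text)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The program itself is missing; 127 is the status a shell
        # reports for "command not found".
        return 127, str(exc)
    return result.returncode, result.stderr.strip()


if __name__ == "__main__":
    # Stand-in for the download command (assumption).
    status, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.stderr.write('no such run'); sys.exit(2)"]
    )
    if status != 0:
        print(f"download failed with status {status}: {err}")
```

Logging the status together with the captured stderr text should make it easier to tell a missing dataset from a network problem.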
Implement parallelism - Download next dataset while the previous is being assembled
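A rough sketch of this overlap using a single background worker: while one dataset is being assembled, the next one is already downloading. The names `pipeline`, `download` and `assemble` are placeholders for whatever functions the script ends up with.

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline(accessions, download, assemble):
    """Assemble datasets in order while the next one downloads in the background."""
    results = []
    if not accessions:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download, accessions[0])
        for i in range(len(accessions)):
            # Wait until the current dataset has finished downloading.
            data = pending.result()
            if i + 1 < len(accessions):
                # Start fetching the next dataset before assembling this one.
                pending = pool.submit(download, accessions[i + 1])
            results.append(assemble(data))
    return results


if __name__ == "__main__":
    # Toy stand-ins for the real download and assembly steps.
    print(pipeline(["A", "B", "C"], lambda acc: acc.lower(), lambda d: d + "_asm"))
```

Threads fit here because both steps mostly wait on external programs or the network. A user-selectable cap on parallel assemblies could follow the same pattern, with a second executor whose `max_workers` comes from a command-line option.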
(Not related) Create a script that adds aliases for simple bioinformatics bash scripts (such as counting the number of reads and converting seqin files) to .bashrc
(Not related) Create GitHub Pages for my projects.