Re-running analysis
mylena-s opened this issue · 4 comments
Hi again Clement!
I hope you are OK.
I am writing to ask you about something. I managed to classify some Trinity contigs that were not classified by dnaPipeTE, and I was wondering if there is a way to re-run some of the analyses carried out by the program without running the whole pipeline again. Specifically, I would like to recalculate the Counts.txt file and produce a new repeat landscape.
I believe it is possible, because I have been reading your code and, if I understood it correctly, the main statistics for all contigs are already calculated, such as the % divergence of each read from its contig and the number of mapping reads and bases. I guess I could do it by adding the reads mapping to this newly classified contig from sorted.reads_vs_unannoted.blast.out to the reads_landscape file and re-running the Rscript (you already explained to me how to do so, hehe).
In the case of the Counts file, I think I could do one of two things: either calculate the new counts by hand, adding the bp of this contig to the corresponding class and subtracting them from "Unclassified", or try to re-run the function "count" that is inside the main python script.
If I follow the second alternative, would I have to move the reads matching the contig from sorted.reads_vs_unannoted.blast.out to sorted.reads_vs_annoted.blast.out and also include the new classification in one_RM_hit_per_Trinity_contigs, or is there a simpler way?
I am sorry if this is a little confusing, but I thought it might interest you because some of these things could help if you still want to develop new checkpoints in the pipeline.
Thanks in advance!!
Mylena
Hi Mylena!
I understand! Yes indeed, we thought about that when we made dnaPipeTE. It is not always ideal, but you can try re-running it with the same output directory. dnaPipeTE should find the Trinity results and skip that step to run RepeatMasker directly. This time, specify your new library with the option -RM_lib (don't forget to merge it with the library you used the first time). For your newly classified references to be recognized, format them as a FASTA file and use the RepeatMasker/Repbase nomenclature (>TENAME#CLASS/SUPERFAMILY).
If this doesn't work the way you intended, or takes too long, let me know and I will see what we can do!
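For reference, the header formatting and library merge described above could be scripted roughly as follows. This is only a minimal sketch, not part of dnaPipeTE: the function names, the contig name and the classification in the example are all hypothetical, and it assumes the contig name is the first whitespace-delimited token of each FASTA header.

```python
def format_header(name, te_class, superfamily=None):
    """Build a RepeatMasker/Repbase-style header: >TENAME#CLASS/SUPERFAMILY."""
    label = f"{te_class}/{superfamily}" if superfamily else te_class
    return f">{name}#{label}"


def relabel_fasta(lines, classification):
    """Rewrite the headers of the contigs listed in `classification`.

    `classification` maps a contig name to a (class, superfamily) tuple.
    Headers of contigs that are not listed are left unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the contig name is the first token after '>'.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if name in classification:
                te_class, superfamily = classification[name]
                line = format_header(name, te_class, superfamily)
        out.append(line)
    return out


def merge_libraries(original_lines, new_lines):
    """Concatenate the original library and the newly formatted records."""
    return [l.rstrip("\n") for l in original_lines] + list(new_lines)


if __name__ == "__main__":
    # Hypothetical contig name and classification, for illustration only.
    new_records = relabel_fasta(
        [">contigA len=8\n", "ACGTACGT\n"],
        {"contigA": ("LTR", "Gypsy")},
    )
    merged = merge_libraries([">OldTE#DNA/hAT\n", "TTTTAAAA\n"], new_records)
    print("\n".join(merged))
```

The merged lines can then be written to a single file and passed to -RM_lib.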
Cheers,
Clément
Hi Clement!
Thanks for the quick answer, and sorry for the delay. Your suggestion worked well. I have one last question. I want to include other sequences in the repeat landscape (the unknown sequences and satellites, for example). I have already written a custom script to make the plot, but I wanted to confirm with you whether I can use the file Annotation/sorted_blast3 to get the similarity of each read to the contig it maps to (and then get the classification from one_RM_hit_per_Trinity_contigs).
Thanks in advance
Cheers
Mylena
Hi Mylena! Yes indeed, you can use this file. You can get the annotations by joining this table with the RepeatMasker output. The left outer join (-a1) will also print reads mapped to unannotated dnaPipeTE contigs:
join -a1 -12 -21 <output>/Annotation/sorted_blast3 <output>/Annotation/one_RM_hit_per_Trinity_contigs -o 1.3,2.4,2.5
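If the join is easier to do inside a custom plotting script, the same left outer join can be sketched in Python. This is only an illustration under a few assumptions: both files are whitespace-delimited, the key is field 2 of sorted_blast3 and field 1 of one_RM_hit_per_Trinity_contigs (as in the command above), and the "NA" placeholder for unannotated contigs is a choice made here, not something the join command prints. The function name and the example rows are hypothetical.

```python
def left_join(blast_lines, rm_lines, missing="NA"):
    """Mimic: join -a1 -12 -21 blast rm -o 1.3,2.4,2.5

    Returns (blast field 3, RM field 4, RM field 5) for every read line.
    Reads whose contig has no RepeatMasker hit get `missing` placeholders.
    """
    annotations = {}
    for line in rm_lines:
        fields = line.split()
        if len(fields) >= 5:
            # Keep the first hit seen for each contig.
            annotations.setdefault(fields[0], (fields[3], fields[4]))

    rows = []
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        contig, value = fields[1], fields[2]
        rm_field4, rm_field5 = annotations.get(contig, (missing, missing))
        rows.append((value, rm_field4, rm_field5))
    return rows


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    blast = ["read1 contigA 95.0", "read2 contigB 88.5"]
    rm = ["contigA x y fieldFour fieldFive"]
    for row in left_join(blast, rm):
        print("\t".join(row))
```

Unlike the join command, this version does not need the two files to be sorted on the key.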
Thanks Clement!
I think that's all!