Happy EDTA users with successful cases
oushujun opened this issue · 10 comments
Hi all,
Just an update on the testing results: the new TIR release appears to resolve this issue.
- Please create a new environment for the EDTA 20190802 release.
- Follow the steps Shujun provided:
- EDTA_raw
- EDTA_processF
- EDTA -step final
- Time and resources for my plant genome (336 Mb, 58% repeats as estimated by GenomeScope, 24-core machine):
Step | maxvmem | time(h) | raw_fa size |
---|---|---|---|
Helitron | 7.914GB | 2.352222 | 1.3Mb |
MITE | 1.529GB | 1.815278 | 4.9kb |
TIR | 42.127GB | 4.895556 | 20Mb |
LTR | 19.049GB | 1.417222 | 2.5Mb |
EDTA_Final | 19.388GB | 19.42389 | 19Mb |
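The decimal-hour values in the table are easier to compare once converted to hours, minutes, and seconds. The sketch below does that conversion; the helper name `to_hms` and the dictionary are illustrative and not part of EDTA. The summed total assumes the steps ran one after another, which the post does not state.

```python
# Per-step times in decimal hours, copied from the table above.
STEP_HOURS = {
    "Helitron": 2.352222,
    "MITE": 1.815278,
    "TIR": 4.895556,
    "LTR": 1.417222,
    "EDTA_Final": 19.42389,
}


def to_hms(hours: float) -> str:
    """Convert decimal hours to H:MM:SS, rounded to the nearest second."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


def total_hours(step_hours: dict) -> float:
    """Sum of all steps; an upper bound on wall time if steps overlapped."""
    return sum(step_hours.values())


if __name__ == "__main__":
    for step, hours in STEP_HOURS.items():
        print(f"{step:<12}{to_hms(hours)}")
    print(f"{'Sum':<12}{to_hms(total_hours(STEP_HOURS))}")
```

On these numbers the final step accounts for roughly two thirds of the summed time.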
Thanks for developing this.
Bests,
Zhigui
Originally posted by @baozg in #4 (comment)
Thanks for developing EDTA, and this is for your reference.
Fish genome ~480 Mb, EDTA.pl (commit 82b16f6) job finished in 7h.
Linux version 3.10.0-957.27.2.el7.x86_64 (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36)), run with -t 24
Step | raw_fa size |
---|---|
TIR | 32.3 Mb |
MITE | 20 Kb |
LTR | 988 Kb |
Helitron | 1.7 Mb |
TElib | 30.4 Mb |
Cheers,
Qiushi
Hi
Finally, my rerun without splitting EDTA_raw.pl into subsets finished too :)
It's an amphidiploid plant genome, about 1 Gb in size, run with 15 threads at commit 757a96d (the one where MITEHunter is turned off).
Run sequentially (without splitting up the EDTA_raw.pl steps), it took 242.8 hours (10.1 days).
Here's the whole log so you can see which steps took how long:
Wed Sep 4 21:07:28 AWST 2019 EDTA_raw: Check files and dependencies, prepare working directories.
Wed Sep 4 21:07:28 AWST 2019 Start to find LTR candidates.
Wed Sep 4 21:07:28 AWST 2019 Identify LTR retrotransposon candidates from scratch.
Tue Sep 10 00:34:18 AWST 2019 Finish finding LTR candidates.
Tue Sep 10 00:34:18 AWST 2019 Start to find TIR candidates.
Tue Sep 10 00:34:18 AWST 2019 Identify TIR candidates from scratch.
Species: others
rm: cannot remove `./TIR-Learner/*': No such file or directory
Finish finding TIR candidates.
Fri Sep 13 09:13:15 AWST 2019 Start to find MITE candidates.
Fri Sep 13 09:13:15 AWST 2019 Identify MITE candidates from scratch.
Fri Sep 13 09:13:15 AWST 2019 Warning: Because MITE-Hunter is too slow and only contribute limited new TIR candidates, it is taken down temporary until a better solution is found.
Fri Sep 13 09:13:15 AWST 2019 Finish finding MITE candidates.
Fri Sep 13 09:13:15 AWST 2019 Start to find Helitron candidates.
Fri Sep 13 09:13:15 AWST 2019 Identify Helitron candidates from scratch.
Fri Sep 13 23:32:36 AWST 2019 Finish finding Helitron candidates.
Fri Sep 13 23:32:36 AWST 2019 Execution of EDTA_raw.pl is finished!
Sun Sep 15 07:17:49 AWST 2019 EDTA basic and advcanced filters finished.
Sun Sep 15 07:17:49 AWST 2019 Perform EDTA final steps to generate a non-redundant comprehensive TE library:
Skip the RepeatModeler step (-sensitive 0). Run EDTA.pl -step final -sensitive 1 if you want to use RepeatModeler.
Sun Sep 15 14:01:06 AWST 2019 EDTA final stage finished! Check out the final EDTA TE library: ragoo.fasta.EDTA.TElib.fa
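Per-stage durations can be read off the timestamps in this log. The sketch below parses the `date`-style prefix and subtracts start from end. The function names are illustrative, not part of EDTA. It assumes an English locale for month names and drops the timezone token, which is safe here only because every line uses the same zone. The TIR stage is measured up to the start of the MITE step, since the "Finish finding TIR candidates" line has no timestamp of its own in the posted log.

```python
from datetime import datetime

# A few timestamped lines copied from the log above.
LTR_START = "Wed Sep 4 21:07:28 AWST 2019 Start to find LTR candidates."
LTR_END = "Tue Sep 10 00:34:18 AWST 2019 Finish finding LTR candidates."
TIR_START = "Tue Sep 10 00:34:18 AWST 2019 Start to find TIR candidates."
MITE_START = "Fri Sep 13 09:13:15 AWST 2019 Start to find MITE candidates."
HEL_START = "Fri Sep 13 09:13:15 AWST 2019 Start to find Helitron candidates."
HEL_END = "Fri Sep 13 23:32:36 AWST 2019 Finish finding Helitron candidates."


def parse_line(line: str):
    """Split 'Wed Sep 4 21:07:28 AWST 2019 message' into (datetime, message)."""
    parts = line.split(maxsplit=6)
    if len(parts) < 6:
        raise ValueError(f"no timestamp prefix in: {line!r}")
    _weekday, month, day, clock, _tz, year = parts[:6]
    message = parts[6] if len(parts) > 6 else ""
    # The timezone token is ignored; all lines share one zone.
    stamp = datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")
    return stamp, message


def hours_between(start_line: str, end_line: str) -> float:
    """Elapsed hours between the timestamps of two log lines."""
    start, _ = parse_line(start_line)
    end, _ = parse_line(end_line)
    return (end - start).total_seconds() / 3600


if __name__ == "__main__":
    print(f"LTR      {hours_between(LTR_START, LTR_END):7.2f} h")
    print(f"TIR      {hours_between(TIR_START, MITE_START):7.2f} h")
    print(f"Helitron {hours_between(HEL_START, HEL_END):7.2f} h")
```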
Sizes:
Step | raw_fa size |
---|---|
TIR | 55M |
MITE | 55M (copy of TIR due to no MITEHunter) |
LTR | 12M |
Helitron | 23M |
TElib | 46M |
The final TElib has 22,921 sequences.
A final RepeatMasker run with the final TElib reports about 75% of the genome as repeats, which makes me very happy :) I should check how many of these repeats are known in Dfam and Repbase. My guess is not that many, since a previous run with an older Repbase version gave me a much lower repeat count.
MUCH LATER EDIT:
I've now rerun the final stage with -sensitive 1, which turns on RepeatModeler, which takes forever. With the same genome as above:
Mon Sep 30 17:04:54 AWST 2019 Perform EDTA final steps to generate a non-redundant comprehensive TE library: Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
Sat Oct 5 14:51:32 AWST 2019 EDTA final stage finished! Check out the final EDTA TE library: ragoo.fasta.EDTA.TElib.fa
So yeah, adding this step to my pipeline added 5 days of runtime! What this did is add 4,186 repeats of type 'unknown' to my TElib.fa, but these are all tiny (the total filesize increased from 46MB to 47MB). @oushujun is this a problem on my end? Should RepeatModeler have assigned classes to those repeats?
(EDTA) v1.5
Successfully finished a maize genome:
TElib: 55Mb
Thanks Shujun.
Thank you so much for making this awesome tool! It makes TE annotation so much easier and I will for sure use it for my next genomes.
I tried it on a bird (~1.2Gb) with -sensitive 0 and a custom protein library and got this result:
step | threads | Real time | User time | Number |
---|---|---|---|---|
Raw Helitron | 8 | 605m16s | 521m34s | 146 |
Raw TIR | 8 | 1425m38s | 8438m38s | 5108 |
Raw LTR | 8 | 49.51s | 81m16s | 157 |
Rest of pipeline | 23 | 79m27s | 1334m50s | 5278 |
(It probably would have been faster if I had given more threads to the TIR search instead of splitting equally between the three searches!)
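One way to see how well each step used its threads is to divide user time by real time, which gives a rough count of busy cores. The sketch below does this for three rows of the table. The function names are illustrative, and user time excludes system time, so the figures are approximate. The Raw LTR row is left out because its real time is posted in a different format.

```python
import re

# (step, threads, real time, user time), copied from the table above.
RUNS = [
    ("Raw Helitron", 8, "605m16s", "521m34s"),
    ("Raw TIR", 8, "1425m38s", "8438m38s"),
    ("Rest of pipeline", 23, "79m27s", "1334m50s"),
]


def to_seconds(text: str) -> float:
    """Parse a duration written as '<minutes>m<seconds>s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cores(real: str, user: str) -> float:
    """Average number of cores kept busy: user CPU time / wall-clock time."""
    return to_seconds(user) / to_seconds(real)


if __name__ == "__main__":
    for step, threads, real, user in RUNS:
        cores = busy_cores(real, user)
        print(f"{step:<18}{cores:6.2f} of {threads} threads ({cores / threads:.0%})")
```

On these numbers the Helitron search kept less than one core busy while the TIR search kept about six of its eight busy, which fits the suggestion that the TIR step could have used the spare threads.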
Thank you,
Else
I recently finished one plant genome of about 1.8 Gb.
But I am a little confused by the RepeatMasker summary statistics. They only report LTR and DNA element content, and I cannot see content for any other element classes, such as SINEs.
file name: STH.fa
sequences: 3037
total length: 1601564480 bp (1584403095 bp excl N/X-runs)
GC level: 41.61 %
bases masked: 1309747987 bp ( 81.78 %)
Class | Number of elements* | Length occupied | Percentage of sequence |
---|---|---|---|
SINEs: | 0 | 0 bp | 0.00 % |
ALUs | 0 | 0 bp | 0.00 % |
MIRs | 0 | 0 bp | 0.00 % |
LINEs: | 0 | 0 bp | 0.00 % |
LINE1 | 0 | 0 bp | 0.00 % |
LINE2 | 0 | 0 bp | 0.00 % |
L3/CR1 | 0 | 0 bp | 0.00 % |
LTR elements: | 1584413 | 1074016988 bp | 67.06 % |
ERVL | 0 | 0 bp | 0.00 % |
ERVL-MaLRs | 0 | 0 bp | 0.00 % |
ERV_classI | 0 | 0 bp | 0.00 % |
ERV_classII | 0 | 0 bp | 0.00 % |
DNA elements: | 832180 | 278509658 bp | 17.39 % |
hAT-Charlie | 0 | 0 bp | 0.00 % |
TcMar-Tigger | 0 | 0 bp | 0.00 % |
Unclassified: | 78986 | 11993249 bp | 0.75 % |
Total interspersed repeats: | | 1364519895 bp | 85.20 % |
Small RNA: | 0 | 0 bp | 0.00 % |
Satellites: | 0 | 0 bp | 0.00 % |
Simple repeats: | 0 | 0 bp | 0.00 % |
Low complexity: | 0 | 0 bp | 0.00 % |

\* most repeats fragmented by insertions or deletions have been counted as one element
The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_3.0, RepBase-20170127
run with rmblastn version 2.9.0+
The query was compared to classified sequences in "STH.fa.EDTA.TElib.fa"
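The percentages in this summary can be reproduced from the base-pair counts, which is a quick way to check how the three non-zero classes add up to the interspersed-repeat total. The sketch below is illustrative only; the names are not part of RepeatMasker, and it assumes each percentage is taken against the total length including N/X runs, which matches the posted figures.

```python
# Figures copied from the RepeatMasker summary above.
TOTAL_LENGTH_BP = 1_601_564_480
BASES_MASKED_BP = 1_309_747_987
CLASS_BP = {
    "LTR elements": 1_074_016_988,
    "DNA elements": 278_509_658,
    "Unclassified": 11_993_249,
}


def percent_of_sequence(bp: int, total: int = TOTAL_LENGTH_BP) -> float:
    """Percentage of the total sequence length, rounded to two decimals."""
    return round(100 * bp / total, 2)


def interspersed_total(class_bp: dict) -> int:
    """Sum of the per-class lengths reported as interspersed repeats."""
    return sum(class_bp.values())


if __name__ == "__main__":
    for name, bp in CLASS_BP.items():
        print(f"{name:<28}{percent_of_sequence(bp):6.2f} %")
    total = interspersed_total(CLASS_BP)
    print(f"{'Total interspersed repeats':<28}{percent_of_sequence(total):6.2f} %")
    print(f"{'Bases masked':<28}{percent_of_sequence(BASES_MASKED_BP):6.2f} %")
```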
Wed May 6 11:50:19 EDT 2020 Dependency checking:
All passed!
Wed May 6 11:50:45 EDT 2020 Obtain raw TE libraries using various structure-based programs:
Thu May 7 03:10:34 EDT 2020 Obtain raw TE libraries finished.
Thu May 7 03:10:34 EDT 2020 Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:
Thu May 7 06:13:03 EDT 2020 EDTA advcance filtering finished.
Thu May 7 06:13:03 EDT 2020 Perform EDTA final steps to generate a non-redundant comprehensive TE library:
Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
RepeatModeler is finished, but no consensi.fa.classified files found.
Skipping the CDS cleaning step (-cds [File]) since no CDS file is provided.
Sun May 10 02:08:29 EDT 2020 EDTA final stage finished! Check out the final EDTA TE library: STH.fa.EDTA.TElib.fa
Sun May 10 02:08:29 EDT 2020 Perform post-EDTA analysis for whole-genome annotation:
Sun May 10 09:50:57 EDT 2020 TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 79.16%): STH.fa.EDTA.TEanno.gff
Low-threshold TE masking for MAKER gene annotation (masked: 48.06%): STH.fa.MAKER.masked
Sun May 10 09:51:05 EDT 2020 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):
Mon May 18 06:27:53 EDT 2020 Evaluation of TE annotation finished! Check out these files:
Overall: STH.fa.EDTA.TE.fa.stat.all.sum
Nested: STH.fa.EDTA.TE.fa.stat.nested.sum
Non-nested: STH.fa.EDTA.TE.fa.stat.redun.sum
Confusion matrix of STH.fa.EDTA.TE.fa.stat for the all category
DNA/D DNA/DTA DNA/DTC DNA/DTH DNA/DTM DNA/DTT DNA/Heeaning DNA/HelitrTM DNA/Helitron DNAa Gypsy L LT LT/Copia LTR/ LTR/Cocovering LTR/Copia LTR/Copig LTR/Coping LTR/G LTR/Gypnknown LTR/Gypof LTR/Gypsy LTR/ering LTR/u LTR/uly LTR/unTR/unknown LTR/ung LTR/unking LTR/unkn LTR/unknowR/unknown LTR/unknown LTR/unknownring LTTR/Gypsy LTy Lof Ly MITE/DTA MITE/DTC MITE/DTH MITE/DTM MITE/DTT covering Misclas_rate
DNA/D 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/DTA 0 115616 10068 2813 17086 57 0 0 18704 0 0 0 0 0 0 0 10275 0 0 0 0 0 6465 0 0 0 0 0 0 0 0 4218 0 0 0 0 0 2144 159 44 2403 2 0 0.3917
DNA/DTC 0 10936 61128 1089 7734 413 0 0 9215 0 0 0 0 0 0 0 3738 0 0 0 0 0 1445 0 0 0 0 0 0 0 0 2208 0 0 0 0 0 172 937 728 2618 16 0 0.4029
DNA/DTH 0 2528 1434 29335 3219 11 0 0 5818 0 0 0 0 0 0 0 1661 0 0 0 0 0 561 0 0 0 0 0 0 0 0 1438 0 0 0 0 0 275 1 699 471 0 0 0.3818
DNA/DTM 0 12351 6859 3702 348436 121 0 0 18988 0 0 0 0 0 0 0 25783 0 0 0 0 0 9673 0 0 0 0 0 0 0 0 6997 0 0 0 0 0 613 77 855 14894 59 0 0.2247
DNA/DTT 0 110 518 25 222 4388 0 0 202 0 0 0 0 0 0 0 138 0 0 0 0 0 3 0 0 0 0 0 0 0 0 709 0 0 0 0 0 0 0 2 856 4 0 0.3886
DNA/Heeaning 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/HelitrTM 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/Helitron 0 22063 11264 8713 96179 137 0 0 830187 0 0 0 0 0 0 0 154178 0 0 0 0 0 31254 0 0 0 0 0 0 0 0 20428 0 0 0 0 0 86 99 707 104423 93 0 0.3513
DNAa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
Gypsy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LT 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LT/Copia 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Cocovering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Copia 0 9424 17496 2292 67110 66 0 0 29216 0 0 0 0 0 0 0 972675 0 0 0 0 0 40664 0 0 0 0 0 0 0 0 128559 0 0 0 0 0 1222 8 317 955 4 0 0.2341
LTR/Copig 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Coping 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypnknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypof 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypsy 0 10301 1774 622 26632 6 0 0 14217 0 0 0 0 0 0 0 49642 0 0 0 0 0 2675171 0 0 0 0 0 0 1 0 1111152 0 0 0 0 0 155 243 100 1210 1 0 0.3125
LTR/ering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/u 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/uly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unTR/unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/ung 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unking 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unkn 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unknowR/unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unknown 0 6511 14195 3411 34422 135 0 0 28722 0 0 0 0 0 0 0 148343 0 0 0 0 0 799980 0 0 0 0 0 0 0 0 2486668 0 0 0 0 0 456 16 316 940 136 0 0.2944
LTR/unknownring 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTTR/Gypsy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
Lof 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
Ly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
MITE/DTA 0 2087 149 431 682 0 0 0 276 0 0 0 0 0 0 0 741 0 0 0 0 0 80 0 0 0 0 0 0 0 0 349 0 0 0 0 0 14944 0 14 503 0 0 0.2622
MITE/DTC 0 75 1473 0 79 0 0 0 105 0 0 0 0 0 0 0 23 0 0 0 0 0 103 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 6397 0 3 0 0 0.2264
MITE/DTH 0 41 226 930 666 1 0 0 508 0 0 0 0 0 0 0 129 0 0 0 0 0 30 0 0 0 0 0 0 0 0 79 0 0 0 0 0 6 0 5830 278 0 0 0.3317
MITE/DTM 0 998 1429 212 3682 232 0 0 534 0 0 0 0 0 0 0 598 0 0 0 0 0 400 0 0 0 0 0 0 0 0 854 0 0 0 0 0 362 2 511 33915 0 0 0.2244
MITE/DTT 0 1 0 0 43 73 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 1 0 0 0 327 0 0.2937
covering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
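The Misclas_rate column appears to be one minus the diagonal count divided by the row total. That is inferred from the posted numbers rather than from EDTA's code, and it reproduces the values for the two rows checked below. The function name and dictionaries are illustrative; only the non-zero entries of each row are listed.

```python
# Non-zero counts from two rows of the matrix above, keyed by column label.
DTA_ROW = {
    "DNA/DTA": 115616, "DNA/DTC": 10068, "DNA/DTH": 2813, "DNA/DTM": 17086,
    "DNA/DTT": 57, "DNA/Helitron": 18704, "LTR/Copia": 10275,
    "LTR/Gypsy": 6465, "LTR/unknown": 4218, "MITE/DTA": 2144,
    "MITE/DTC": 159, "MITE/DTH": 44, "MITE/DTM": 2403, "MITE/DTT": 2,
}
MITE_DTC_ROW = {
    "DNA/DTA": 75, "DNA/DTC": 1473, "DNA/DTM": 79, "DNA/Helitron": 105,
    "LTR/Copia": 23, "LTR/Gypsy": 103, "LTR/unknown": 11,
    "MITE/DTC": 6397, "MITE/DTM": 3,
}


def misclassification_rate(row: dict, label: str) -> float:
    """1 - (count on the diagonal) / (row total), for the row named `label`."""
    total = sum(row.values())
    if total == 0:
        raise ValueError("row has no counts")
    return 1 - row.get(label, 0) / total


if __name__ == "__main__":
    print(f"DNA/DTA  {misclassification_rate(DTA_ROW, 'DNA/DTA'):.4f}")
    print(f"MITE/DTC {misclassification_rate(MITE_DTC_ROW, 'MITE/DTC'):.4f}")
```

Under this reading, the rows with a single count of 1 off the diagonal get a rate of 1.0000, which is what the table shows for the malformed labels.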
Thank you for the awesome tool. Successfully finished three plant genomes (~3-3.5 Gb):
Step | time | size |
---|---|---|
Raw LTR | 114-128h | 542-695Mb |
Raw TIR | 56-99h | 743-790kb |
Raw Helitron | 36-60h | 186-196Mb |
Rest of pipeline | 52-53h | 56-66Mb |
The raw steps were run with EDTA v1.7.8 from bioconda, the rest with EDTA v1.8.5. All were run with 48 threads (Intel Platinum 8160F Skylake @ 2.1 GHz, RAM=192000M), with --sensitive 0 in the FINAL step and a CDS file provided.
Cheers,
Kaichi
Hi,
Thanks for developing such cool software. I have successfully annotated the TEs of an insect genome.
I would like to know whether the library constructed by EDTA can be used as the consensus sequences for calculating the age of individual TEs.
Thank you,
Yang yi
Thank you all for using and promoting EDTA since its early stages. By now the program has gained some popularity, and studies using EDTA have been appearing online, so this thread can be retired. Below are some studies citing EDTA:
Homologous chromosomes in asexual rotifer Adineta vaga suggest automixis
The transposable elements of the Drosophila serrata reference panel
Chromosome-level Genome Assembly of a Regenerable Maize Inbred Line A188
Dear all, Shujun
Thank you for the awesome tool. I successfully finished one plant (Rhododendron) genome (~850M):
Step | raw_fa size | Percent |
---|---|---|
TIR | 143M | 15.7% |
LTR | 660M | 43.2% |
Helitron | 15M | 1.6% |
TElib | 16.9M | 60.96%(Total) |
Command:
EDTA.pl --genome genome.fa --anno 1 --threads 40 --step all --species others
I don't have a clear per-step timing breakdown because the run was interrupted several times, but the total time was about 20 h.
Sincerely,
Wen