oushujun/EDTA

Happy EDTA users with successful cases

oushujun opened this issue · 10 comments

Hi all,

Just an update on the testing results: it seems the new TIR release resolves this issue.

  1. Please install a new env for the EDTA 20190802 release.
  2. Follow the steps Shujun provided:
  • EDTA_raw
  • EDTA_processF
  • EDTA -step final
  3. Time and resource usage for my plant genome (336M plant genome, 58% repeats as estimated by GenomeScope, 24-core machine):
| Step | maxvmem | time (h) | raw_fa size |
|---|---|---|---|
| Helitron | 7.914GB | 2.352222 | 1.3Mb |
| MITE | 1.529GB | 1.815278 | 4.9kb |
| TIR | 42.127GB | 4.895556 | 20Mb |
| LTR | 19.049GB | 1.417222 | 2.5Mb |
| EDTA_Final | 19.388GB | 19.42389 | 19Mb |

Thanks for developing this.

Bests,
Zhigui

Originally posted by @baozg in #4 (comment)

Thanks for developing EDTA, and this is for your reference.

Fish genome ~480 Mb, EDTA.pl (commit 82b16f6) job finished in 7h.
Linux version 3.10.0-957.27.2.el7.x86_64 (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36)), run with -t 24.

| Step | raw_fa size |
|---|---|
| TIR | 32.3 Mb |
| MITE | 20 Kb |
| LTR | 988 Kb |
| Helitron | 1.7 Mb |
| TElib | 30.4 Mb |

Cheers,
Qiushi

Hi

Finally, my rerun without splitting EDTA_raw.pl into subsets finished too :)

It's an amphidiploid plant genome, about 1GB in size, run with 15 threads, commit 757a96d (the one where MITEHunter is turned off)

It ran for 242.8 hours (10.1 days) if run sequentially (so no splitting up of EDTA_raw.pl steps)

Here's the whole log so you can see which steps took how long:

Wed Sep  4 21:07:28 AWST 2019   EDTA_raw: Check files and dependencies, prepare working directories.

Wed Sep  4 21:07:28 AWST 2019   Start to find LTR candidates.

Wed Sep  4 21:07:28 AWST 2019   Identify LTR retrotransposon candidates from scratch.

Tue Sep 10 00:34:18 AWST 2019   Finish finding LTR candidates.

Tue Sep 10 00:34:18 AWST 2019   Start to find TIR candidates.

Tue Sep 10 00:34:18 AWST 2019   Identify TIR candidates from scratch.

Species: others
rm: cannot remove `./TIR-Learner/*': No such file or directory
Finish finding TIR candidates.

Fri Sep 13 09:13:15 AWST 2019   Start to find MITE candidates.

Fri Sep 13 09:13:15 AWST 2019   Identify MITE candidates from scratch.

Fri Sep 13 09:13:15 AWST 2019   Warning: Because MITE-Hunter is too slow and only contribute limited new TIR candidates, it is taken down temporary until a better solution is found.

Fri Sep 13 09:13:15 AWST 2019   Finish finding MITE candidates.

Fri Sep 13 09:13:15 AWST 2019   Start to find Helitron candidates.

Fri Sep 13 09:13:15 AWST 2019   Identify Helitron candidates from scratch.

Fri Sep 13 23:32:36 AWST 2019   Finish finding Helitron candidates.

Fri Sep 13 23:32:36 AWST 2019   Execution of EDTA_raw.pl is finished!
Sun Sep 15 07:17:49 AWST 2019   EDTA basic and advcanced filters finished.

Sun Sep 15 07:17:49 AWST 2019   Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Skip the RepeatModeler step (-sensitive 0).
Run EDTA.pl -step final -sensitive 1 if you want to use RepeatModeler.

Sun Sep 15 14:01:06 AWST 2019   EDTA final stage finished! Check out the final EDTA TE library: ragoo.fasta.EDTA.TElib.fa
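To read per-step runtimes off a log like the one above, the timestamps can be subtracted pairwise. The following is a minimal Python sketch, not part of EDTA; the helper names `parse_stamp` and `hours_between` are made up for illustration, and it assumes every line starts with a `date`-style stamp in a single, constant time zone. Because the "Finish finding TIR candidates." line carries no timestamp, the MITE start line is used as the end of the TIR step.

```python
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

def parse_stamp(line):
    """Parse a leading stamp such as 'Wed Sep  4 21:07:28 AWST 2019'.

    The time zone token is ignored (assumed constant across the log).
    """
    _weekday, month, day, clock, _tz, year = line.split()[:6]
    hour, minute, second = (int(x) for x in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)

def hours_between(start_line, end_line):
    """Elapsed wall-clock hours between two log lines."""
    delta = parse_stamp(end_line) - parse_stamp(start_line)
    return delta.total_seconds() / 3600

ltr_start = "Wed Sep  4 21:07:28 AWST 2019   Start to find LTR candidates."
ltr_end = "Tue Sep 10 00:34:18 AWST 2019   Finish finding LTR candidates."
mite_start = "Fri Sep 13 09:13:15 AWST 2019   Start to find MITE candidates."
hel_end = "Fri Sep 13 23:32:36 AWST 2019   Finish finding Helitron candidates."

print(f"LTR:      {hours_between(ltr_start, ltr_end):.2f} h")
print(f"TIR:      {hours_between(ltr_end, mite_start):.2f} h")
print(f"Helitron: {hours_between(mite_start, hel_end):.2f} h")
```

On this log, the LTR search comes out at roughly 123 h, the TIR search at roughly 81 h, and the Helitron search at roughly 14 h.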

Sizes:

| Step | raw_fa size |
|---|---|
| TIR | 55M |
| MITE | 55M (copy of TIR due to no MITEHunter) |
| LTR | 12M |
| Helitron | 23M |
| TElib | 46M |

The final TElib has 22,921 sequences.

A final RepeatMasker run with the final TElib masks about 75% of the genome as repeats, which makes me very happy :) I should check how many of these repeats are known within Dfam and Repbase. My guess is not that many, as a previous run with an older Repbase version gave me a much lower repeat count.

MUCH LATER EDIT:

I've now rerun the final stage with -sensitive 1, which turns on RepeatModeler, which takes forever. With the same genome as above:

Mon Sep 30 17:04:54 AWST 2019   Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

Sat Oct  5 14:51:32 AWST 2019   EDTA final stage finished! Check out the final EDTA TE library: ragoo.fasta.EDTA.TElib.fa 

So yeah, adding this step to my pipeline added 5 days of runtime! What this did is add 4,186 repeats of type 'unknown' to my TElib.fa, but these are all tiny (the total filesize increased from 46MB to 47MB). @oushujun is this a problem on my end? Should RepeatModeler have assigned classes to those repeats?
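A quick way to see how many library entries fall into each class (for example, how many are 'unknown') is to tally the FASTA headers. This is a minimal Python sketch under one assumption: headers follow the RepeatMasker library convention `>name#Class/Superfamily`. The function name `count_classes` and the sample sequence names are illustrative only.

```python
from collections import Counter

def count_classes(fasta_lines):
    """Count sequences per classification, read from '>name#Class/Superfamily'."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].split()[0]        # drop '>' and any trailing description
        _name, _sep, cls = header.partition("#")
        counts[cls or "no_class"] += 1      # headers without '#' are counted separately
    return counts

# Illustrative headers only; real libraries are read from a file, e.g.
# with open("ragoo.fasta.EDTA.TElib.fa") as handle: count_classes(handle)
example = [
    ">TE_00000001#LTR/Gypsy", "ACGTACGT",
    ">TE_00000002#DNA/Helitron", "ACGT",
    ">rnd-1_family-1#Unknown", "ACGTAC",
    ">TE_00000003#LTR/Gypsy", "ACG",
]
for cls, n in count_classes(example).most_common():
    print(cls, n)
```

Comparing the tallies before and after the -sensitive 1 run shows directly which classes the extra sequences landed in.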

(EDTA) v1.5
Successfully finished a maize genome:
TElib: 55Mb
Thanks Shujun.

Thank you so much for making this awesome tool! It makes TE annotation so much easier and I will for sure use it for my next genomes.

I tried it on a bird (~1.2Gb) with -sensitive 0 and a custom protein library and got this result:

| Step | Threads | Real time | User time | Number |
|---|---|---|---|---|
| Raw Helitron | 8 | 605m16s | 521m34s | 146 |
| Raw TIR | 8 | 1425m38s | 8438m38s | 5108 |
| Raw LTR | 8 | 49.51s | 81m16s | 157 |
| Rest of pipeline | 23 | 79m27 | 1334m50s | 5278 |

(It probably would have been faster if I had given more threads to the TIR search instead of splitting equally between the three searches!)
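One rough way to judge how well each step used its threads is the ratio of user time to real time, which approximates the average number of busy cores. Below is a minimal Python sketch using the Helitron and TIR rows above; `to_seconds` and `busy_cores` are hypothetical helper names, and the ratio ignores system time and I/O waits.

```python
import re

def to_seconds(text):
    """Convert durations such as '1425m38s' or '49.51s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s?", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

def busy_cores(real, user):
    """Approximate average number of busy cores: user time / real time."""
    return to_seconds(user) / to_seconds(real)

# (real, user) pairs taken from the rows reported above; each step had 8 threads.
steps = {
    "Raw Helitron": ("605m16s", "521m34s"),
    "Raw TIR": ("1425m38s", "8438m38s"),
}
for name, (real, user) in steps.items():
    print(f"{name}: {busy_cores(real, user):.2f} of 8 threads busy on average")
```

By this measure the TIR search kept close to six of its eight threads busy while the Helitron search used less than one on average, which is consistent with the guess that TIR would have benefited most from extra threads.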

Thank you,
Else

I recently finished one plant genome of about 1.8 Gb.
However, I am a little confused by the RepeatMasker summary statistics. It only reports LTR and DNA element content, and I cannot see the content of any other element classes, such as SINEs.

file name: STH.fa
sequences: 3037
total length: 1601564480 bp (1584403095 bp excl N/X-runs)
GC level: 41.61 %
bases masked: 1309747987 bp ( 81.78 %)

           number of      length   percentage
           elements*    occupied  of sequence

SINEs: 0 0 bp 0.00 %
ALUs 0 0 bp 0.00 %
MIRs 0 0 bp 0.00 %

LINEs: 0 0 bp 0.00 %
LINE1 0 0 bp 0.00 %
LINE2 0 0 bp 0.00 %
L3/CR1 0 0 bp 0.00 %

LTR elements: 1584413 1074016988 bp 67.06 %
ERVL 0 0 bp 0.00 %
ERVL-MaLRs 0 0 bp 0.00 %
ERV_classI 0 0 bp 0.00 %
ERV_classII 0 0 bp 0.00 %

DNA elements: 832180 278509658 bp 17.39 %
hAT-Charlie 0 0 bp 0.00 %
TcMar-Tigger 0 0 bp 0.00 %

Unclassified: 78986 11993249 bp 0.75 %

Total interspersed repeats:1364519895 bp 85.20 %

Small RNA: 0 0 bp 0.00 %

Satellites: 0 0 bp 0.00 %
Simple repeats: 0 0 bp 0.00 %
Low complexity: 0 0 bp 0.00 %

* most repeats fragmented by insertions or deletions
  have been counted as one element

The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_3.0, RepBase-20170127

run with rmblastn version 2.9.0+
The query was compared to classified sequences in "STH.fa.EDTA.TElib.fa"
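For a finer breakdown than the summary table gives, the per-hit RepeatMasker `.out` file can be tallied by its class/family column. This is a minimal Python sketch, assuming the standard whitespace-separated `.out` layout in which field 6 and field 7 are the query begin and end and field 11 is the repeat class/family. The function name `bp_per_class` and the sample rows are illustrative, and overlapping hits are counted more than once, so totals can exceed the masked length.

```python
from collections import defaultdict

def bp_per_class(out_lines):
    """Sum annotated query bases per class/family from RepeatMasker .out rows."""
    totals = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; this skips headers and blank lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        totals[fields[10]] += end - begin + 1
    return dict(totals)

# Illustrative rows only (hypothetical sequence and TE names).
sample = [
    "   SW   perc perc perc  query     position in query    matching  repeat       position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left) ID",
    "",
    " 1200 10.0 0.0 0.0 chr1 101 300 (9700) + TE_00000001 LTR/Gypsy 1 200 (0) 1",
    "  800  5.0 0.0 0.0 chr1 501 600 (9400) C TE_00000002 DNA/Helitron (0) 100 1 2",
    "  900  5.0 0.0 0.0 chr1 701 750 (9250) + TE_00000001 LTR/Gypsy 1 50 (150) 3",
]
for cls, bp in sorted(bp_per_class(sample).items()):
    print(cls, bp)
```

Classes that never appear in the tally are simply absent from the library used for masking.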

Wed May 6 11:50:19 EDT 2020 Dependency checking:
All passed!
Wed May 6 11:50:45 EDT 2020 Obtain raw TE libraries using various structure-based programs:
Thu May 7 03:10:34 EDT 2020 Obtain raw TE libraries finished.

Thu May 7 03:10:34 EDT 2020 Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:

Thu May 7 06:13:03 EDT 2020 EDTA advcance filtering finished.

Thu May 7 06:13:03 EDT 2020 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

			Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

			RepeatModeler is finished, but no consensi.fa.classified files found.

			Skipping the CDS cleaning step (-cds [File]) since no CDS file is provided.

Sun May 10 02:08:29 EDT 2020 EDTA final stage finished! Check out the final EDTA TE library: STH.fa.EDTA.TElib.fa
Sun May 10 02:08:29 EDT 2020 Perform post-EDTA analysis for whole-genome annotation:

Sun May 10 09:50:57 EDT 2020 TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 79.16%): STH.fa.EDTA.TEanno.gff
Low-threshold TE masking for MAKER gene annotation (masked: 48.06%): STH.fa.MAKER.masked

Sun May 10 09:51:05 EDT 2020 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):

Mon May 18 06:27:53 EDT 2020 Evaluation of TE annotation finished! Check out these files:

			Overall: STH.fa.EDTA.TE.fa.stat.all.sum
			Nested: STH.fa.EDTA.TE.fa.stat.nested.sum
			Non-nested: STH.fa.EDTA.TE.fa.stat.redun.sum

Confusion matrix of STH.fa.EDTA.TE.fa.stat for the all category
DNA/D DNA/DTA DNA/DTC DNA/DTH DNA/DTM DNA/DTT DNA/Heeaning DNA/HelitrTM DNA/Helitron DNAa Gypsy L LT LT/Copia LTR/ LTR/Cocovering LTR/Copia LTR/Copig LTR/Coping LTR/G LTR/Gypnknown LTR/Gypof LTR/Gypsy LTR/ering LTR/u LTR/uly LTR/unTR/unknown LTR/ung LTR/unking LTR/unkn LTR/unknowR/unknown LTR/unknown LTR/unknownring LTTR/Gypsy LTy Lof Ly MITE/DTA MITE/DTC MITE/DTH MITE/DTM MITE/DTT covering Misclas_rate
DNA/D 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/DTA 0 115616 10068 2813 17086 57 0 0 18704 0 0 0 0 0 0 0 10275 0 0 0 0 0 6465 0 0 0 0 0 0 0 0 4218 0 0 0 0 0 2144 159 44 2403 2 0 0.3917
DNA/DTC 0 10936 61128 1089 7734 413 0 0 9215 0 0 0 0 0 0 0 3738 0 0 0 0 0 1445 0 0 0 0 0 0 0 0 2208 0 0 0 0 0 172 937 728 2618 16 0 0.4029
DNA/DTH 0 2528 1434 29335 3219 11 0 0 5818 0 0 0 0 0 0 0 1661 0 0 0 0 0 561 0 0 0 0 0 0 0 0 1438 0 0 0 0 0 275 1 699 471 0 0 0.3818
DNA/DTM 0 12351 6859 3702 348436 121 0 0 18988 0 0 0 0 0 0 0 25783 0 0 0 0 0 9673 0 0 0 0 0 0 0 0 6997 0 0 0 0 0 613 77 855 14894 59 0 0.2247
DNA/DTT 0 110 518 25 222 4388 0 0 202 0 0 0 0 0 0 0 138 0 0 0 0 0 3 0 0 0 0 0 0 0 0 709 0 0 0 0 0 0 0 2 856 4 0 0.3886
DNA/Heeaning 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/HelitrTM 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
DNA/Helitron 0 22063 11264 8713 96179 137 0 0 830187 0 0 0 0 0 0 0 154178 0 0 0 0 0 31254 0 0 0 0 0 0 0 0 20428 0 0 0 0 0 86 99 707 104423 93 0 0.3513
DNAa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
Gypsy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LT 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LT/Copia 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Cocovering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Copia 0 9424 17496 2292 67110 66 0 0 29216 0 0 0 0 0 0 0 972675 0 0 0 0 0 40664 0 0 0 0 0 0 0 0 128559 0 0 0 0 0 1222 8 317 955 4 0 0.2341
LTR/Copig 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Coping 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypnknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypof 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/Gypsy 0 10301 1774 622 26632 6 0 0 14217 0 0 0 0 0 0 0 49642 0 0 0 0 0 2675171 0 0 0 0 0 0 1 0 1111152 0 0 0 0 0 155 243 100 1210 1 0 0.3125
LTR/ering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/u 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/uly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unTR/unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/ung 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unking 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unkn 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unknowR/unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTR/unknown 0 6511 14195 3411 34422 135 0 0 28722 0 0 0 0 0 0 0 148343 0 0 0 0 0 799980 0 0 0 0 0 0 0 0 2486668 0 0 0 0 0 456 16 316 940 136 0 0.2944
LTR/unknownring 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTTR/Gypsy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
LTy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000
Lof 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
Ly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
MITE/DTA 0 2087 149 431 682 0 0 0 276 0 0 0 0 0 0 0 741 0 0 0 0 0 80 0 0 0 0 0 0 0 0 349 0 0 0 0 0 14944 0 14 503 0 0 0.2622
MITE/DTC 0 75 1473 0 79 0 0 0 105 0 0 0 0 0 0 0 23 0 0 0 0 0 103 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 6397 0 3 0 0 0.2264
MITE/DTH 0 41 226 930 666 1 0 0 508 0 0 0 0 0 0 0 129 0 0 0 0 0 30 0 0 0 0 0 0 0 0 79 0 0 0 0 0 6 0 5830 278 0 0 0.3317
MITE/DTM 0 998 1429 212 3682 232 0 0 534 0 0 0 0 0 0 0 598 0 0 0 0 0 400 0 0 0 0 0 0 0 0 854 0 0 0 0 0 362 2 511 33915 0 0 0.2244
MITE/DTT 0 1 0 0 43 73 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 1 0 0 0 327 0 0.2937
covering 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1.0000
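The Misclas_rate column above appears to equal one minus the diagonal count divided by the row total. This reading is inferred from the numbers rather than taken from the EDTA documentation; it reproduces, for example, the MITE/DTT and DNA/DTT rows. A minimal Python sketch of that calculation, using only the non-zero entries of those two rows (the function name `misclas_rate` is made up for illustration):

```python
def misclas_rate(row, label):
    """Fraction of a row's counts that fall outside its own (diagonal) column."""
    total = sum(row.values())
    if total == 0:
        raise ValueError("row has no counts")
    return 1 - row.get(label, 0) / total

# Non-zero entries copied from the matrix above.
mite_dtt = {
    "DNA/DTA": 1, "DNA/DTM": 43, "DNA/DTT": 73, "DNA/Helitron": 2,
    "LTR/unknown": 16, "MITE/DTA": 1, "MITE/DTT": 327,
}
dna_dtt = {
    "DNA/DTA": 110, "DNA/DTC": 518, "DNA/DTH": 25, "DNA/DTM": 222,
    "DNA/DTT": 4388, "DNA/Helitron": 202, "LTR/Copia": 138, "LTR/Gypsy": 3,
    "LTR/unknown": 709, "MITE/DTH": 2, "MITE/DTM": 856, "MITE/DTT": 4,
}
print(f"MITE/DTT: {misclas_rate(mite_dtt, 'MITE/DTT'):.4f}")  # matrix reports 0.2937
print(f"DNA/DTT:  {misclas_rate(dna_dtt, 'DNA/DTT'):.4f}")    # matrix reports 0.3886
```

Under this reading, the rows with a single count and a rate of 1.0000 are rows whose only entry sits off the diagonal.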

hkchi commented

Thank you for the awesome tool. Successfully finished three plant genomes (~3-3.5 Gb):

| Step | Time | Size |
|---|---|---|
| Raw LTR | 114-128h | 542-695Mb |
| Raw TIR | 56-99h | 743-790kb |
| Raw Helitron | 36-60h | 186-196Mb |
| Rest of pipeline | 52-53h | 56-66Mb |

Raw steps were run with EDTA v1.7.8 from bioconda, the rest with EDTA v1.8.5. All were run with 48 threads (Intel Platinum 8160F Skylake @ 2.1 GHz, RAM=192000M), with --sensitive 0 in the final step and a CDS file provided.

Cheers,
Kaichi

Hi,

Thanks for developing such cool software. I have successfully annotated the TEs of an insect genome.

I would like to know whether the library constructed by EDTA can be used as the consensus sequences for calculating the age of individual TEs.

Thank you,
Yang yi

Dear all, Shujun

Thank you for the awesome tool. Successfully finished one plant genome (Rhododendron, ~850M):

| Step | raw_fa size | Percent |
|---|---|---|
| TIR | 143M | 15.7% |
| LTR | 660M | 43.2% |
| Helitron | 15M | 1.6% |
| TElib | 16.9M | 60.96% (Total) |

command
EDTA.pl --genome genome.fa --anno 1 --threads 40 --step all --species others

The per-step timings are not very clear because the run was interrupted several times, but the total time was about 20h.

Sincerely,
Wen