oushujun/EDTA

High proportion of Unclassified TE

Closed this issue · 5 comments

Hi Prof. Ou, thank you for your tools. I recently used EDTA2 to build a custom TE library for RepeatMasker, but a high proportion of the masked sequence was reported as unclassified:

total length:  467763953 bp  (467698076 bp excl N/X-runs)
GC level:         34.52 %
bases masked:  268621698 bp ( 57.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         1871       485234 bp    0.10 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:              285       109823 bp    0.02 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4           285       109823 bp    0.02 %
   LTR elements:      1586       375411 bp    0.08 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia        1177       283215 bp    0.06 %
     Gypsy/DIRS1       306        68694 bp    0.01 %
       Retroviral       81        20171 bp    0.00 %

DNA transposons       2281       962695 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles         88        23473 bp    0.01 %

Unclassified:       439889    261152899 bp   55.84 %

Total interspersed repeats:   262600828 bp   56.15 %


Small RNA:               0            0 bp    0.00 %

Satellites:              0            0 bp    0.00 %
Simple repeats:     117716      4583328 bp    0.98 %
Low complexity:      28300      1414069 bp    0.30 %

When I checked the .out file generated by RepeatMasker, I found that some known TEs (such as COPIA) were labeled Unspecified, for example:

   447   30.7  6.8  5.0  VadGH6_chr01_hap1     27396    27687 (24248828) C COPIA-46_VV-I                                                Unspecified              (1037)   4377            4173     44
 31416    8.8  0.6  0.3  VadGH6_chr01_hap1     37227    41692 (24234823) C COPIA-79_VV-I                                                Unspecified                 (0)   4168               4     58
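To see how many library entries are affected, the .out file can be tallied by repeat name. The following is a minimal sketch, assuming the standard 15-column whitespace-separated .out layout shown in the two lines above; the function names are hypothetical and not part of RepeatMasker or EDTA.

```python
from collections import Counter

def parse_out_line(line):
    """Return the fields of interest from one RepeatMasker .out data line,
    or None for header and blank lines."""
    fields = line.split()
    # Data lines start with an integer Smith-Waterman score and have
    # at least 15 columns (a trailing '*' may add a 16th).
    if len(fields) < 15 or not fields[0].isdigit():
        return None
    return {
        "query": fields[4],
        "start": int(fields[5]),
        "end": int(fields[6]),
        "repeat": fields[9],
        "class": fields[10],
    }

def count_unspecified(lines):
    """Count hits per repeat name whose class/family column is 'Unspecified'."""
    counts = Counter()
    for line in lines:
        record = parse_out_line(line)
        if record is not None and record["class"] == "Unspecified":
            counts[record["repeat"]] += 1
    return counts

# The two example lines from the .out file quoted above.
sample = [
    "   447   30.7  6.8  5.0  VadGH6_chr01_hap1     27396    27687 (24248828) C COPIA-46_VV-I  Unspecified  (1037)   4377   4173     44",
    " 31416    8.8  0.6  0.3  VadGH6_chr01_hap1     37227    41692 (24234823) C COPIA-79_VV-I  Unspecified     (0)   4168      4     58",
]
for name, n in count_unspecified(sample).most_common():
    print(name, n)
```

Repeat names that show up here are the library sequences whose FASTA headers carry no classification that RepeatMasker could read.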

I used EDTA2 for TE library construction, TEsorter to classify unknown TEs, and RepeatMasker (instead of EDTA2 with --step anno) for TE masking. Is there any way to improve my results? Would running EDTA2 with --step anno work? I would also like to use LAI to check the quality of my assembly; can I do that with these results?

Best regards,
Xiukun

Thank you for your reply. Does the format use # and / to indicate the TE classification, like this:

>TE_00001056_INT#LTR/unknown

It seems TEsorter could help, thanks!

@yaoxkkkkk yes, that's the right formatting.
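A quick way to check a library for headers that lack this `name#Class/Superfamily` form is sketched below. It is only an illustration under the assumption that the name contains no `#` or whitespace; the regular expression and function name are hypothetical, not something EDTA or RepeatMasker provides.

```python
import re

# name, then '#', then class, then an optional '/superfamily'
HEADER_RE = re.compile(
    r"^>?(?P<name>[^#\s]+)#(?P<cls>[^/\s]+)(?:/(?P<superfamily>\S+))?"
)

def parse_te_header(header):
    """Return (name, class, superfamily) for a classified header,
    or None if the header has no '#Class' part."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return match.group("name"), match.group("cls"), match.group("superfamily")

print(parse_te_header(">TE_00001056_INT#LTR/unknown"))
print(parse_te_header(">COPIA-46_VV-I"))  # no classification in the header
```

Headers for which the function returns `None` are the ones that would need to be renamed before the library is used for masking.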

Hi Prof. Ou, thank you for your patient reply; it helps me a lot. Since EDTA.TElib.novel.fa, one of the EDTA2 output files, contains the TEs newly discovered in the input genome, here is what I want to do on the basis of the existing results:

  1. Modify the naming format of the curated library provided via --curatedlib
  2. Further classify the sequences in EDTA.TElib.novel.fa
  3. Concatenate the two files produced by the steps above and pass the result to RepeatMasker for masking
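Steps 1 and 3 of this plan could look roughly like the sketch below. It assumes the classification of each curated entry is supplied by hand as a name-to-label mapping; the mapping content, the `Unknown` fallback label, and the function names are all hypothetical and only serve the illustration. Step 2 (classifying the novel library) is left to an external tool and is not shown.

```python
def relabel_fasta(lines, classification, default="Unknown"):
    """Yield FASTA lines, appending '#Class/Superfamily' to headers
    that do not already carry a '#' classification."""
    for line in lines:
        line = line.rstrip("\n")
        parts = line[1:].split() if line.startswith(">") else []
        if parts and "#" not in parts[0]:
            name = parts[0]
            # Fall back to a placeholder label when the name is not in the mapping.
            yield f">{name}#{classification.get(name, default)}"
        else:
            yield line

def merge_libraries(curated_lines, novel_lines, classification):
    """Relabel the curated library, then append the novel library unchanged."""
    merged = list(relabel_fasta(curated_lines, classification))
    merged.extend(line.rstrip("\n") for line in novel_lines)
    return merged

# Toy input: sequences and the mapping are made up for illustration.
curated = [">COPIA-46_VV-I some description\n", "ACGTACGT\n"]
novel = [">TE_00001056_INT#LTR/unknown\n", "TTGACCAA\n"]
labels = {"COPIA-46_VV-I": "LTR/Copia"}

for out_line in merge_libraries(curated, novel, labels):
    print(out_line)
```

For real files, the line lists would come from `open(path)` and the merged lines would be written to a new FASTA file that is then passed to RepeatMasker as the custom library.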

Will it work? I would like to give it a try.

If you provide your curated library to EDTA in the correct format with --curatedlib and set --anno 1, you should get the whole-genome annotation by the end of your EDTA run. EDTA has the classification and annotation functions built in.