High proportion of Unclassified TE

Question

Hi Prof. Ou, thank you for your tools. I recently used EDTA2 to build a custom TE library for RepeatMasker, but the masking results show a very high proportion of unclassified TEs:

total length:  467763953 bp  (467698076 bp excl N/X-runs)
GC level:         34.52 %
bases masked:  268621698 bp ( 57.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         1871       485234 bp    0.10 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:              285       109823 bp    0.02 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4           285       109823 bp    0.02 %
   LTR elements:      1586       375411 bp    0.08 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia        1177       283215 bp    0.06 %
     Gypsy/DIRS1       306        68694 bp    0.01 %
       Retroviral       81        20171 bp    0.00 %

DNA transposons       2281       962695 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles         88        23473 bp    0.01 %

Unclassified:       439889    261152899 bp   55.84 %

Total interspersed repeats:   262600828 bp   56.15 %


Small RNA:               0            0 bp    0.00 %

Satellites:              0            0 bp    0.00 %
Simple repeats:     117716      4583328 bp    0.98 %
Low complexity:      28300      1414069 bp    0.30 %

When I checked the .out file generated by RepeatMasker, I found that some known TEs (such as COPIA elements) were labeled as Unspecified, for example:

   447   30.7  6.8  5.0  VadGH6_chr01_hap1     27396    27687 (24248828) C COPIA-46_VV-I                                                Unspecified              (1037)   4377            4173     44
 31416    8.8  0.6  0.3  VadGH6_chr01_hap1     37227    41692 (24234823) C COPIA-79_VV-I                                                Unspecified                 (0)   4168               4     58

I used EDTA2 to construct the TE library, TEsorter to classify the unknown TEs, and RepeatMasker (instead of EDTA2 with --step anno) for TE masking. Is there any way to improve my results? Would running EDTA2 with --step anno work? I would also like to use LAI to assess the quality of my assembly; can I do that with these results?

Best regards,
Xiukun

Answer 1 · 2024-11-29T02:35:58.000Z

If you provide a curated library, make sure the sequence names follow the RepeatMasker naming convention.

Shujun

Answer 2 · 2024-11-29T02:41:28.000Z

Thank you for your reply. Is this the format that uses # and / to indicate the TE classification, for example:

>TE_00001056_INT#LTR/unknown

It seems TEsorter could help, thanks!

Answer 3 · 2024-12-01T15:08:17.000Z

@yaoxkkkkk yes, that's the right formatting.
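
As a rough check of the naming convention confirmed above, the sketch below scans FASTA headers and reports those that lack a `#Class` or `#Class/Superfamily` suffix. Entries without that suffix, like the `COPIA-46_VV-I` hits in the question, are likely what RepeatMasker reports as Unspecified. The function name and the regular expression are illustrative assumptions, not part of EDTA or RepeatMasker.

```python
import re

# Assumed pattern for a RepeatMasker-style header: ">name#Class" or
# ">name#Class/Superfamily", optionally followed by a free-text description.
HEADER_RE = re.compile(
    r"^>(?P<name>[^\s#]+)#(?P<cls>[^\s/#]+)(?:/(?P<superfamily>[^\s#]+))?(?:\s.*)?$"
)

def unclassified_headers(fasta_lines):
    """Return header lines that lack a '#Class[/Superfamily]' suffix."""
    return [
        line.rstrip("\n")
        for line in fasta_lines
        if line.startswith(">") and not HEADER_RE.match(line.rstrip("\n"))
    ]

# Two headers taken from the thread; the sequences are placeholders.
example = [
    ">TE_00001056_INT#LTR/unknown\n",
    "ACGTACGT\n",
    ">COPIA-46_VV-I\n",
    "TTGACCA\n",
]
print(unclassified_headers(example))  # ['>COPIA-46_VV-I']
```

Running this over a curated library before passing it to --curatedlib or RepeatMasker gives a quick list of the entries that still need a classification suffix.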

Answer 4 · 2024-12-02T09:15:47.000Z

Hi Prof. Ou, thank you for your patient reply; it helps me a lot. Since EDTA.TElib.novel.fa, one of the EDTA2 output files, contains the TEs newly discovered in the input genome, here is what I plan to do on the basis of my existing results:

1. Modify the naming format of the curated library provided via --curatedlib.
2. Further classify the sequences in EDTA.TElib.novel.fa.
3. Concatenate the two files produced by the steps above and use the result as the RepeatMasker library for the next round of masking.

Will this work? I would like to give it a try.
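
The renaming and concatenation steps proposed above could be sketched roughly as follows. This is a minimal illustration, not an EDTA or RepeatMasker utility: the prefix-to-class map, the function names, and the fallback class `Unknown` are all assumptions, and the classification of EDTA.TElib.novel.fa (for example with TEsorter) is assumed to have been done separately.

```python
from pathlib import Path

# Hypothetical prefix-to-class map; extend it to match the curated library in use.
PREFIX_TO_CLASS = {"COPIA": "LTR/Copia", "GYPSY": "LTR/Gypsy"}

def add_class_suffix(header, prefix_to_class=PREFIX_TO_CLASS, default="Unknown"):
    """Append '#Class/Superfamily' to a FASTA header unless it already has one."""
    parts = header[1:].split()
    if not parts or "#" in parts[0]:
        return header  # empty or already classified: leave untouched
    name = parts[0]
    # Take the text before the first '-' or '_' as the family prefix.
    prefix = name.split("-")[0].split("_")[0].upper()
    return f">{name}#{prefix_to_class.get(prefix, default)}"

def merge_libraries(curated_fa, novel_fa, out_fa):
    """Write the renamed curated library followed by the novel library to out_fa."""
    with open(out_fa, "w") as out:
        for path, rename in ((curated_fa, True), (novel_fa, False)):
            for line in Path(path).read_text().splitlines():
                if rename and line.startswith(">"):
                    line = add_class_suffix(line)
                out.write(line + "\n")

print(add_class_suffix(">COPIA-46_VV-I"))  # >COPIA-46_VV-I#LTR/Copia
```

A call such as `merge_libraries("curated.fa", "EDTA.TElib.novel.fa", "combined_lib.fa")` would then produce a single library file; the first and last file names here are placeholders.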

Answer 5 · 2024-12-02T14:55:23.000Z

If you provide your curated library with the correct formatting to EDTA via --curatedlib and set --anno 1, you should get the whole-genome annotation by the end of your EDTA run. EDTA has the classification and annotation functions built in.
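
For illustration, the invocation described in this answer might be assembled as below. The sketch only builds and prints the command line; it does not run EDTA. The file names and the thread count are placeholders, and it assumes the EDTA.pl entry point is on the PATH.

```python
import shlex

def edta_command(genome, curated_lib, threads=4):
    """Build (but do not run) an EDTA command with whole-genome annotation enabled."""
    return [
        "EDTA.pl",
        "--genome", genome,            # input assembly (placeholder name below)
        "--curatedlib", curated_lib,   # curated library with name#Class/Superfamily headers
        "--anno", "1",                 # request whole-genome TE annotation
        "--threads", str(threads),
    ]

cmd = edta_command("genome.fa", "curated_lib.renamed.fa")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.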