High proportion of Unclassified TE
Hi Prof. Ou, thank you for your tools. I recently used EDTA2 to build a custom TE library for RepeatMasker, but a large proportion of the masked sequence is reported as unclassified TEs:
total length:  467763953 bp  (467698076 bp excl N/X-runs)
GC level:      34.52 %
bases masked:  268621698 bp ( 57.43 %)
==================================================
                                    number of       length   percentage
                                    elements*     occupied  of sequence
--------------------------------------------------
Retroelements                            1871     485234 bp   0.10 %
  SINEs:                                    0          0 bp   0.00 %
  Penelope                                  0          0 bp   0.00 %
  LINEs:                                  285     109823 bp   0.02 %
    CRE/SLACS                               0          0 bp   0.00 %
    L2/CR1/Rex                              0          0 bp   0.00 %
    R1/LOA/Jockey                           0          0 bp   0.00 %
    R2/R4/NeSL                              0          0 bp   0.00 %
    RTE/Bov-B                               0          0 bp   0.00 %
    L1/CIN4                               285     109823 bp   0.02 %
  LTR elements:                          1586     375411 bp   0.08 %
    BEL/Pao                                 0          0 bp   0.00 %
    Ty1/Copia                            1177     283215 bp   0.06 %
    Gypsy/DIRS1                           306      68694 bp   0.01 %
      Retroviral                           81      20171 bp   0.00 %
DNA transposons                          2281     962695 bp   0.21 %
  hobo-Activator                            0          0 bp   0.00 %
  Tc1-IS630-Pogo                            0          0 bp   0.00 %
  En-Spm                                    0          0 bp   0.00 %
  MuDR-IS905                                0          0 bp   0.00 %
  PiggyBac                                  0          0 bp   0.00 %
  Tourist/Harbinger                         0          0 bp   0.00 %
  Other (Mirage, P-element, Transib)        0          0 bp   0.00 %
Rolling-circles                            88      23473 bp   0.01 %
Unclassified:                          439889  261152899 bp  55.84 %
Total interspersed repeats:                    262600828 bp  56.15 %
Small RNA:                                  0          0 bp   0.00 %
Satellites:                                 0          0 bp   0.00 %
Simple repeats:                        117716    4583328 bp   0.98 %
Low complexity:                         28300    1414069 bp   0.30 %
I also checked the .out file generated by RepeatMasker and found that some known TEs (such as COPIA elements) were labeled Unspecified, for example:
447 30.7 6.8 5.0 VadGH6_chr01_hap1 27396 27687 (24248828) C COPIA-46_VV-I Unspecified (1037) 4377 4173 44
31416 8.8 0.6 0.3 VadGH6_chr01_hap1 37227 41692 (24234823) C COPIA-79_VV-I Unspecified (0) 4168 4 58
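To see how much of the masked sequence falls under each label, the class/family column of the .out file can be tallied directly. The following is a minimal sketch, assuming the standard whitespace-separated RepeatMasker .out layout in which the 6th and 7th fields are the query start and end and the 11th field is the repeat class/family; `tally_classes` is a hypothetical helper name, not part of RepeatMasker or EDTA.

```python
from collections import Counter

def tally_classes(lines):
    """Sum hit lengths (bp) per class/family label from RepeatMasker .out lines."""
    bp = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines: data rows start with an integer SW score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        bp[fields[10]] += end - begin + 1
    return bp

# The two rows quoted in this issue.
sample = [
    "447 30.7 6.8 5.0 VadGH6_chr01_hap1 27396 27687 (24248828) C COPIA-46_VV-I Unspecified (1037) 4377 4173 44",
    "31416 8.8 0.6 0.3 VadGH6_chr01_hap1 37227 41692 (24234823) C COPIA-79_VV-I Unspecified (0) 4168 4 58",
]
for label, total in tally_classes(sample).most_common():
    print(label, total)
```

This counts overlapping hits twice, so the totals are only a rough guide and will not exactly match the .tbl summary.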
I used EDTA2 to construct the TE library, TEsorter to classify the unknown TEs, and RepeatMasker (instead of EDTA2 with `--step anno`) for TE masking. Is there any way to improve my results? Would running EDTA2 with `--step anno` work instead? I would also like to use LAI to check the quality of my assembly. Can I do that with these results?
Best regards,
Xiukun
Thank you for your reply. Is the expected format one that uses `#` and `/` to indicate the TE classification, like this:
>TE_00001056_INT#LTR/unknown
It seems TEsorter could help, thanks!
@yaoxkkkkk yes, that's the right formatting.
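A quick way to check a library against this naming convention is to parse each FASTA header and flag those without a `#` classification. The sketch below assumes the `>name#Class/Superfamily` layout shown in the example above, with the `/Superfamily` part treated as optional; `parse_header` and `HEADER_RE` are hypothetical names introduced for illustration.

```python
import re

# name, then '#', then class, then an optional '/superfamily'.
HEADER_RE = re.compile(r"^>?(?P<name>[^#\s]+)#(?P<cls>[^/\s]+)(?:/(?P<superfamily>\S+))?")

def parse_header(header):
    """Return (name, class, superfamily) or None if the header has no '#Class' part."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("name"), match.group("cls"), match.group("superfamily")

print(parse_header(">TE_00001056_INT#LTR/unknown"))
print(parse_header(">COPIA-46_VV-I"))  # no classification in the header
```

Headers for which this returns `None` carry no classification in their names, which would be consistent with hits such as the COPIA entries above being reported as Unspecified.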
Hi Prof. Ou, thank you for your patient reply; it helps me a lot. Since EDTA.TElib.novel.fa, one of the EDTA2 output files, contains the TEs newly discovered in the input genome, here is what I would like to do based on my existing results:
- modify the header naming format of the curated library provided via `--curatedlib`
- further classify the sequences in EDTA.TElib.novel.fa
- concatenate the two files produced by the steps above and use the result as the library for the subsequent RepeatMasker run

Will this work? I would like to give it a try.
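The relabeling and concatenation steps in this plan could be sketched roughly as follows. This is an illustration only, not EDTA's or TEsorter's own procedure: `relabel` is a hypothetical helper, the `classes` mapping is made up (in practice it would come from the curated library's documentation or a classifier's output), and the sequences are placeholders.

```python
def relabel(lines, classes, default="Unknown"):
    """Yield FASTA lines, appending '#Class/Superfamily' to headers that lack it."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and "#" not in line:
            parts = line[1:].split()
            if parts:  # leave a bare '>' untouched
                name = parts[0]
                line = f">{name}#{classes.get(name, default)}"
        yield line

# Placeholder records standing in for the curated and novel libraries.
curated = [">COPIA-46_VV-I", "ACGT"]
novel = [">TE_00001056_INT#LTR/unknown", "GGCC"]
classes = {"COPIA-46_VV-I": "LTR/Copia"}  # hypothetical name-to-class mapping

combined = list(relabel(curated, classes)) + list(relabel(novel, {}))
print("\n".join(combined))
```

Headers that already contain `#` are passed through unchanged, so the novel library keeps whatever classification it already has.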
If you provide your curated library with the correct formatting to EDTA via `--curatedlib` and invoke `--anno 1`, you should get the whole-genome annotation by the end of your EDTA run. EDTA has the classification and annotation functions built in.