sestaton/Transposome

GenomeFrac is too high

etwatson opened this issue · 7 comments

Hello again. Initial masking of our genome assembly using RepBase’s library of arthropod repeats masked only 2.13% of the genome, whereas supplying a de novo repeat library (from RepeatModeler) to RepeatMasker masked a substantially higher proportion of the genome before annotation (28.02%). Other de novo methods (RepARK, tedna) also estimate ~30% TEs in the genome.

However, Transposome gives me a total repeat fraction of 80%, which is much too high. I have also tried many runs with different parameters and always get the same result.

INFO - Configuration - Sequence number for each BLAST process:      20000
INFO - Configuration - Number of CPUs per thread:                   1
INFO - Configuration - Number of threads:                           6
INFO - Configuration - Output directory:                            SD-200_100k_results_out
INFO - Configuration - In-memory analysis:                          0
INFO - Configuration - Percent identity for matches:                98
INFO - Configuration - Fraction coverage for pairwise matches:      0.98
INFO - Configuration - Merge threshold for clusters:                0.001
INFO - Configuration - Minimum cluster size for annotation:        100
INFO - Configuration - BLAST e-value threshold for annotation:      10
INFO - Configuration - Repeat database for annotation:              /usr/share/repbase/ArthroTE.fa
INFO - Results - Total sequences:                        100000
INFO - Results - Total sequences clustered:              79500
INFO - Results - Total sequences unclustered:            20500
INFO - Results - Repeat fraction from clusters:          0.795
INFO - Results - Singleton repeat fraction:              0.00907317073170732
INFO - Results - Total repeat fraction:                  0.79686
INFO - Results - Total repeat fraction from annotations: 0.79500000000477
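For reference, the fractions in the log above are consistent with simple read-count arithmetic. The sketch below is inferred only from the logged numbers, not from Transposome's source or documentation, so treat the formulas and variable names as assumptions: the cluster fraction as clustered reads over total reads, the singleton fraction as a proportion of the unclustered reads, and the total as the sum of both read counts over the total.

```python
# Values copied from the log above.
total_reads = 100_000
clustered = 79_500
unclustered = 20_500
singleton_fraction = 0.00907317073170732  # "Singleton repeat fraction"

# Assumption: cluster fraction = clustered reads / total reads.
cluster_fraction = clustered / total_reads

# Assumption: the singleton fraction is relative to the unclustered reads,
# which implies about 186 unclustered reads were counted as repeats.
singleton_hits = round(singleton_fraction * unclustered)

# Assumption: total repeat fraction = (clustered + singleton hits) / total.
total_fraction = (clustered + singleton_hits) / total_reads

print(cluster_fraction)          # matches "Repeat fraction from clusters"
print(singleton_hits)
print(round(total_fraction, 5))  # matches "Total repeat fraction"
```

Under these assumptions, almost all of the reported total comes from the clustered reads; the singletons contribute less than 0.2% of the sample.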

What type of sequences are going into the analysis? Also, it would help if you could show the first 10 lines or so of the annotation summary file. There is a very high correspondence between the predicted repeat fraction and the fraction identified in the reference library, which is a good sign. What is the reference library composed of? If it is from RepeatModeler, it is worth noting that there is a lot of error there because of the way repetitive sequences are assembled (meaning they are not real TEs, just repetitive sequences).

There are almost no singletons in this data set, which looks like it is either from a highly repetitive genome or a non-random sample. I would take multiple samples and compare the results. Occasionally you may see an odd result that can be resolved by looking at multiple samples.

The reference library is made of only arthropod repeats from RepBase.
The sequences come from a pooled extraction of hundreds of individuals.
I have sampled reads many times, using a different seed each time, and get the same results.
When I take a much smaller sample and compare 100 Transposome runs, I get an overall GenomeFrac that is over 600%, so I am wary of combining multiple samples.

Could a higher than expected GenomeFrac indicate that there is high replicative transposition activity in some tissues, since we've extracted hundreds of whole individuals?

First 10 lines of the annotation summary file:

ReadNum  Superfamily   Family                 ReadCt/ReadsWithHit  HitPerc         GenomeFrac
100000   L2            L2-10_Hmel             73/187               0.390374331551  0.310347593583045
100000   R1            R1-2_PBa               17/187               0.090909090909  0.072272727272655
100000   Gypsy         Gypsy-24_DEl-I         11/187               0.058823529412  0.04676470588254
100000   Polinton      Polinton-3_TC          9/187                0.048128342246  0.03826203208557
100000   Helitron      Helitron-like-6b_Hmel  6/187                0.032085561497  0.025508021390115
100000   Gypsy         Gypsy-14_DAn-I         5/187                0.026737967914  0.02125668449163
100000   Polinton      Polinton-5_NVi         5/187                0.026737967914  0.02125668449163
100000   Copia         Copia-1_RP-LTR         4/187                0.021390374332  0.01700534759394
100000   unclassified  Helitron-like-3_Hmel   4/187                0.021390374332  0.01700534759394
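The GenomeFrac column in these rows can be reproduced numerically as HitPerc multiplied by the clustered fraction from the log (0.795). This relationship is inferred from the numbers shown here, not taken from Transposome's documentation, and the function name below is hypothetical. A minimal check on the first three rows:

```python
import math

# "Repeat fraction from clusters" from the run log (79500 / 100000).
CLUSTER_FRACTION = 0.795

# (family, ReadCt, ReadsWithHit, reported GenomeFrac), copied from the table.
rows = [
    ("L2-10_Hmel", 73, 187, 0.310347593583045),
    ("R1-2_PBa", 17, 187, 0.072272727272655),
    ("Gypsy-24_DEl-I", 11, 187, 0.04676470588254),
]


def genome_frac(read_ct, reads_with_hit, cluster_fraction=CLUSTER_FRACTION):
    """Hypothetical helper: share of annotated reads scaled by the clustered fraction."""
    return read_ct / reads_with_hit * cluster_fraction


for family, read_ct, reads_with_hit, reported in rows:
    estimate = genome_frac(read_ct, reads_with_hit)
    agrees = math.isclose(estimate, reported, abs_tol=1e-9)
    print(f"{family}: estimate={estimate:.6f} reported={reported:.6f} agrees={agrees}")
```

If this reading is right, each family's share is computed relative to the 187 reads with a hit and then scaled to the 79.5% of reads that clustered, so the per-family values depend heavily on how representative those 187 reads are.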

Unfortunately, we can't say anything about tissue-specific activity without having those specific libraries, especially with pooled data. I'm not sure what 600% means (typo?), but I suspect that pooling of the individuals is throwing off the results because all the genes would also be repetitive and cluster together. If you look at the individual cluster FASTA files, it should be clear how many reads actually match a TE in the reference vs. how many do not. If only a few sequences are matching TE sequences, that would be the source of error because the cluster is assumed to be composed of the majority match type (i.e., best hit). With the constraints you used, this shouldn't be too much of an issue, but I would look at the total cluster composition (Transposome reports the best hit, which works well for the species I've tested).
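As a rough way to do that check, the sketch below computes what fraction of the reads in one cluster FASTA file appear in a set of read IDs that had a hit to the reference. It is only an illustration: the function names are hypothetical, the toy data is made up, and where the hit IDs come from (for example, the query column of a tabular BLAST report) is an assumption rather than something Transposome is known to write out in this form.

```python
def read_ids(fasta_text):
    """Return the read ID (first token of each '>' header) for every record."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers that carry no ID
                ids.append(tokens[0])
    return ids


def hit_fraction(fasta_text, hit_ids):
    """Fraction of reads in one cluster whose ID is in hit_ids."""
    ids = read_ids(fasta_text)
    if not ids:
        return 0.0
    matched = sum(1 for read_id in ids if read_id in hit_ids)
    return matched / len(ids)


# Toy cluster with three reads, two of which have a hit (made-up data).
cluster = ">r1\nACGT\n>r2 some description\nGGCC\n>r3\nTTAA\n"
hits = {"r1", "r3"}
print(hit_fraction(cluster, hits))
```

A low value for a large cluster would suggest that the cluster's annotation rests on only a few matching reads.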

Because Transposome is cluster-based, it may report higher estimates than read composition-based analyses due to the inflated copy number of all sequences in pooled data. You may want to take some WGS data from a closely related species whose repeat fraction you know for certain, and run Transposome on that. If the results are what you expect, then I would have to conclude that Transposome may not be suitable for this pooled analysis. Not to be discouraging, but that is how I would establish a baseline to see whether the results are indeed due to the pooling or whether something about the biology requires adjusting how Transposome is run.

All of the reads in cluster FASTA files match TEs.
I can get sequences from an Illumina library of a single individual, so I will get back to you on that.
Thanks again.

Any updates on this one? Thanks.

I am going to close this issue because there is nothing to be done. Feel free to comment and I'll get back to it.

Essentially, this tool is designed for a single individual/genome, and the issue here is not something I can investigate further with the information we have.

Also, check for contamination as mentioned on the wiki. If you have ribosomal repeats or mitochondria-derived reads in your data, pooling the samples will inflate the repeat level further.