marbl/CHM13

Huge discrepancy in RepeatMasker lists between the UCSC version and the NCBI version

pliang64 opened this issue · 13 comments

Hi Team,

I noticed that there is a huge differences between rmsk.bigBed and GCF_009914755.1_T2T-CHM13v2.0_rm.out as shown below:

5,586,304 rmsk.bed (converted from rmsk.bigBed using bigBedTobed)
11,369,721 GCF_009914755.1_T2T-CHM13v2.0_rm.out

The difference is more than 5 million entries. Is this due to the use of different repeat libraries or filtering for specific repeat types or something else?

Thank you,
Ping Liang

As an update, according to my analysis, the rmsk.bigBed version is missing some critical entries (e.g. full length L1s with full coding capacity).

Hello,

The assembly we released here has been soft-masked using the repeat models discovered by the T2T team. The associated repeat annotation is available here, as well as in bed format.

Could you point me where the above files you are comparing were obtained?

Here are the urls for the two files: http://t2t.gi.ucsc.edu/chm13/hub/t2t-chm13-v2.0/rmsk/rmsk.bigBed and https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.out.gz. I like the UCSC version for having the chromosome number, but didn't expect to see that much of differences. I checked my file sizes and they are matching the sources, so truncated files shouldn't be a cause.

This file - https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.bb
is the most recent repeat masker track, matching the file I linked in my comment above. Here is the full description: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&g=hub_3267197_t2tRepeatMasker

Let us know if this is still missing critical entries.

The repeat models we used weren't officially released yet, so my guess is that the version on NCBI RefSeq rm uses a slightly older version.
The num. entry discrepancy could be an artifact of fragmented repeat element annotation. I'll let others to chime in.

Thank you Arrang for providing the new location of the data. However, the number of entries in chm13v2.0_rmsk.bb is even smaller (~900,000 less) than what's in rmsk.bigBed, unless I did something wrong in the conversion. FYI, below is what I did: bigBedToBed chm13v2.0_rmsk.bb chm13v2.0_rmsk.bed. It ran without any error message and the output looks to have the right bed content. Let me know where the problem is.

@pliang64 Greetings!

I am working on a full response as we speak. I was hoping to have one last piece of information in order to respond to all of your points. Could you please attach a file of the locations of the aforementioned missing LINE elements?

To fully respond to your inquiry, we have run buildSummary.pl, a utility provided as a part of the RepeatMasker (RM) package, on the NCBI https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.out.gz file as well as the RM .out file generated as part of the T2T repeat analysis.

The NCBI RM indicates that 54.96% of the CHM13 genome was masked, while 54.46 % of the genome was masked in the T2T RM analysis. There are some slight, but not astronomical differences in the output, which are likely born out of the different portions of the RM runs. However, the discrepancy you described occurs in the overall lines of the outputs:

GCF_009914755.1_T2T-CHM13v2.0_rm.out 11369721
GCF_009914755.1_T2T-CHM13v2.0_rm.bed 11369718

chm13v2.0_rmsk.out 5590282
chm13v2.0_rmsk.bed 4636653

summary_NCBI.txt
summary_T2T.txt

The lower numbers of .out compared to the .bed files are because as a part of the conversion from .out to .bed there is merging of overlapping fragments, even if those fragments only overlap by 1 nucleotide.

The overall higher number in the NCBI .out and .bed output could possibly be because there is more fragmentation in the NCBI RM output, as the %repeats masked overall in the two outputs (NCBI vs T2T) are very similar.

For the RM NCBI and T2T runs, the NCBI parameters were as follows

(see: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.run):

RepeatMasker version 4.1.0, sensitive mode
run with blastp version 2.0MP-WashU [01-Jan-2006] [linux24-i786-ILP32F64 2006-01-02T05:13:21]
RepeatMasker Combined Database: Dfam_3.1

$ RepeatMasker -engine wublast -species 'homo sapiens' -s -no_is -cutoff 255 CHM13_genome.fa

Our parameters were similar, but run with different versions of RM and Dfam:

Run #1:

RepeatMasker version 4.1.2-p1, sensitive mode, with the NCBI/RMBLAST [2.9.0+],
Dfam 3.3 + RepBase.

$ RepeatMasker -s -species “human” CHM13_genome.fa -engine rmblast

Run #2:

$ RepeatMasker -s -lib newRepeats.fa -engine rmblast CHM13_genome.fa

Running the RM program with different engines and slightly different parameter settings (i.e., -species vs. -lib) will lead to slight discrepancies. In addition, the version of RM was different between the two runs, and the RM was run on a cluster with a specialized Nextflow RM script. Also, RM can run slightly differently (emphasis on slight) every time it is updated.

We also used a different approach to combine the RM output of two different RM runs. This is because as part of the T2T repeat analysis, there were composite repeats discovered. Like the SVA element, which contains an Alu-like sequence, there can be false positives, e.g., an Alu element can be annotated as an SVA, and vice versa. Similarity, the composite repeats contained fragments of LTR, DNA, SINE, and/or LINE elements, and thus a specialized pipeline was designed to combine the output of two RM runs. The first RM run was completed on an unmasked genome, with the aforementioned settings. The second RM run was completed on a hard-masked genome based on the output of the first RM run with the aforementioned settings.

RM also takes pains to calculate the %GC content for each portion of the genome in order to accurately annotate repeats. Because the second RM run was completed on a hard-masked genome, in contrast to the one total run of RM for the NCBI output, there is a possibility that the %GC content changes enough to make slight changes to the annotated repeats.

@pliang64

I analyzed the insertions you indicated in list analysis above, specifically for the 9 insertions that you indicated were missing from the T2T output. After searching the original T2T RepeatMasker output, I have found all 9 of the loci in the T2T data (see attached:
NCBI_vs_T2T_LINE.xlsx)

It is possible that some insertions/TE annotations will differ between the NCBI version and the T2T version due to the differences in search engine and general approach/pipeline applied (see previous reply for details). For example, the chr2 L1PA3 you indicated was missing the NCBI dataset (present in T2T; missing in NCBI) shows an Alu at the same location in NCBI.

I cannot attach large files (original RepeatMasker .out file output) to this message. However, the data should be publicly available for download for both the T2T and NCBI data.

The bigBed for the T2T might have additional merging that is separated in the .out file as compared to the NCIB bigBed file. We will look into this as well.

Thanks Jessica!

Ping, FYI, the original T2T RepeatMasker .out file is here: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RepeatMasker_4.1.2p1.2022Apr14.out

Please use this .out file until we have a better understanding of the .bb file.

I looked at the " 9 extra entries in the ncbi list" mentioned above, and all of them are in the UCSC CHM13 browser. This track is from this bigBed

https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.bb

So I am puzzled by what you are seeing