Huge discrepancy in RepeatMasker lists between the UCSC version and the NCBI version

Question

Huge discrepancy in RepeatMasker lists between the UCSC version and the NCBI version

pliang64 opened this issue 2 years ago · 13 comments

Hi Team,

I noticed that there is a huge differences between rmsk.bigBed and GCF_009914755.1_T2T-CHM13v2.0_rm.out as shown below:

5,586,304 rmsk.bed (converted from rmsk.bigBed using bigBedTobed)
11,369,721 GCF_009914755.1_T2T-CHM13v2.0_rm.out

The difference is more than 5 million entries. Is this due to the use of different repeat libraries or filtering for specific repeat types or something else?

Thank you,
Ping Liang

Answer 1 · 2022-08-16T11:12:38.000Z

Good question.

…

On Aug 16, 2022, at 6:57 AM, Liang Lab at Brock University ***@***.***> wrote: Hi Team, I noticed that there is a huge differences between rmsk.bigBed and GCF_009914755.1_T2T-CHM13v2.0_rm.out as shown below: 5,586,304 rmsk.bed (converted from rmsk.bigBed using bigBedTobed) 11,369,721 GCF_009914755.1_T2T-CHM13v2.0_rm.out The difference is more than 5 million entries. Is this due to the use of different repeat libraries or filtering for specific repeat types or something else? Thank you, Ping Liang — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.

Answer 2 · 2022-08-16T13:42:54.000Z

As an update, according to my analysis, the rmsk.bigBed version is missing some critical entries (e.g. full length L1s with full coding capacity).

Answer 3 · 2022-08-16T13:45:33.000Z

Hello,

The assembly we released here has been soft-masked using the repeat models discovered by the T2T team. The associated repeat annotation is available here, as well as in bed format.

Could you point me where the above files you are comparing were obtained?

Answer 4 · 2022-08-16T13:52:22.000Z

Here are the urls for the two files: http://t2t.gi.ucsc.edu/chm13/hub/t2t-chm13-v2.0/rmsk/rmsk.bigBed and https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.out.gz. I like the UCSC version for having the chromosome number, but didn't expect to see that much of differences. I checked my file sizes and they are matching the sources, so truncated files shouldn't be a cause.

Answer 5 · 2022-08-16T14:17:34.000Z

This file - https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.bb
is the most recent repeat masker track, matching the file I linked in my comment above. Here is the full description: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&g=hub_3267197_t2tRepeatMasker

Let us know if this is still missing critical entries.

The repeat models we used weren't officially released yet, so my guess is that the version on NCBI RefSeq rm uses a slightly older version.
The num. entry discrepancy could be an artifact of fragmented repeat element annotation. I'll let others to chime in.

Answer 6 · 2022-08-16T14:34:03.000Z

Thank you Arrang for providing the new location of the data. However, the number of entries in chm13v2.0_rmsk.bb is even smaller (~900,000 less) than what's in rmsk.bigBed, unless I did something wrong in the conversion. FYI, below is what I did: bigBedToBed chm13v2.0_rmsk.bb chm13v2.0_rmsk.bed. It ran without any error message and the output looks to have the right bed content. Let me know where the problem is.

Answer 7 · 2022-08-22T21:47:40.000Z

@pliang64 Greetings!

I am working on a full response as we speak. I was hoping to have one last piece of information in order to respond to all of your points. Could you please attach a file of the locations of the aforementioned missing LINE elements?

Answer 8 · 2022-08-22T23:04:27.000Z

To fully respond to your inquiry, we have run buildSummary.pl, a utility provided as a part of the RepeatMasker (RM) package, on the NCBI https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.out.gz file as well as the RM .out file generated as part of the T2T repeat analysis.

The NCBI RM indicates that 54.96% of the CHM13 genome was masked, while 54.46 % of the genome was masked in the T2T RM analysis. There are some slight, but not astronomical differences in the output, which are likely born out of the different portions of the RM runs. However, the discrepancy you described occurs in the overall lines of the outputs:

GCF_009914755.1_T2T-CHM13v2.0_rm.out 11369721
GCF_009914755.1_T2T-CHM13v2.0_rm.bed 11369718

chm13v2.0_rmsk.out 5590282
chm13v2.0_rmsk.bed 4636653

summary_NCBI.txt
summary_T2T.txt

The lower numbers of .out compared to the .bed files are because as a part of the conversion from .out to .bed there is merging of overlapping fragments, even if those fragments only overlap by 1 nucleotide.

The overall higher number in the NCBI .out and .bed output could possibly be because there is more fragmentation in the NCBI RM output, as the %repeats masked overall in the two outputs (NCBI vs T2T) are very similar.

For the RM NCBI and T2T runs, the NCBI parameters were as follows

(see: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_rm.run):

RepeatMasker version 4.1.0, sensitive mode
run with blastp version 2.0MP-WashU [01-Jan-2006] [linux24-i786-ILP32F64 2006-01-02T05:13:21]
RepeatMasker Combined Database: Dfam_3.1

$ RepeatMasker -engine wublast -species 'homo sapiens' -s -no_is -cutoff 255 CHM13_genome.fa

Our parameters were similar, but run with different versions of RM and Dfam:

Run #1:

RepeatMasker version 4.1.2-p1, sensitive mode, with the NCBI/RMBLAST [2.9.0+],
Dfam 3.3 + RepBase.

$ RepeatMasker -s -species “human” CHM13_genome.fa -engine rmblast

Run #2:

$ RepeatMasker -s -lib newRepeats.fa -engine rmblast CHM13_genome.fa

Running the RM program with different engines and slightly different parameter settings (i.e., -species vs. -lib) will lead to slight discrepancies. In addition, the version of RM was different between the two runs, and the RM was run on a cluster with a specialized Nextflow RM script. Also, RM can run slightly differently (emphasis on slight) every time it is updated.

We also used a different approach to combine the RM output of two different RM runs. This is because as part of the T2T repeat analysis, there were composite repeats discovered. Like the SVA element, which contains an Alu-like sequence, there can be false positives, e.g., an Alu element can be annotated as an SVA, and vice versa. Similarity, the composite repeats contained fragments of LTR, DNA, SINE, and/or LINE elements, and thus a specialized pipeline was designed to combine the output of two RM runs. The first RM run was completed on an unmasked genome, with the aforementioned settings. The second RM run was completed on a hard-masked genome based on the output of the first RM run with the aforementioned settings.

RM also takes pains to calculate the %GC content for each portion of the genome in order to accurately annotate repeats. Because the second RM run was completed on a hard-masked genome, in contrast to the one total run of RM for the NCBI output, there is a possibility that the %GC content changes enough to make slight changes to the annotated repeats.

Answer 9 · 2022-08-23T20:26:37.000Z

Hello, team! Thanks for looking into the issue. What I noticed is that with the same criteria, I found a total of 268 L1s that are young (low sequence divergence) and have full ORF1 and ORF2 coding capacity based on the NCBI RM list, and this number is 261. After converting the sequence IDs to the standard chrIDs, I found 9 extra entries in the ncbi list and 2 extra entries in the UCSC list. Since these are full length (close to full length) L1s, fragmentation is unlikely a cause for the difference. I’m also surprised to see the two extra entries in the shorter UCSC list. ***@***.*** InactL1]$ bedtools intersect -v -b UCSC_CHM13.intact1.bed -a ncbi_CHM13.intactL1.bed chr4 57051919 57057951 LINE/L1/L1PA2|2.300|125|6142|(13)|339|1177 6033 + chr4 83274117 83280139 LINE/L1/L1HS|1.400|126|6153|(2)|339|1178 6023 + chr5 19164174 19170197 LINE/L1/L1PA2|1.700|125|6154|(1)|358|1036 6024 + chr5 147145382 147151432 LINE/L1/L1HS|1.400|130|6155|(0)|358|1276 6051 + chr6 39279101 39285077 LINE/L1/L1PA2|1.800|175|6152|(3)|339|1120 5977 + chr6 142752029 142758058 LINE/L1/L1PA2|2.000|124|6155|(0)|305|1177 6030 + chr13 73648230 73654258 LINE/L1/L1HS|1.800|130|6154|(1)|339|1033 6029 + chr16 59840104 59846154 LINE/L1/L1HS|1.600|127|6155|(0)|339|1161 6051 + chrX 104905227 104911260 LINE/L1/L1HS|1.500|125|6155|(0)|339|1011 6034 + ***@***.*** InactL1]$ bedtools intersect -v -a UCSC_CHM13.intact1.bed -b ncbi_CHM13.intactL1.bed chr2 188612414 188618384 L1PA3|5971|0|+|NA|339|1218 chr5 28904258 28910283 L1PA2|6026|0|+|NA|344|1008 While trying to see what’s in the UCSC version around chr2 188.6Mb region, I found the following: two repeating entries with the L1 annotation as in the UCSC list (repetition also found for other entries). This a good part of the differences in number, and this was demonstrated by the line counting after removing the duplicates. 27855 1.800 0.100 0.000 NC_060926.1 188612415 188618384 (54078368) + L1PA3 LINE/L1 179 6155 (0) 692448 27855 1.800 0.100 0.000 NC_060926.1 188612415 188618384 (54078368) + L1PA3 LINE/L1 179 6155 (0) 692448 ***@***.*** InactL1]$ wc -l ../GCF_009914755.1_T2T-CHM13v2.0_rm.out 11369721 ../GCF_009914755.1_T2T-CHM13v2.0_rm.out ***@***.*** InactL1]$ uniq ../GCF_009914755.1_T2T-CHM13v2.0_rm.out |wc -l 5684862 However, I’m still puzzled by the fact that the L1PA3 entry on chr2 from NCBI was not identified as intact L1, and the only possible reason for this is the sequence difference. So I’ll need to compare the sequences between the two versions. In the meantime, it would be great if you get to do some analysis and share the finding with me. Thanks, Ping On Aug 22, 2022, at 5:47 PM, JMStorer ***@***.******@***.***>> wrote: @pliang64<https://github.com/pliang64> Greetings! I am working on a full response as we speak. I was hoping to have one last piece of information in order to respond to all of your points. Could you please attach a file of the locations of the aforementioned missing LINE elements? — Reply to this email directly, view it on GitHub<#66 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AC3BS5WLMHINZJBYFD3EZPLV2PYQRANCNFSM56VLJIJQ>. You are receiving this because you were mentioned.Message ID: ***@***.***>

Answer 10 · 2022-08-23T21:49:33.000Z

@pliang64

I analyzed the insertions you indicated in list analysis above, specifically for the 9 insertions that you indicated were missing from the T2T output. After searching the original T2T RepeatMasker output, I have found all 9 of the loci in the T2T data (see attached:
NCBI_vs_T2T_LINE.xlsx)

It is possible that some insertions/TE annotations will differ between the NCBI version and the T2T version due to the differences in search engine and general approach/pipeline applied (see previous reply for details). For example, the chr2 L1PA3 you indicated was missing the NCBI dataset (present in T2T; missing in NCBI) shows an Alu at the same location in NCBI.

I cannot attach large files (original RepeatMasker .out file output) to this message. However, the data should be publicly available for download for both the T2T and NCBI data.

The bigBed for the T2T might have additional merging that is separated in the .out file as compared to the NCIB bigBed file. We will look into this as well.

Answer 11 · 2022-08-23T21:53:21.000Z

Thanks Jessica!

Ping, FYI, the original T2T RepeatMasker .out file is here: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RepeatMasker_4.1.2p1.2022Apr14.out

Please use this .out file until we have a better understanding of the .bb file.

Answer 12 · 2022-08-23T23:32:57.000Z

I looked at the " 9 extra entries in the ncbi list" mentioned above, and all of them are in the UCSC CHM13 browser. This track is from this bigBed

https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.bb

So I am puzzled by what you are seeing

Answer 13 · 2022-08-31T15:32:14.000Z

Hi Mark, If you missed my earlier post, what I found is that the huge discrepancies in number between the NCBI and UCSC RM list is due to the large number of duplicated entries in the NCBI list with the UCSC list not missing any entries. I have yet to confirm my suspicion of minor sequence variations (SNVs) responsible for detecting a few different entries of intact L1s between the two genome sequences. Thanks, Ping On Aug 23, 2022, at 7:33 PM, Mark Diekhans ***@***.******@***.***>> wrote: I looked at the " 9 extra entries in the ncbi list" mentioned above, and all of them are in the UCSC CHM13 browser. This track is from this bigBed https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.bb So I am puzzled by what you are seeing — Reply to this email directly, view it on GitHub<#66 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AC3BS5T43RIEXV5JZL6SXTTV2VNTLANCNFSM56VLJIJQ>. You are receiving this because you were mentioned.Message ID: ***@***.***>