marbl/seqrequester

T2T-CHM13 Microsatellite repeats


The description of the T2T-CHM13 Microsatellite repeats track in the UCSC Genome Browser says: "This track represents % of simple-sequence (2-mer) repeat pattern. Sequences composed of GA/TC/GC/AT bases are counted if one of the bases repeats (e.g. AAAATTTTAAATT are counted As 13 ATs)." Can you explain this in more detail? When running seqrequester microsatellite, can you give an example of what types of microsatellites are extracted? And what is the repetition threshold set to? Thank you!

Hello @xiayunNyyl,

For T2T-CHM13, seqrequester microsatellite was run with:

seqrequester microsatellite -prefix $out/$out.microsatellite -window 128 -$p $fa

for every 2-mer microsatellite pattern $p: ga, gc, and at. ga automatically looks for its reverse complement, tc, as well. The number of bases in runs of two or more bases composed only of the pattern's bases is then collected in every 128 bp window. It is like doing homopolymer compression and then looking for consecutive regions composed only of the two bases. There is no limit set on the repetition, so a single GA surrounded by C, for example CGAC, is counted as having 2 GA bases. CGAGAC and CGAAAC both have 4 GA bases.
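To make the counting rule concrete, here is a minimal Python sketch of the idea. It is not the seqrequester implementation. The function name `pattern_bases_per_window` is hypothetical, and the sketch assumes that any run of two or more bases drawn only from the pattern's two bases (or from their complements) is counted in full, whether or not both bases appear in the run.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pattern_bases_per_window(seq, pattern, window=128):
    """Count, per window, the bases that fall inside runs made up only of
    the two pattern bases. Illustrative sketch, not seqrequester's code.

    Assumption: a run must be at least 2 bases long to be counted.
    """
    seq = seq.upper()
    bases = set(pattern.upper())
    # The reverse complement of a run over {G, A} is a run over {C, T},
    # so "ga" also covers "tc". For "gc" and "at" the set is unchanged.
    rc_bases = {COMPLEMENT[b] for b in bases}
    base_sets = [bases] if rc_bases == bases else [bases, rc_bases]

    n_windows = (len(seq) + window - 1) // window
    counts = [0] * n_windows
    for base_set in base_sets:
        run_re = re.compile("[" + "".join(sorted(base_set)) + "]{2,}")
        for match in run_re.finditer(seq):
            # Assign each base of the run to the window it sits in.
            for pos in range(match.start(), match.end()):
                counts[pos // window] += 1
    return counts


for example in ("CGAC", "CGAGAC", "CGAAAC"):
    print(example, sum(pattern_bases_per_window(example, "ga")))
```

Under these assumptions the three examples above give 2, 4, and 4 GA bases, and AAAATTTTAAATT gives 13 AT bases with the pattern `at`. Dividing each window's count by the window size and multiplying by 100 would give a percentage like the one the track description mentions.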

The actual script that was run to produce the files is here: https://github.com/arangrhie/T2T-Polish/tree/master/pattern