nerettilab/RepEnrich2

About repeatmasker.bed file in human

Closed this issue · 3 comments

I downloaded a RepeatMasker file from UCSC and noticed that the repeats in it differ from those in the bed file you provide (https://drive.google.com/drive/folders/0B8_2gE04f4QWNmdpWlhaWEYwaHM). The UCSC file contains 1395 genes, the file you provide contains 1116 genes, and 1005 genes are shared between the two. How can I get the gene family and gene class for human?

Thanks for your time

Hi there,

Thank you for your interest in RepEnrich2. I believe the bedfile provided has simple and low-complexity repeats removed, which likely explains the difference in elements between the lists. If you'd like to run RepEnrich2 with your own list of elements from RepeatMasker, there are steps provided in the section of the tutorial labeled "Attain repetitive element annotation". RepEnrich2 supports custom bedfiles; just be sure to remove the header from the UCSC file, and make sure the file is tab-delimited and adheres to the following format:

Column 1: Chromosome
Column 2: Start
Column 3: End
Column 4: Repeat_name
Column 5: Class
Column 6: Family

The last two columns can be populated by placeholder values if you do not have information for them.
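As a rough illustration of the format above, here is a minimal Python sketch that converts a UCSC RepeatMasker table download into a headerless, tab-delimited 6-column file. It is not part of RepEnrich2. It assumes the download still has its header line and uses the usual UCSC rmsk column names (genoName, genoStart, genoEnd, repName, repClass, repFamily); the function name `rmsk_to_bed`, the "NA" placeholder, and the choice to skip the Simple_repeat and Low_complexity classes are my own assumptions, so adjust them to match your file.

```python
# Class labels assumed to mark simple and low-complexity repeats in the UCSC table.
SKIP_CLASSES = {"Simple_repeat", "Low_complexity"}


def rmsk_to_bed(rmsk_path, bed_path, skip_classes=SKIP_CLASSES, placeholder="NA"):
    """Write chrom, start, end, repeat name, class, family; return rows written."""
    written = 0
    with open(rmsk_path) as src, open(bed_path, "w") as dst:
        # The first line is assumed to be the header (UCSC prefixes it with '#').
        header = src.readline().lstrip("#").rstrip("\r\n").split("\t")
        col = {name: i for i, name in enumerate(header)}
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < len(header):
                continue  # skip blank or truncated lines
            # Fall back to a placeholder when class or family is empty.
            rep_class = fields[col["repClass"]] or placeholder
            rep_family = fields[col["repFamily"]] or placeholder
            if rep_class in skip_classes:
                continue
            dst.write("\t".join([
                fields[col["genoName"]],
                fields[col["genoStart"]],
                fields[col["genoEnd"]],
                fields[col["repName"]],
                rep_class,
                rep_family,
            ]) + "\n")
            written += 1
    return written
```

For example, `rmsk_to_bed("hg38_rmsk.txt", "hg38_repeats.bed")` would write the filtered file (both file names are hypothetical). Pass `skip_classes=()` to keep every element.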

Best,
Nick

Just one more question.

How do you get the fasta file? Does the bed file cover all of the bases in your fasta file, or does your fasta file also contain sequence upstream and downstream of the bed file regions?

Thanks very much for your help.

Best wishes
Yuanyuan

If you are talking about the fasta file for the reference genome, it should be available from various online sources such as UCSC, Ensembl, or NCBI (make sure to index it using the 'bowtie2-build' command). If you are referring to the fasta file to use for the run, it should be RNA-seq data. The bed file is simply an annotated list of genomic coordinates (chr, start, end) for the features of interest (i.e., repetitive elements).

Best,
Nick