dib-lab/kmerDecoder

Generating kmers from annotation files

drtamermansour opened this issue · 0 comments

Annotation files include GFF, GTF and BED files. We can use any of these files to generate k-mers in 3 main scenarios:

  1. If these files are annotations of transcriptomes: We can use gffread (for GFF or GTF files) or getfasta from bedtools (for GFF or BED files). Note: getfasta in bedtools has 2 related arguments (-split and -rna). We need to examine their effect carefully.

  2. If the user does not want splicing to happen:
    a. If we have a BED file that annotates genomic blocks: getfasta from bedtools is straightforward.
    b. If we have a transcriptome annotation file but the user needs each exon as a separate entry: We need to convert the GFF or GTF to BED, then we can use getfasta from bedtools as in (a).

## gffread can convert GFF to GTF  
gffread example.gff  -T -o example.gtf

##  UCSC_kent_commands has a binary tool to convert gtf to GenePred format 
wget https://github.com/drtamermansour/horse_trans/raw/master/scripts/UCSC_kent_commands/gtfToGenePred
chmod +x gtfToGenePred
./gtfToGenePred example.gtf example.gpred

## I have a script (I do not remember where I got it from) that converts GenePred to a BED file
wget https://raw.githubusercontent.com/drtamermansour/horse_trans/master/scripts/genePredToBed
chmod +x genePredToBed
cat example.gpred | ./genePredToBed > example.bed
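
If every exon should be its own BED entry, a more direct route than the GenePred conversion above may be to read the exon rows straight from the GTF. The sketch below is only an illustration under assumptions: the function name `gtf_exons_to_bed6` is hypothetical, it assumes a tab-separated GTF whose ninth column carries `transcript_id "..."`, and it writes a fixed score of 0. It is not the genePredToBed script. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so only the start is shifted by one.

```shell
# Hypothetical helper: emit one BED6 line per GTF "exon" feature.
# Reads from the files given as arguments, or from stdin if none are given.
gtf_exons_to_bed6() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { next }                      # skip GTF header/comment lines
  $3 == "exon" {
    name = "."                       # fallback when no transcript_id is present
    if (match($9, /transcript_id "[^"]+"/)) {
      # drop the 15-character prefix (transcript_id ") and the closing quote
      name = substr($9, RSTART + 15, RLENGTH - 16)
    }
    # GTF start is 1-based inclusive; BED start is 0-based
    print $1, $4 - 1, $5, name, 0, $7
  }' "$@"
}
```

For example, `gtf_exons_to_bed6 example.gtf > example.exons.bed` would give a file that can be passed to getfasta as in (a).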
  3. If we have a transcriptome annotation file but the user needs to generate k-mers from non-exonic structures (e.g. introns, upstream sequences, downstream sequences, exon-exon junctions): We can transform the annotation files to BED files, then we need to create a simple script to transform this transcriptome BED file into another BED file that represents the target loci of the user.
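
As one possible shape for the "simple script" in scenario 3, here is a minimal sketch for the intron case only. It assumes the transcriptome BED file is in 12-column BED format (blockCount, blockSizes and blockStarts in columns 10 to 12) and treats each gap between consecutive blocks as an intron. The function name `bed12_to_introns` and the `_intronN` naming are hypothetical choices for illustration. Upstream/downstream regions and exon-exon junctions would need their own rules.

```shell
# Hypothetical helper: turn BED12 transcripts into BED6 intron intervals.
# Reads from the files given as arguments, or from stdin if none are given.
bed12_to_introns() {
  awk 'BEGIN { FS = OFS = "\t" }
  NF >= 12 && $10 > 1 {              # single-block transcripts have no introns
    split($11, sizes, ",")           # blockSizes, relative lengths
    split($12, starts, ",")          # blockStarts, offsets from chromStart
    for (i = 1; i < $10; i++) {
      intron_start = $2 + starts[i] + sizes[i]   # end of block i
      intron_end   = $2 + starts[i + 1]          # start of block i+1
      if (intron_end > intron_start)
        print $1, intron_start, intron_end, $4 "_intron" i, $5, $6
    }
  }' "$@"
}
```

Usage would look like `bed12_to_introns example.bed > example.introns.bed`, after which the intron BED file can go through getfasta from bedtools in the same way as scenario 2(a).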