Generating kmers from annotation files
drtamermansour opened this issue · 0 comments
Annotation files include GFF, GTF and BED files. We can use any of these files to generate k-mers in 3 main scenarios:
- If these files annotate transcriptomes: We can use gffread (for GFF or GTF files) or getfasta from bedtools (for GFF or BED files). Note: getfasta in bedtools has 2 related arguments (-split and -rna). We need to examine their effect carefully.
- If the user does not want splicing to happen:
a. If we have a BED file that annotates genomic blocks: getfasta from bedtools is straightforward.
b. If we have a transcriptome annotation file but the user needs each exon as a separate entry: We need to convert the GFF or GTF to BED, then we can use getfasta from bedtools as in (a).
## gffread can convert GFF to GTF
gffread example.gff -T -o example.gtf
## UCSC_kent_commands has a binary tool to convert gtf to GenePred format
wget https://github.com/drtamermansour/horse_trans/raw/master/scripts/UCSC_kent_commands/gtfToGenePred
chmod +x gtfToGenePred
./gtfToGenePred example.gtf example.gpred
## I have a script (I do not remember where I got it) that converts GenePred to a BED file
wget https://raw.githubusercontent.com/drtamermansour/horse_trans/master/scripts/genePredToBed
chmod +x genePredToBed
cat example.gpred | ./genePredToBed > example.bed
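One step may still be missing before getfasta can return each exon separately. Assuming the resulting example.bed is in BED12 format (one line per transcript, with exons encoded in the blockCount, blockSizes and blockStarts columns), getfasta without -split would extract the whole transcript span, introns included. A minimal sketch of splitting each record into one BED6 line per exon is shown below. The function name `bed12_to_exons` and the `_exonN` naming scheme are hypothetical, chosen only for illustration; bedtools also ships a bed12tobed6 subcommand that appears to do the same job.

```shell
# Hypothetical helper (not part of bedtools or the linked scripts):
# read BED12 on stdin, write one BED6 line per exon block on stdout.
# Exons are numbered in genomic order, so on the minus strand
# "exon1" is the last exon of the transcript.
bed12_to_exons() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = $10 + 0                 # blockCount
    split($11, sizes, ",")      # blockSizes (may end with a comma)
    split($12, starts, ",")     # blockStarts, relative to chromStart
    for (i = 1; i <= n; i++) {
      s = $2 + starts[i]
      print $1, s, s + sizes[i], $4 "_exon" i, $5, $6
    }
  }'
}

# Assumed usage (file names are placeholders):
#   bed12_to_exons < example.bed > example.exons.bed
```

The per-exon BED file could then be passed to getfasta as in (a).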
- If we have a transcriptome annotation file but the user needs to generate k-mers from non-exonic structures (e.g. introns, upstream sequences, downstream sequences, exon-exon junctions): We can transform the annotation files to BED files, then we need to create a simple script to transform this transcriptome BED file into another BED file that represents the target loci of the user.
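As a rough illustration of such a script, the sketch below derives intron coordinates from a transcriptome BED file. It assumes the input is BED12 (exons stored in the blockCount, blockSizes and blockStarts columns); the function name `bed12_to_introns` and the `_intronN` naming scheme are hypothetical. Each intron is taken as the gap between two consecutive exon blocks of the same transcript.

```shell
# Hypothetical helper: read BED12 on stdin, write one BED6 line per
# intron (the gap between consecutive exon blocks) on stdout.
# Introns are numbered in genomic order, regardless of strand.
bed12_to_introns() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = $10 + 0                 # blockCount
    split($11, sizes, ",")      # blockSizes
    split($12, starts, ",")     # blockStarts, relative to chromStart
    for (i = 1; i < n; i++) {
      intron_start = $2 + starts[i] + sizes[i]   # end of exon i
      intron_end = $2 + starts[i + 1]            # start of exon i+1
      if (intron_end > intron_start) {
        print $1, intron_start, intron_end, $4 "_intron" i, $5, $6
      }
    }
  }'
}

# Assumed usage (file names are placeholders):
#   bed12_to_introns < example.bed > example.introns.bed
```

Single-exon transcripts produce no output. Upstream and downstream regions could probably be derived in a similar way, or with the flank subcommand of bedtools; exon-exon junctions would need a separate transformation that is not sketched here.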