Question on the gff file of the inserted sequence
ggstatgen opened this issue · 1 comments
Hi guys
Thanks for developing reform - sounds like an awesome tool. I have just stumbled on it and am trying to figure out if this could be what I need for an analysis I need to perform.
Essentially, I need to create a modified mouse chromosome where the exon of a gene has been replaced by a stop cassette to knock out the gene. The insertion will contain additional sequence at the 5' and 3' ends of the stop cassette. I do have the full insertion sequence in FASTA format.
My purpose is to obtain a 'custom' mouse chromosome which includes the above deactivated gene sequence. It seems your tool is ideal for doing this, however I'm a bit unclear on the meaning of one of the arguments you request in order for the program to run, namely --in_gff. What should this file contain in my case?
In my understanding, if my insertion sequence was, say, 3 Mb long and contained several genes, the gff would contain the absolute coordinates of the genes/exons/transcripts/TSSs/TTSs in this 3 Mb fasta sequence (where by absolute I mean the first nucleotide in the inserted sequence is at position 0).
Here's a concrete example. Let's say the novel sequence to insert contains only one gene, for example Pax6, described by the Gencode gff3 catalogue as follows
chr2 ENSEMBL transcript 105675513 105697361 . + .
chr2 ENSEMBL exon 105675513 105675649 . + .
chr2 ENSEMBL exon 105675744 105675972 . + .
chr2 ENSEMBL exon 105679810 105679889 . + .
chr2 ENSEMBL exon 105680247 105680307 . + .
chr2 ENSEMBL CDS 105680298 105680307 . + 0
chr2 ENSEMBL start_codon 105680298 105680300 . + 0
chr2 ENSEMBL exon 105683828 105683958 . + .
chr2 ENSEMBL CDS 105683828 105683958 . + 2
chr2 ENSEMBL exon 105684751 105684792 . + .
chr2 ENSEMBL CDS 105684751 105684792 . + 0
chr2 ENSEMBL exon 105684887 105685102 . + .
chr2 ENSEMBL CDS 105684887 105685102 . + 0
chr2 ENSEMBL exon 105685778 105685943 . + .
chr2 ENSEMBL CDS 105685778 105685943 . + 0
chr2 ENSEMBL exon 105691565 105691723 . + .
chr2 ENSEMBL CDS 105691565 105691723 . + 2
chr2 ENSEMBL exon 105692194 105692276 . + .
chr2 ENSEMBL CDS 105692194 105692276 . + 2
chr2 ENSEMBL exon 105692470 105692620 . + .
chr2 ENSEMBL CDS 105692470 105692620 . + 0
chr2 ENSEMBL exon 105692737 105692852 . + .
chr2 ENSEMBL CDS 105692737 105692852 . + 2
chr2 ENSEMBL exon 105695306 105695456 . + .
chr2 ENSEMBL CDS 105695306 105695456 . + 0
chr2 ENSEMBL exon 105696270 105697361 . + .
chr2 ENSEMBL CDS 105696270 105696355 . + 2
chr2 ENSEMBL stop_codon 105696353 105696355 . + 0
chr2 ENSEMBL five_prime_UTR 105675513 105675649 . + .
chr2 ENSEMBL five_prime_UTR 105675744 105675972 . + .
chr2 ENSEMBL five_prime_UTR 105679810 105679889 . + .
chr2 ENSEMBL five_prime_UTR 105680247 105680297 . + .
chr2 ENSEMBL three_prime_UTR 105696356 105697361 . + .
(note I'm only showing the first 8 columns of the gff for clarity here).
Given the above, how would I go about creating a suitable gff input file for reform? Would I need to use a tool to manually annotate the exons/UTRs in my novel fasta (e.g. MAKER) and pass the resulting gff to reform? Or something else entirely? Apologies if I'm missing something obvious.
Hello,
You would simply need to adjust the coordinates of the gff above so that they are relative to the insertion sequence.
For example, if the transcript defined above started at position 1 (the first nucleotide) of your inserted sequence, the first 2 gff lines would look like this:
chr2 ENSEMBL transcript 1 21849 . + .
chr2 ENSEMBL exon 1 137 . + .
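The same shift can be applied to every line with a small script. The following is only a minimal sketch, not part of reform: the function name `rebase_gff` and the `origin` parameter are hypothetical, and it assumes the gff is tab-separated and that the inserted sequence is a contiguous, forward-strand copy of the genomic region, so that `new = old - origin + 1` holds for both the start and end columns.

```python
def rebase_gff(lines, origin):
    """Shift GFF start/end (columns 4 and 5) so that `origin` becomes position 1.

    `origin` is the genomic coordinate that corresponds to the first
    nucleotide of the inserted sequence (an assumption of this sketch).
    """
    rebased = []
    for line in lines:
        line = line.rstrip("\n")
        # Pass blank lines, comments and ## directives through unchanged.
        if not line.strip() or line.startswith("#"):
            rebased.append(line)
            continue
        fields = line.split("\t")
        start, end = int(fields[3]), int(fields[4])
        # GFF coordinates are 1-based and inclusive, hence the +1.
        fields[3] = str(start - origin + 1)
        fields[4] = str(end - origin + 1)
        rebased.append("\t".join(fields))
    return rebased


example = [
    "chr2\tENSEMBL\ttranscript\t105675513\t105697361\t.\t+\t.",
    "chr2\tENSEMBL\texon\t105675513\t105675649\t.\t+\t.",
]
for row in rebase_gff(example, origin=105675513):
    print(row)
```

Run on the first two lines of the Pax6 annotation, this reproduces the two rebased lines shown above. If the transcript does not start at the first nucleotide of the insertion (for instance because of flanking sequence), `origin` would need to be set so that the first inserted nucleotide maps to 1, and any feature that falls in a part of the insertion that is not a straight copy of the reference would have to be adjusted by hand.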