gencorefacility/reform

Question on the gff file of the inserted sequence

ggstatgen opened this issue · 1 comments

Hi guys

Thanks for developing reform - sounds like an awesome tool. I have just stumbled on it and am trying to figure out if this could be what I need for an analysis I need to perform.

Essentially, I need to create a modified mouse chromosome where the exon of a gene has been replaced by a stop cassette to knock out the gene. The insertion will contain additional sequence at the 5' and 3' ends of the stop cassette. I do have the full insertion sequence in fasta format.

My purpose is to obtain a 'custom' mouse chromosome which includes the above deactivated gene sequence. Your tool seems ideal for this, but I'm a bit unclear on the meaning of one of the arguments the program requires, namely --in_gff. What should this file contain in my case?

In my understanding, if my insertion sequence was, say, 3 Mb long and contained several genes, the gff would contain the absolute coordinates of the genes/exons/transcripts/TSSs/TTSs in this 3 Mb fasta sequence (where by absolute I mean the first nucleotide in the inserted sequence is at position 0).

Here's a concrete example. Let's say the novel sequence to insert contains only one gene, for example Pax6, described by the Gencode gff3 catalogue as follows:

chr2	ENSEMBL	transcript	105675513	105697361	.	+	.
chr2	ENSEMBL	exon	105675513	105675649	.	+	.
chr2	ENSEMBL	exon	105675744	105675972	.	+	.
chr2	ENSEMBL	exon	105679810	105679889	.	+	.
chr2	ENSEMBL	exon	105680247	105680307	.	+	.
chr2	ENSEMBL	CDS	105680298	105680307	.	+	0
chr2	ENSEMBL	start_codon	105680298	105680300	.	+	0
chr2	ENSEMBL	exon	105683828	105683958	.	+	.
chr2	ENSEMBL	CDS	105683828	105683958	.	+	2
chr2	ENSEMBL	exon	105684751	105684792	.	+	.
chr2	ENSEMBL	CDS	105684751	105684792	.	+	0
chr2	ENSEMBL	exon	105684887	105685102	.	+	.
chr2	ENSEMBL	CDS	105684887	105685102	.	+	0
chr2	ENSEMBL	exon	105685778	105685943	.	+	.
chr2	ENSEMBL	CDS	105685778	105685943	.	+	0
chr2	ENSEMBL	exon	105691565	105691723	.	+	.
chr2	ENSEMBL	CDS	105691565	105691723	.	+	2
chr2	ENSEMBL	exon	105692194	105692276	.	+	.
chr2	ENSEMBL	CDS	105692194	105692276	.	+	2
chr2	ENSEMBL	exon	105692470	105692620	.	+	.
chr2	ENSEMBL	CDS	105692470	105692620	.	+	0
chr2	ENSEMBL	exon	105692737	105692852	.	+	.
chr2	ENSEMBL	CDS	105692737	105692852	.	+	2
chr2	ENSEMBL	exon	105695306	105695456	.	+	.
chr2	ENSEMBL	CDS	105695306	105695456	.	+	0
chr2	ENSEMBL	exon	105696270	105697361	.	+	.
chr2	ENSEMBL	CDS	105696270	105696355	.	+	2
chr2	ENSEMBL	stop_codon	105696353	105696355	.	+	0
chr2	ENSEMBL	five_prime_UTR	105675513	105675649	.	+	.
chr2	ENSEMBL	five_prime_UTR	105675744	105675972	.	+	.
chr2	ENSEMBL	five_prime_UTR	105679810	105679889	.	+	.
chr2	ENSEMBL	five_prime_UTR	105680247	105680297	.	+	.
chr2	ENSEMBL	three_prime_UTR	105696356	105697361	.	+	.

(note I'm only showing the first 8 columns of the gff for clarity here).

Given the above, how would I go about creating a suitable gff input file for reform? Would I need to use an annotation tool (e.g. MAKER) to annotate the exons/UTRs in my novel fasta and pass the resulting gff to reform? Or something else entirely? Apologies if I'm missing something obvious.

Hello,

You would simply need to adjust the coordinates of the gff above so that they are relative to the insertion sequence.

For example, if the transcript defined above started at position 1 (the first nucleotide) of your inserted sequence, the first 2 gff lines would look like this:

chr2	ENSEMBL	transcript	1	21849	.	+	.
chr2	ENSEMBL	exon	1	137	.	+	.
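
If it helps, the same adjustment can be scripted for every line rather than done by hand. The following is only a minimal sketch, not part of reform: the function name `shift_gff_to_insert` and its `insert_start` parameter are made up for illustration, and it assumes tab-separated GFF lines whose features all lie within the inserted sequence. `insert_start` is the original genomic coordinate that corresponds to the first nucleotide of the insertion.

```python
def shift_gff_to_insert(lines, insert_start):
    """Rebase GFF start/end columns so that `insert_start` becomes position 1.

    Hypothetical helper for illustration; assumes tab-separated GFF rows.
    """
    shifted = []
    for line in lines:
        line = line.rstrip("\n")
        # Pass blank lines and comment/directive lines through unchanged.
        if not line.strip() or line.startswith("#"):
            shifted.append(line)
            continue
        fields = line.split("\t")
        # Columns 4 and 5 (indices 3 and 4) hold the 1-based start and end.
        fields[3] = str(int(fields[3]) - insert_start + 1)
        fields[4] = str(int(fields[4]) - insert_start + 1)
        shifted.append("\t".join(fields))
    return shifted


# The first two lines from the question, with the transcript starting
# at the first nucleotide of the inserted sequence.
example = [
    "chr2\tENSEMBL\ttranscript\t105675513\t105697361\t.\t+\t.",
    "chr2\tENSEMBL\texon\t105675513\t105675649\t.\t+\t.",
]

for row in shift_gff_to_insert(example, insert_start=105675513):
    print(row)
```

For the two example lines this gives the coordinates 1 to 21849 and 1 to 137 shown above. If the transcript instead began some distance into your insertion, `insert_start` would be correspondingly smaller than the transcript start. Any feature that begins before `insert_start` would end up with a zero or negative coordinate, so it would need to be dropped or trimmed separately.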