Requires gtfparse and tqdm, in addition to NumPy and pandas (both included with Anaconda) and the standard-library modules argparse and gzip.
This script creates a Psix-compatible annotation of cassette exons and constitutive introns directly from a GTF file. This annotation consists of a table specifying the location (chromosome, start and end) of splice junctions. Splice junctions are annotated as supporting the inclusion of a cassette exon (_I1 and _I2), supporting its exclusion (_SE), or constitutive (_CI). You can download ready-to-use mouse (mm10) and human (hg38) annotations here.
The annotation is a table file with the following format:
name | intron | event | gene |
---|---|---|---|
ENSG00000157881_NMD_1_I1 | 1:2514467-2515257:- | ENSG00000157881_NMD_1 | ENSG00000157881 |
ENSG00000157881_NMD_1_I2 | 1:2515402-2515561:- | ENSG00000157881_NMD_1 | ENSG00000157881 |
ENSG00000157881_NMD_1_SE | 1:2514467-2515561:- | ENSG00000157881_NMD_1 | ENSG00000157881 |
ENSG00000157881_ProteinCoding_1_I1 | 1:2521316-2521717:- | ENSG00000157881_ProteinCoding_1 | ENSG00000157881 |
ENSG00000157881_ProteinCoding_1_I2 | 1:2521801-2526463:- | ENSG00000157881_ProteinCoding_1 | ENSG00000157881 |
ENSG00000157881_ProteinCoding_1_SE | 1:2521316-2526463:- | ENSG00000157881_ProteinCoding_1 | ENSG00000157881 |
... | ... | ... | ... |
Each row corresponds to a splice junction (intron). The first column is the name assigned to the splice junction. It is based on the name of the gene that contains the removed intron, and on whether the junction supports the inclusion of a cassette exon (_I1 and _I2), supports its exclusion (_SE), or is a constitutive intron (_CI). The second column contains the intron coordinates in the genome, written as chromosome:start-end:strand. The third column has the name of the splicing event: for example, ENSG00000157881_ProteinCoding_1 is a protein-coding cassette exon, and it is defined by three introns: ENSG00000157881_ProteinCoding_1_I1, ENSG00000157881_ProteinCoding_1_I2 and ENSG00000157881_ProteinCoding_1_SE. The fourth column contains the gene name.
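The table layout described above can be explored with a few lines of Python. The following is a minimal sketch, not part of Psix itself: the `Intron` tuple and the `parse_intron` helper are hypothetical names introduced here, and the rows are copied from the example table. It splits the intron coordinates into their parts and groups the junctions by event.

```python
from collections import defaultdict
from typing import NamedTuple


class Intron(NamedTuple):
    """Hypothetical container for one intron coordinate string."""
    chrom: str
    start: int
    end: int
    strand: str


def parse_intron(coords: str) -> Intron:
    """Split 'chrom:start-end:strand' into its four parts."""
    chrom, span, strand = coords.rsplit(":", 2)
    start, end = span.split("-")
    return Intron(chrom, int(start), int(end), strand)


# Rows copied from the example table: (name, intron, event, gene).
rows = [
    ("ENSG00000157881_NMD_1_I1", "1:2514467-2515257:-",
     "ENSG00000157881_NMD_1", "ENSG00000157881"),
    ("ENSG00000157881_NMD_1_I2", "1:2515402-2515561:-",
     "ENSG00000157881_NMD_1", "ENSG00000157881"),
    ("ENSG00000157881_NMD_1_SE", "1:2514467-2515561:-",
     "ENSG00000157881_NMD_1", "ENSG00000157881"),
]

# Group junctions by event; the role is the suffix after the event name.
events = defaultdict(dict)
for name, intron, event, gene in rows:
    role = name[len(event) + 1:]  # "I1", "I2", "SE" or "CI"
    events[event][role] = parse_intron(intron)

nmd = events["ENSG00000157881_NMD_1"]
# In these example rows the exclusion junction starts where I1 starts
# and ends where I2 ends, i.e. it skips over the cassette exon.
spans_both = nmd["SE"].start == nmd["I1"].start and nmd["SE"].end == nmd["I2"].end
print(sorted(nmd), spans_both)
```

Grouping by the event column makes it easy to check that each cassette exon has all three of its junctions before passing the annotation to downstream tools.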
To create an annotation from a GTF file, download GTF2psix.py and run it as follows:
python GTF2psix.py --gtf annotation.gtf -o psix_annotation
--gene_name specifies the tag for gene names to use in the GTF file. For example, --gene_name gene_name will use the gene_name tag from the 9th column of the GTF file. The default is gene_id.
--gene_type_tag. Some GTF files have different tags for the gene types, e.g., gene_type or gene_biotype. Specify the tag with this option. Default: gene_type.
--transcript_type_tag. Some GTF files have different tags for the transcript types, e.g., transcript_type or transcript_biotype. Specify the tag with this option. Default: transcript_type.
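To see which tags a particular GTF file offers for these options, it can help to look at the 9th column directly. The sketch below is only an illustration of that column's key/value layout; it is not how GTF2psix.py reads the file (the script depends on gtfparse), the `gtf_attributes` helper is a hypothetical name, and the example line is made up, including the gene name "EXAMPLE".

```python
def gtf_attributes(line: str) -> dict:
    """Return the key/value tags from the 9th column of one GTF line.

    Simplified: assumes no ';' inside quoted values, and keeps only the
    last value when a key is repeated.
    """
    attr_column = line.rstrip("\n").split("\t")[8]
    attributes = {}
    for item in attr_column.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


# Made-up GTF line for illustration; only the attribute column matters here.
line = "\t".join([
    "1", "example", "gene", "100", "200", ".", "-", ".",
    'gene_id "ENSG00000157881"; gene_type "protein_coding"; gene_name "EXAMPLE";',
])

tags = gtf_attributes(line)
print(tags["gene_id"])    # tag used by default for gene names
print(tags["gene_name"])  # tag selected with --gene_name gene_name
print(tags["gene_type"])  # tag named by --gene_type_tag
```

If a file uses gene_biotype instead of gene_type, the same inspection shows it, and the matching tag name can then be passed to --gene_type_tag.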
Use the argument --gene_type to limit the annotation to a specific type of genes. E.g., for an annotation of protein coding genes only, use --gene_type protein_coding.
You can remove some chromosomes from the annotation using the argument --exclude_chromosome. E.g., --exclude_chromosome chrM,chrY will exclude chrM and chrY from the annotation.
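When the script is called from a Python pipeline, the options above can be assembled programmatically. The following is a rough sketch under two assumptions: that the documented options can be combined in a single invocation, and that GTF2psix.py sits in the working directory. `build_command` is a hypothetical helper, and it only uses the flags described in this README.

```python
import shlex


def build_command(gtf, output, gene_name=None, gene_type_tag=None,
                  transcript_type_tag=None, gene_type=None,
                  exclude_chromosomes=()):
    """Assemble the argument list for GTF2psix.py (hypothetical helper)."""
    cmd = ["python", "GTF2psix.py", "--gtf", gtf, "-o", output]
    optional = {
        "--gene_name": gene_name,
        "--gene_type_tag": gene_type_tag,
        "--transcript_type_tag": transcript_type_tag,
        "--gene_type": gene_type,
    }
    # Only add the options that were actually set, so defaults still apply.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    if exclude_chromosomes:
        # The README shows chromosomes given as a comma-separated list.
        cmd += ["--exclude_chromosome", ",".join(exclude_chromosomes)]
    return cmd


cmd = build_command(
    "annotation.gtf",
    "psix_annotation",
    gene_name="gene_name",
    gene_type="protein_coding",
    exclude_chromosomes=["chrM", "chrY"],
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the script; building it as a list avoids shell quoting problems with unusual file names.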