cmdline_iaod: A Python repository from Devlin-Moyer

This package takes a whole-genome sequence file and an annotation file (a GTF or GFF3 file in most cases) and uses them to create an SQLite database of intron annotation information. This package uniquely annotates intron class for every intron annotated in the annotation file, so it can be used to identify all U12-dependent introns in any genome with annotated introns. It also provides some wrapper functions for the sorts of SQL queries that would be useful for querying the database. If you use this, cite https://doi.org/10.1101/620658.

Argument descriptions for create_db.py:

Argument Name	Description
-g --genome	E.g. GRCh38 or hg38 for the latest version of the human genome.
-t --tax_name	E.g. Homo_sapiens; be sure to use snake_case or CamelCase.
-c --common_name	E.g. human or rhesus_macaque; be sure to use snake_case or CamelCase.
-a --annotation	Path to a gtf or gff3 file containing annotation information for the genome of interest. This pipeline was built using ones from Ensembl, but should work on any annotation files.
-s --sequence	Path to a whole-genome FASTA file.
-gs --gene_symbols	If your genome has annotation information on Ensembl, you can pull gene names out of Biomart if you know which Biomart division (default, GRCh37, plants, metazoa, fungi) the genome is in by providing the name of the biomart division as the value to this argument (Googling the genome name and the word "ensembl" is the fastest way to find out which division it's in. If your genome is not annotated in Biomart and you still want to be able to search for introns using their gene, you can provide a path to a tab-delimited file with gene IDs (as annotated in the annotation file you provide) in the first column and gene symbols in the second column. If you don't want to be able to search for introns by gene names and/or don't want to deal with the Biomart thing, leave this blank.

Argument descriptions for search_functions.py:

Argument Name	Description
-u12 --U12_search	Single string that will be tokenized and used as input to a full-text search of the U12-type intron annotation information. Generally works best if you give it gene names/symbols, terminal dinucleotides, or Ensembl gene or transcript ids.
-g --genomes	One or more genome assembly names (e.g. GRCh38 for human; whatever you used when building the database), separated by commas.
-gid --gene_id	Ensembl gene ID; separate multiple IDs with commas.
-tid --transcript_id	Ensembl transcript ID; separate multiple IDs with commas.
-gs --gene_symbol	Gene symbol or name (as annotated by Ensembl); separate multiple names with commas.
-c --intron_class	U12-type or U2-type.
-p --phase	0 if intron is between two codons, 1 if between first and second nucleotides of one codon, 2 if between second and third nucleotides.
-l --length	Intron length in nt/bp; will match exactly.
-lmin --min_length	Min intron length in nt/bp.
-lmax --max_length	Max intron length in nt/bp.
-r --rank	Intron rank in transcript; e.g. the first intron in a transcript is rank 1, the second is rank 2, etc. Matches exactly.
-rmin --min_rank	Min intron rank.
-rmax --max_rank	Max intron rank.
-chr --chromosome	Chromosome containing intron.
-s --strand	Strand containing intron (usually + or -, but maybe 1 or -1 depending on where your gtf came from).'
-b --start	Start coordinate of intron in genome (b stands for beginning). Matches exactly.
-bmin --min_start	Min intron start coordinate.
-bmax --max_start	Max intron start coordinate.
-e --start	Stop coordinate of intron in genome (e stands for end). Matches exactly.
-emin --min_stop	Min intron stop coordinate.
-emax --max_stop	Max intron stop coordinate.
-td --terminal_dinucleotides	First two and last two nucleotides of the intron; most are GT-AG, a large minority of U12-types are AT-AC, and others are generally quite rare. If specifying multiple sets of terminal dinucleotides, separate them with commas. Do not neglect the hyphen in the middle.

Devlin-Moyer/cmdline_iaod