A set of scripts to convert GenBank into GTF format. The scripts presented here run in series to prepare the [Cat genome](http://www.ncbi.nlm.nih.gov/genome/78) annotation in GTF format from NCBI's GenBank format. They can be applied to other species whose genome annotation is not available in GTF but only in GenBank format, one file per chromosome.
Run them in this order:
- batch_download_genbank.pl
  - Downloads the GenBank record for each chromosome.
- batch_annotation_mRNA.pl
  - Wraps genbank2gtf_mRNA.pl.
  - Writes the exon records to the GTF file.
- batch_annotation_CDS.pl
  - Wraps genbank2gtf_CDS.pl.
  - Writes the CDS records to the GTF file. A chromosome's GenBank file does not say which protein (protein_id) comes from which transcript (transcript_id), so the next step downloads a separate GenBank file for each protein id to recover that relationship.
- batch_download_genbank_protein.pl
  - Downloads the GenBank record for each protein id.
- get_final_cat_gtf.pl
  - Merges the mRNA and CDS GTF files, filling in the transcript for each protein.
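
The order above could be driven from a single script. The following is a minimal sketch, not part of the repository: `run_pipeline` and the dry-run runner are hypothetical names, and it assumes all five scripts sit in the current directory and take no arguments, as their usage lines suggest.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The five top-level scripts, in the order listed above.
my @STEPS = qw(
    batch_download_genbank.pl
    batch_annotation_mRNA.pl
    batch_annotation_CDS.pl
    batch_download_genbank_protein.pl
    get_final_cat_gtf.pl
);

# Run each step with $runner, a code ref that takes a script name and
# returns an exit status (0 means success). Stops at the first failure
# and returns the list of steps that completed.
sub run_pipeline {
    my ($steps, $runner) = @_;
    my @done;
    for my $script (@$steps) {
        my $status = $runner->($script);
        if ($status != 0) {
            warn "$script failed with status $status; stopping\n";
            last;
        }
        push @done, $script;
    }
    return @done;
}

# Dry run: only print the commands that would be executed.
my @done = run_pipeline(\@STEPS, sub { print "perl $_[0]\n"; return 0 });
printf "%d of %d steps completed\n", scalar(@done), scalar(@STEPS);
```

To actually execute the scripts, the dry-run runner can be replaced with `sub { system('perl', $_[0]) }`. Stopping at the first failure matters here because each step reads the files written by the one before it.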
perldoc batch_download_genbank.pl
BATCH_DOWNLOAD_GENBANK(1) User Contributed Perl Documentation BATCH_DOWNLOAD_GENBANK(1)
SYNOPSIS
    Use NCBI E-utilities (efetch) to download each chromosome's annotation
    of the Cat genome in GenBank format.
USAGE
    perl batch_download_genbank.pl
RESULTS
    20 ChrXX.gb files are written to the current working directory.
perl v5.12.3 2012‐12‐19 BATCH_DOWNLOAD_GENBANK(1)
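
For readers adapting this step to another species, here is a minimal sketch of how an efetch request for one chromosome record can be built. It is not the script's actual code: `efetch_url` is a hypothetical helper, and the accession and output name are placeholders to be replaced with real values.

```perl
use strict;
use warnings;

my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Build an efetch URL that returns one nucleotide record as GenBank flat text.
# 'gbwithparts' is an alternative rettype that also includes the sequence.
sub efetch_url {
    my ($accession) = @_;
    my %param = (
        db      => 'nuccore',
        id      => $accession,
        rettype => 'gb',
        retmode => 'text',
    );
    my $query = join '&', map { $_ . '=' . $param{$_} } sort keys %param;
    return $EFETCH . '?' . $query;
}

# Hypothetical mapping from output name to accession; fill in real values.
my %chromosome = (ChrA1 => 'NC_XXXXXX.1');

for my $name (sort keys %chromosome) {
    print "$name.gb <- ", efetch_url($chromosome{$name}), "\n";
}
```

The URL can then be fetched with any HTTP client, for example `getstore($url, "$name.gb")` from LWP::Simple.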
perldoc genbank2gtf_mRNA.pl
GENBANK2GTF_MRNA(1) User Contributed Perl Documentation GENBANK2GTF_MRNA(1)
SYNOPSIS
    Given a GenBank format file (.gb), parse its feature table (the mRNA
    features, to get exon regions) to extract information such as
    transcript id and gene name, and store the result in GTF format.
USAGE
    perl genbank2gtf_mRNA.pl input.gb chromosome >chromosome_mRNA.gtf
RESULTS
    The output is a file named chromosome_mRNA.gtf, where chromosome is
    the value passed as the second argument.
EXON REGION PARSE
  CODE
        # $f is the feature text being parsed; $strand is set elsewhere in the script.
        if ($f =~ /(complement\()?(join\()?(\d+[\d\.\n,> ]+\d)(\))?/) {
            my $tmp = $3;
            $tmp =~ s/[\s>]//g;                 # drop whitespace, newlines and ">" markers
            my @array = split ",", $tmp;        # one entry per segment
            my (@start, @end);
            foreach my $s (@array) {
                if ($s =~ /(\d+)\.\.(\d+)/) {   # range written as start..end
                    push @start, $1;
                    push @end,   $2;
                } else {                        # single position N, recorded as N..N+1
                    push @start, $s;
                    push @end,   $s + 1;
                }
            }
            # Segments are listed in ascending order; reverse them on the minus strand.
            if ($strand eq "-") {
                @start = reverse @start;
                @end   = reverse @end;
            }
        }
  EXAMPLE
        mRNA            complement(join(102429..103045,104811..104942,
                        105561..105643,105732..105835,105910..106035))
                        /gene="CCDC115"
                        /product="coiled-coil domain containing 115"
                        /note="Derived by automated computational analysis using
                        gene prediction method: GNOMON. Supporting evidence
                        includes similarity to: 4 ESTs, 1 Protein"
                        /transcript_id="XM_003980242.1"
                        /db_xref="GI:410947098"
                        /db_xref="GeneID:101080349"
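
Once the start and end arrays are filled, each exon becomes one GTF record with nine tab-separated columns. The sketch below illustrates that step using the coordinates from the CCDC115 example. The chromosome name, the source label, the choice of attributes and the `gtf_line` helper are assumptions made for illustration; the script's real output layout may differ.

```perl
use strict;
use warnings;

# Format one GTF record: seqname, source, feature, start, end, score,
# strand, frame, attributes. Score and frame are left as "." here.
sub gtf_line {
    my (%f) = @_;
    my $attr = join ' ', map { $_->[0] . ' "' . $_->[1] . '";' } @{ $f{attributes} };
    return join "\t", $f{seqname}, $f{source}, $f{feature},
        $f{start}, $f{end}, '.', $f{strand}, '.', $attr;
}

# Exon coordinates from the example mRNA feature, already reversed into
# transcript order because the feature lies on the minus strand.
my @start = (105910, 105732, 105561, 104811, 102429);
my @end   = (106035, 105835, 105643, 104942, 103045);

my @lines;
for my $i (0 .. $#start) {
    push @lines, gtf_line(
        seqname    => 'chrA1',     # assumed chromosome name
        source     => 'genbank',   # assumed source label
        feature    => 'exon',
        start      => $start[$i],
        end        => $end[$i],
        strand     => '-',
        attributes => [
            [gene_id       => 'CCDC115'],
            [transcript_id => 'XM_003980242.1'],
            [exon_number   => $i + 1],
        ],
    );
}
print "$_\n" for @lines;
```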
perldoc batch_annotation_mRNA.pl
BATCH_ANNOTATION_MRNA(1) User Contributed Perl Documentation BATCH_ANNOTATION_MRNA(1)
SYNOPSIS
    A wrapper around genbank2gtf_mRNA.pl that batch converts GenBank
    files into GTF (mRNA features only).
USAGE
    perl batch_annotation_mRNA.pl
RESULTS
    20 chrXX_mRNA.gtf files
perldoc genbank2gtf_CDS.pl
GENBANK2GTF_CDS(1) User Contributed Perl Documentation GENBANK2GTF_CDS(1)
SYNOPSIS
    Given a GenBank format file (.gb), parse its feature table (the CDS
    features, to get CDS regions) to extract information such as
    transcript id and gene name, and store the result in GTF format.
USAGE
    perl genbank2gtf_CDS.pl input.gb chromosome >chromosome_CDS.gtf
RESULTS
    The output is a file named chromosome_CDS.gtf, where chromosome is
    the value passed as the second argument.
EXON REGION PARSE
  CODE
        # $f is the feature text being parsed; $strand is set elsewhere in the script.
        if ($f =~ /(complement\()?(join\()?(\d+[\d\.\n,> ]+\d)(\))?/) {
            my $tmp = $3;
            $tmp =~ s/[\s>]//g;                 # drop whitespace, newlines and ">" markers
            my @array = split ",", $tmp;        # one entry per segment
            my (@start, @end);
            foreach my $s (@array) {
                if ($s =~ /(\d+)\.\.(\d+)/) {   # range written as start..end
                    push @start, $1;
                    push @end,   $2;
                } else {                        # single position N, recorded as N..N+1
                    push @start, $s;
                    push @end,   $s + 1;
                }
            }
            # Segments are listed in ascending order; reverse them on the minus strand.
            if ($strand eq "-") {
                @start = reverse @start;
                @end   = reverse @end;
            }
        }
  EXAMPLE
        CDS             complement(join(102933..103045,104811..104942,
                        105561..105643,105732..105835,105910..106020))
                        /gene="CCDC115"
                        /note="Derived by automated computational analysis using
                        gene prediction method: GNOMON."
                        /codon_start=1
                        /product="coiled-coil domain-containing protein 115"
                        /protein_id="XP_003980291.1"
                        /db_xref="GI:410947099"
                        /db_xref="GeneID:101080349"
                        /translation="MAAPDLRAELDSLLLQLFQDLEELEAKRAALNARVEEGWLSLSK
                        ARYSMGAKSVGPLQYASLMEPQVCVYTSEAQDGLQRFWLVRASAQTPEEVGPREAALR
                        RRKGLTRTPEPESFPALRDPLNWFGILVPHSLRQAQASFREGLQLAADMATLQIRIDW
                        GRSQLRGLQEKLKQLEPESA"
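
Besides the location, the script needs the qualifiers of each feature (gene name, protein id and so on). The following is a minimal sketch of one way to pull them out of a feature's text. `parse_qualifiers` is a hypothetical helper written for this README, not the script's own code, and it keeps only the last value when a qualifier such as db_xref is repeated.

```perl
use strict;
use warnings;

# Return a hash of qualifier name => value for one feature's text.
# Handles quoted values that wrap across lines and bare values such as
# /codon_start=1.
sub parse_qualifiers {
    my ($text) = @_;
    my %q;
    while ($text =~ m{/(\w+)=(?:"([^"]*)"|(\S+))}g) {
        my ($name, $value) = ($1, defined $2 ? $2 : $3);
        if ($name eq 'translation') {
            $value =~ s/\s+//g;          # a protein sequence has no spaces
        } else {
            $value =~ s/\s*\n\s*/ /g;    # rejoin wrapped lines with one space
        }
        $q{$name} = $value;
    }
    return %q;
}

# Shortened copy of the example CDS feature.
my $feature = <<'END';
     CDS             complement(join(102933..103045,104811..104942,
                     105561..105643,105732..105835,105910..106020))
                     /gene="CCDC115"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON."
                     /codon_start=1
                     /protein_id="XP_003980291.1"
END

my %q = parse_qualifiers($feature);
print "$_ = $q{$_}\n" for sort keys %q;
```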
perldoc batch_annotation_CDS.pl
BATCH_ANNOTATION_CDS(1) User Contributed Perl Documentation BATCH_ANNOTATION_CDS(1)
SYNOPSIS
    A wrapper around genbank2gtf_CDS.pl that batch converts GenBank
    files into GTF (CDS features only).
USAGE
    perl batch_annotation_CDS.pl
RESULTS
    20 chrXX_CDS.gtf files
perldoc batch_download_genbank_protein.pl
BATCH_DOWNLOAD_GENBANK_PROTEIN(1) User Contributed Perl Documentation BATCH_DOWNLOAD_GENBANK_PROTEIN(1)
SYNOPSIS
    Use NCBI E-utilities (efetch) to download the GenBank format record
    for a given protein id (NP_XXXXX).
USAGE
    perl batch_download_genbank_protein.pl
RESULTS
    Results are stored in a subfolder named NP in the current working
    directory. Files are named proteinid.gb.
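
The downloaded protein records are what link a protein_id back to its transcript_id. The documentation does not say which field the scripts read, so the sketch below assumes the `/coded_by` qualifier of the protein record's CDS feature. `transcript_for_protein` is a hypothetical helper, and the record excerpt is reconstructed for illustration from the CCDC115 examples in this README, not copied from NCBI.

```perl
use strict;
use warnings;

# Return the transcript accession named in a protein record's /coded_by
# qualifier, or nothing if the qualifier is absent.
sub transcript_for_protein {
    my ($record) = @_;
    return $1 if $record =~ m{/coded_by="([^":]+):};
    return;
}

# Illustrative excerpt of a proteinid.gb file (assumed content).
my $record = <<'END';
DBSOURCE    REFSEQ: accession XM_003980242.1
     CDS             1..180
                     /gene="CCDC115"
                     /coded_by="XM_003980242.1:16..558"
END

my $transcript = transcript_for_protein($record);
print "XP_003980291.1 -> $transcript\n" if defined $transcript;
```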
perldoc get_final_cat_gtf.pl
GET_FINAL_CAT_GTF(1) User Contributed Perl Documentation GET_FINAL_CAT_GTF(1)
SYNOPSIS
    Merge the mRNA GTF (exon) and CDS GTF files by filling in the values
    of transcript_id and transcript_name in the CDS GTF file. This
    defines the relationship between transcript_id and protein_id, which
    is parsed from the proteinid.gb files.
USAGE
    perl get_final_cat_gtf.pl
RESULTS
    A file named felCat5_final.gtf is written to the current working
    directory.
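
The fill-in step can be pictured as a lookup from protein_id to transcript_id applied to every CDS line. This is a minimal sketch under stated assumptions: it supposes that each CDS GTF line carries a `protein_id` attribute and an empty `transcript_id` attribute, which the documentation does not confirm, and `fill_transcript` is a hypothetical helper. Filling transcript_name would follow the same pattern.

```perl
use strict;
use warnings;

# Fill the transcript_id attribute of one CDS GTF line from a
# protein_id => transcript_id map. Lines without a known protein_id
# are returned unchanged.
sub fill_transcript {
    my ($line, $protein_to_transcript) = @_;
    my ($protein) = $line =~ /protein_id "([^"]+)"/ or return $line;
    my $transcript = $protein_to_transcript->{$protein};
    return $line unless defined $transcript;
    $line =~ s/transcript_id "[^"]*"/transcript_id "$transcript"/;
    return $line;
}

# Map built from the downloaded protein records (assumed pairing).
my %map = ('XP_003980291.1' => 'XM_003980242.1');

# One CDS line in the assumed layout; chromosome and source are placeholders.
my $cds = join "\t", 'chrA1', 'genbank', 'CDS', 105910, 106020, '.', '-', '0',
    'gene_id "CCDC115"; transcript_id ""; protein_id "XP_003980291.1";';

my $filled = fill_transcript($cds, \%map);
print "$filled\n";
```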