gff2tbl not providing locus_tag for CRISPR
microgrim opened this issue · 4 comments
The gff file (and the IMG genome from which it is derived) does not have a locus tag for CRISPR:
Ga0115370_1019 CRT repeat_region 34874 36813 . 1 . ID=Ga0115370_1019.29;Average_repeat_length=37;Number_of_repeats=27;rpt_type=CRISPR;rpt_unit=34948..34985;
...
Ga0115370_1111 Prodigal V2.6.3 February, 2016 CDS 64316 64669 . 1 0 ID=Ga0115370_1111.63;conf=98.59;gc_cont=0.486;locus_tag=Ga0115370_111160;
...
Ga0115370_1111 CRT repeat_region 80076 80257 . 1 . ID=Ga0115370_1111.79;Average_repeat_length=35;Number_of_repeats=3;rpt_type=CRISPR;rpt_unit=80149..80184;
So gff2tbl writes an empty locus_tag for the CRISPR feature, which it also labels as a hypothetical protein:
>Feature Ga0115370_1019
...
31990 34623 CDS
locus_tag Ga0115370_101928
product CRISPR-associated helicase, Cas3 family
...
>Feature Ga0115370_1111
80076 80257 CRISPR
locus_tag
note hypothetical protein
note locus=''
64316 64669 CDS
locus_tag Ga0115370_111160
product translation initiation factor 1 (eIF-1/SUI1)
...
This makes it problematic for tbl2asn:
[tbl2asn 24.9] ERROR: Qualifier 'locus_tag' has no value on misc_feature feature
at lcl|Ga0115370_1111:80076-80257, relative line 5693
[tbl2asn 24.9] ERROR: Unknown feature CRISPR
IMG does not assign locus tags to CRISPR regions. Perhaps the user has to label CRISPR regions manually according to their position in the scaffold (e.g. Ga0115370_101928.5 and Ga0115370_111175.5) in order to satisfy tbl2asn? Alternatively, gff2tbl could skip them, or generate unique locus_tags on the fly.
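To make the "generate on the fly" option concrete, here is a minimal Python sketch. It is not gff2tbl's own code. It assumes the gff is tab-separated, and the function names and the tag scheme (scaffold plus coordinates) are hypothetical. Mapping a bare CRISPR type to repeat_region is also an assumption, and whether tbl2asn accepts tags of this shape has not been checked.

```python
def parse_attributes(column9):
    """Split 'key=value;key=value;' into a dict, skipping empty pieces."""
    attrs = {}
    for piece in column9.strip().split(";"):
        if "=" in piece:
            key, value = piece.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def feature_key_and_tag(gff_line):
    """Return (feature_key, locus_tag) for one tab-separated GFF line."""
    cols = gff_line.rstrip("\n").split("\t")
    seqid, ftype, start, end = cols[0], cols[2], cols[3], cols[4]
    # Some records (e.g. bare CRISPR lines) may lack the attribute column.
    attrs = parse_attributes(cols[8]) if len(cols) > 8 else {}

    # Assumption: report a bare "CRISPR" type under the repeat_region key.
    feature_key = "repeat_region" if ftype == "CRISPR" else ftype

    locus_tag = attrs.get("locus_tag")
    if not locus_tag:
        # Hypothetical scheme: scaffold name plus coordinates.
        locus_tag = f"{seqid}_{start}_{end}"
    return feature_key, locus_tag
```

Features that already carry a locus_tag pass through unchanged, so only the untagged CRISPR records get a synthesized value.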
IMG provides two gff files:
1). *.assembled.gff, which has the repeat_region value in the Type field, but no "product" values:
Ga0115370_1111 Prodigal V2.6.3 February, 2016 CDS 31793 33064 . 1 0 ID=Ga0115370_1111.29;conf=100.00;gc_cont=0.521;locus_tag=Ga0115370_111126;
...
Ga0115370_1111 CRT repeat_region 80076 80257 . 1 . ID=Ga0115370_1111.79;Average_repeat_length=35;Number_of_repeats=3;rpt_type=CRISPR;rpt_unit=80149..80184;
So the resulting tbl file handles the CRISPR correctly, but annotates all the proteins as hypothetical:
31793 33064 CDS
locus_tag Ga0115370_111126
note hypothetical protein
note locus='Ga0115370_111126'
80076 80257 repeat_region
locus_tag Ga0115370_1111.79__CRISPR__80149..80184__Unknown
rpt_type CRISPR
rpt_unit 80149..80184
rpt_family Unknown
2). The other gff file, (IMG genome ID).gff (note the lack of information for the CRISPR, compared to the protein):
Ga0115370_1111 img_core_v400 CDS 31793 33064 . + 0 ID=2663545982;locus_tag=Ga0115370_111126;product=Predicted arabinose efflux permease, MFS family
...
Ga0115370_1111 img_core_v400 CRISPR 80076 80257 +
The tbl file from this gff has the products for proteins, but nothing for the CRISPRs:
>Feature Ga0115370_1111
80076 80257 CRISPR
locus_tag
note hypothetical protein
note locus=''
...
31793 33064 CDS
locus_tag Ga0115370_111126
product Predicted arabinose efflux permease, MFS family
Also, I tried passing a gene product file to both versions, but it didn't seem to help. This is the format from JGI of the gene product file:
2663546040 Ga0115370_111184 COG4639 Predicted kinase 3.00E-25
2663546040 Ga0115370_111184 pfam13671 AAA_33 2.80E-21
2663546040 Ga0115370_111184 Product_name Predicted kinase
2663546040 Ga0115370_111184 DNA_length 468bp
2663546040 Ga0115370_111184 Protein_length 155aa
2663545982 Ga0115370_111126 COG2814 "Predicted arabinose efflux permease, MFS family" 6.00E-22
2663545982 Ga0115370_111126 pfam07690 MFS_1 2.00E-40
2663545982 Ga0115370_111126 Product_name "Predicted arabinose efflux permease, MFS family"
2663545982 Ga0115370_111126 DNA_length 1272bp
2663545982 Ga0115370_111126 Protein_length 423aa
There isn't any information about CRISPRs in this file, because CRISPRs do not get gene IDs in IMG.
There's your problem! The script assumed that the "gene_product" file had the format:
Locus_tag <TAB> Product <TAB> A bunch of other stuff
I fixed the script to use the new format. Use the gff file with repeat_regions and this gene_product file. I haven't been able to test it myself, so let me know if it works (or not).
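For reference, a minimal Python sketch of reading the JGI-style file shown earlier into a locus_tag to product lookup. It is an illustration, not the script's actual parser. It assumes the file is tab-separated with the columns gene ID, locus tag, source, value and an optional score, and that the product sits on the rows whose third column is Product_name. The function name is hypothetical.

```python
import csv
import io


def load_jgi_products(handle):
    """Map locus_tag -> product using only the Product_name rows."""
    products = {}
    # csv handles the double-quoted product names that contain commas.
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 4 and row[2] == "Product_name":
            products[row[1]] = row[3]
    return products


# Two records taken from the example above, rebuilt with tabs.
sample = io.StringIO(
    "2663546040\tGa0115370_111184\tCOG4639\tPredicted kinase\t3.00E-25\n"
    "2663546040\tGa0115370_111184\tProduct_name\tPredicted kinase\n"
    '2663545982\tGa0115370_111126\tProduct_name\t'
    '"Predicted arabinose efflux permease, MFS family"\n'
)
products = load_jgi_products(sample)
```

CRISPR regions never appear in this lookup, because they have no gene ID in IMG, so they need a separate fallback.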
The updated script is still not fully operational. I ran it on both kinds of gff files described above, with a newly formatted tab-delimited gene annotation file like this:
Locus Tag Gene Product Name Gene ID Genome ID Genome Name Batch1
Ga0115370_10011 Helix-turn-helix domain-containing protein 2663544129 2660238729 Oscillatoria limnetica bin re-assembly (V2) 1
Ga0115370_100110 Protein of unknown function (DUF1092) 2663544138 2660238729 Oscillatoria limnetica bin re-assembly (V2) 1
Ga0115370_100111 PmbA protein 2663544139 2660238729 Oscillatoria limnetica bin re-assembly (V2) 1
1). *.assembled.gff with or without this gene annotation file generates:
>Feature Ga0115370_1111
...
6211 7053 CDS
locus_tag Ga0115370_11117
note hypothetical protein
note locus='Ga0115370_11117'
31462 30977 CDS
locus_tag Ga0115370_111125
note hypothetical protein
note locus='Ga0115370_111125'
80076 80257 repeat_region
locus_tag Ga0115370_1111.79__CRISPR__80149..80184__Unknown
rpt_type CRISPR
rpt_unit 80149..80184
rpt_family Unknown
So with the provided gene annotation file it's getting the CRISPR, but missing the annotations for everything else.
2). The other gff file, which has the gene products, gives this result with or without the above gene annotation file:
>Feature Ga0115370_1111
...
6211 7053 CDS
locus_tag Ga0115370_11117
product 7,8-dihydropterin-6-yl-methyl-4-(beta-D-ribofuranosyl)aminobenzene 5'-phosphate synthase
31462 30977 CDS
locus_tag Ga0115370_111125
product hypothetical protein
80076 80257 CRISPR
locus_tag __CRISPR
note hypothetical protein
note locus='__CRISPR'
So with the first gff, gff2tbl interprets the CRISPRs but does not use the gene annotation table for the products. With the second gff, it does not interpret the CRISPRs, and it takes the products directly from the gff rather than from the gene annotation table.
As I understand it, a provided gene annotation file is parsed as tab-delimited. Could the gene annotation table be in the wrong format for the type 1 gff?
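One way to check the format question independently of gff2tbl is to load the table in a few lines of Python and look up a locus tag by hand. This is a sketch under assumptions: the file is tab-delimited and has the header names "Locus Tag" and "Gene Product Name" as in the example above. The function names are hypothetical, and nothing here reflects how gff2tbl itself reads the file.

```python
import csv
import io


def load_annotation_table(handle, key="Locus Tag", value="Gene Product Name"):
    """Map locus_tag -> product from a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    names = reader.fieldnames or []
    if key not in names or value not in names:
        # Typical cause: the file is space-delimited, so the whole
        # header line is read as a single column.
        raise ValueError(f"expected columns {key!r} and {value!r}, got {names}")
    return {row[key]: row[value] for row in reader}


def product_for(locus_tag, table):
    """Look up a product, falling back to 'hypothetical protein'."""
    return table.get(locus_tag, "hypothetical protein")


# Two rows from the example above, rebuilt with tabs.
sample = io.StringIO(
    "Locus Tag\tGene Product Name\tGene ID\tGenome ID\n"
    "Ga0115370_10011\tHelix-turn-helix domain-containing protein\t2663544129\t2660238729\n"
    "Ga0115370_100111\tPmbA protein\t2663544139\t2660238729\n"
)
table = load_annotation_table(sample)
```

If the load raises ValueError, the delimiter or header is the problem. If it succeeds and the lookup still falls back to "hypothetical protein", the locus tags in the table do not match those in the gff.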
Can you give me an example of the gene product file the way you would get it from IMG?
(In reply to microgrim's comment of Saturday, June 11, 2016.)
Sunit Jain
Metagenomics Scientist
Second Genome, Inc.
web: www.secondgenome.com
home: www.sunitjain.com