Geo-omics/scripts

gff2tbl not providing locus_tag for CRISPR

microgrim opened this issue · 4 comments

The gff file (and the IMG genome from which it is derived) does not have a locus tag for CRISPR:

Ga0115370_1019  CRT repeat_region   34874   36813   .   1   .   ID=Ga0115370_1019.29;Average_repeat_length=37;Number_of_repeats=27;rpt_type=CRISPR;rpt_unit=34948..34985;
...
Ga0115370_1111  Prodigal V2.6.3 February, 2016  CDS 64316   64669   .   1   0   ID=Ga0115370_1111.63;conf=98.59;gc_cont=0.486;locus_tag=Ga0115370_111160;
...
Ga0115370_1111  CRT repeat_region   80076   80257   .   1   .   ID=Ga0115370_1111.79;Average_repeat_length=35;Number_of_repeats=3;rpt_type=CRISPR;rpt_unit=80149..80184;

As a result, gff2tbl writes an empty locus_tag for the CRISPR feature and labels it a hypothetical protein:

>Feature Ga0115370_1019
...
31990   34623   CDS
            locus_tag   Ga0115370_101928
            product  CRISPR-associated helicase, Cas3 family
...
>Feature Ga0115370_1111
80076   80257   CRISPR
            locus_tag   
            note    hypothetical protein
            note    locus=''
64316   64669   CDS
            locus_tag   Ga0115370_111160
            product  translation initiation factor 1 (eIF-1/SUI1)
...

tbl2asn then rejects the table:

[tbl2asn 24.9] ERROR: Qualifier 'locus_tag' has no value on misc_feature feature 
at lcl|Ga0115370_1111:80076-80257, relative line 5693
[tbl2asn 24.9] ERROR: Unknown feature CRISPR
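Both errors can be caught before tbl2asn is run. The following is a minimal sketch, not part of gff2tbl: `lint_tbl`, `KNOWN_FEATURE_KEYS` and `VALUE_REQUIRED` are hypothetical names, the feature-key set is only a small illustrative subset, and the tbl file is assumed to be whitespace-delimited with indented qualifier lines.

```python
# Illustrative subset of feature keys; NOT the complete INSDC list.
KNOWN_FEATURE_KEYS = {"gene", "CDS", "tRNA", "rRNA", "repeat_region", "misc_feature"}
# Qualifiers assumed here to always need a value (a hypothetical choice).
VALUE_REQUIRED = {"locus_tag", "product"}


def lint_tbl(text):
    """Return (line_number, message) pairs for suspicious lines in a .tbl file."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts or line.startswith(">"):
            continue  # blank line or ">Feature" header
        if line[0].isspace():
            # Qualifier line: indented "qualifier value"
            if parts[0] in VALUE_REQUIRED and len(parts) == 1:
                problems.append((lineno, f"qualifier '{parts[0]}' has no value"))
        elif len(parts) >= 3 and parts[2] not in KNOWN_FEATURE_KEYS:
            # Feature line: "start end key"
            problems.append((lineno, f"unknown feature key '{parts[2]}'"))
    return problems


sample = (
    ">Feature Ga0115370_1111\n"
    "80076\t80257\tCRISPR\n"
    "\t\t\tlocus_tag\t\n"
    "\t\t\tnote\thypothetical protein\n"
    "64316\t64669\tCDS\n"
    "\t\t\tlocus_tag\tGa0115370_111160\n"
)
for lineno, message in lint_tbl(sample):
    print(lineno, message)
```

On the sample, the CRISPR feature line and its empty locus_tag line are the two lines reported; the CDS passes.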

IMG does not assign locus tags to CRISPR regions. To satisfy tbl2asn, perhaps the user has to label CRISPR regions manually according to their position in the scaffold, e.g. Ga0115370_101928.5 and Ga0115370_111175.5. Alternatively, gff2tbl could skip CRISPR features entirely, or generate unique locus_tags for them on the fly.
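To make the "on the fly" option concrete, here is a rough sketch of how a placeholder locus_tag could be derived for a GFF row that lacks one. It is not how gff2tbl works: `parse_attributes` and `locus_tag_for` are hypothetical helpers, the naming scheme (ID with the dot replaced, or scaffold plus type and coordinates) is an assumption, and whether tbl2asn accepts such tags has not been checked here.

```python
def parse_attributes(column9):
    """Split a GFF attribute column ('key=value;key=value;') into a dict."""
    attrs = {}
    for pair in column9.strip().strip(";").split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def locus_tag_for(fields):
    """Return the row's locus_tag, or a synthesized placeholder if it has none.

    `fields` is one GFF row already split on tabs.
    """
    attrs = parse_attributes(fields[8]) if len(fields) > 8 else {}
    if attrs.get("locus_tag"):
        return attrs["locus_tag"]
    if attrs.get("ID"):
        # Hypothetical scheme: reuse the feature ID, e.g. Ga0115370_1111.79
        return attrs["ID"].replace(".", "_")
    # Last resort: scaffold, feature type and coordinates
    seqid, ftype, start, end = fields[0], fields[2], fields[3], fields[4]
    return f"{seqid}_{ftype}_{start}_{end}"


crispr_row = [
    "Ga0115370_1111", "CRT", "repeat_region", "80076", "80257", ".", "1", ".",
    "ID=Ga0115370_1111.79;Average_repeat_length=35;Number_of_repeats=3;"
    "rpt_type=CRISPR;rpt_unit=80149..80184;",
]
print(locus_tag_for(crispr_row))
```

Because the placeholder is built from the feature ID or the coordinates, it stays unique within a scaffold as long as those are unique in the GFF.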

There are 2 gff files that IMG provides:
1). *.assembled.gff, which has the repeat_region value in the Type field, but no "product" values:

Ga0115370_1111  Prodigal V2.6.3 February, 2016  CDS 31793   33064   .   1   0   ID=Ga0115370_1111.29;conf=100.00;gc_cont=0.521;locus_tag=Ga0115370_111126;
...
Ga0115370_1111  CRT repeat_region   80076   80257   .   1   .   ID=Ga0115370_1111.79;Average_repeat_length=35;Number_of_repeats=3;rpt_type=CRISPR;rpt_unit=80149..80184;

The resulting tbl file handles the CRISPR correctly, but every protein is annotated as hypothetical:

31793   33064   CDS
            locus_tag   Ga0115370_111126
            note    hypothetical protein
            note    locus='Ga0115370_111126'
80076   80257   repeat_region
            locus_tag   Ga0115370_1111.79__CRISPR__80149..80184__Unknown
            rpt_type    CRISPR
            rpt_unit    80149..80184
            rpt_family  Unknown

2). The other gff file, (IMG genome ID).gff (note the lack of information for the CRISPR, compared to the protein):

Ga0115370_1111  img_core_v400   CDS 31793   33064   .   +   0   ID=2663545982;locus_tag=Ga0115370_111126;product=Predicted arabinose efflux permease, MFS family
...
Ga0115370_1111  img_core_v400   CRISPR  80076   80257       +       

The tbl file from this gff has the products for proteins, but nothing for the CRISPRs:

>Feature Ga0115370_1111
80076   80257   CRISPR
            locus_tag   
            note    hypothetical protein
            note    locus=''
...
31793   33064   CDS
            locus_tag   Ga0115370_111126
            product Predicted arabinose efflux permease, MFS family

I also tried passing a gene product file along with each gff file, but it didn't seem to help. This is the format of the gene product file from JGI:

2663546040  Ga0115370_111184    COG4639 Predicted kinase        3.00E-25
2663546040  Ga0115370_111184    pfam13671   AAA_33      2.80E-21
2663546040  Ga0115370_111184    Product_name        Predicted kinase    
2663546040  Ga0115370_111184    DNA_length      468bp   
2663546040  Ga0115370_111184    Protein_length      155aa   

2663545982  Ga0115370_111126    COG2814 "Predicted arabinose efflux permease, MFS family"       6.00E-22
2663545982  Ga0115370_111126    pfam07690   MFS_1       2.00E-40
2663545982  Ga0115370_111126    Product_name        "Predicted arabinose efflux permease, MFS family"   
2663545982  Ga0115370_111126    DNA_length      1272bp  
2663545982  Ga0115370_111126    Protein_length      423aa   

There isn't any information about CRISPRs in this file, because CRISPRs do not get gene IDs in IMG.
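For reference, the product names in a file like this can be pulled out with a few lines of Python. This is only a sketch under assumptions: the file is taken to be tab-delimited with the gene ID in the first column, the locus tag in the second and the row type (COG, pfam, Product_name, ...) in the third, and `read_product_names` is a hypothetical helper, not something gff2tbl provides.

```python
def read_product_names(lines):
    """Map locus_tag -> product name using only the Product_name rows."""
    products = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[2] != "Product_name":
            continue  # COG, pfam, DNA_length, ... rows and blank lines
        # The value sits in the first non-empty column after the row type;
        # JGI quotes names that contain commas, so strip the quotes.
        value = next((f.strip().strip('"') for f in fields[3:] if f.strip()), "")
        if value:
            products[fields[1]] = value
    return products


example = [
    "2663546040\tGa0115370_111184\tCOG4639\tPredicted kinase\t\t3.00E-25\n",
    "2663546040\tGa0115370_111184\tProduct_name\t\tPredicted kinase\t\n",
    "\n",
    '2663545982\tGa0115370_111126\tProduct_name\t\t"Predicted arabinose efflux permease, MFS family"\t\n',
]
print(read_product_names(example))
```

CRISPR regions never appear in the result, since they have no rows in the file to begin with.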

There's your problem! The script assumes that the "gene_product" file has the format:

Locus_tag <TAB> Product <TAB> A bunch of other stuff

I fixed the script to use the new format. Use the gff file with repeat_regions together with this gene_product file. I haven't been able to test it myself, so let me know whether it works.

The updated script is still not fully working. I ran it on both kinds of gff file described above, with a newly formatted tab-delimited gene annotation file like so:

Locus Tag   Gene Product Name   Gene ID Genome ID   Genome Name Batch1
Ga0115370_10011 Helix-turn-helix domain-containing protein  2663544129  2660238729  Oscillatoria limnetica bin re-assembly (V2) 1
Ga0115370_100110    Protein of unknown function (DUF1092)   2663544138  2660238729  Oscillatoria limnetica bin re-assembly (V2) 1
Ga0115370_100111    PmbA protein    2663544139  2660238729  Oscillatoria limnetica bin re-assembly (V2) 1

1). *.assembled.gff with or without this gene annotation file generates:

>Feature Ga0115370_1111
...
6211    7053    CDS
            locus_tag   Ga0115370_11117
            note    hypothetical protein
            note    locus='Ga0115370_11117'
31462   30977   CDS
            locus_tag   Ga0115370_111125
            note    hypothetical protein
            note    locus='Ga0115370_111125'
80076   80257   repeat_region
            locus_tag   Ga0115370_1111.79__CRISPR__80149..80184__Unknown
            rpt_type    CRISPR
            rpt_unit    80149..80184
            rpt_family  Unknown

So even with the gene annotation file provided, gff2tbl gets the CRISPR but misses the annotations for everything else.

2). The other gff file, which has the gene products, gives this result with or without the above gene annotation file:

>Feature Ga0115370_1111
...
6211    7053    CDS
            locus_tag   Ga0115370_11117
            product 7,8-dihydropterin-6-yl-methyl-4-(beta-D-ribofuranosyl)aminobenzene 5'-phosphate synthase
31462   30977   CDS
            locus_tag   Ga0115370_111125
            product hypothetical protein
80076   80257   CRISPR
            locus_tag   __CRISPR
            note    hypothetical protein
            note    locus='__CRISPR'

So with the first gff, gff2tbl interprets the CRISPRs but does not use the gene annotation table for the products. With the second gff, it does not interpret the CRISPRs, and it takes the products directly from the gff rather than from the gene annotation table.

As I understand it, a provided gene annotation file is parsed by splitting on tabs. Could the gene annotation table be in the wrong format for the type 1 gff?
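The behaviour I expected can be sketched as follows. This is an illustration only, not the gff2tbl code: it assumes the annotation table is tab-delimited with the header shown above ("Locus Tag", "Gene Product Name", ...), and `read_annotation_table` and `product_for` are hypothetical names.

```python
import csv
import io


def read_annotation_table(handle):
    """Map locus tag -> product name from a tab-delimited table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Locus Tag"]: row["Gene Product Name"]
        for row in reader
        if row.get("Locus Tag") and row.get("Gene Product Name")
    }


def product_for(locus_tag, gff_attrs, table):
    """Prefer the annotation table, then the gff 'product' attribute."""
    return table.get(locus_tag) or gff_attrs.get("product") or "hypothetical protein"


table_text = (
    "Locus Tag\tGene Product Name\tGene ID\tGenome ID\tGenome Name\tBatch1\n"
    "Ga0115370_10011\tHelix-turn-helix domain-containing protein\t2663544129\t"
    "2660238729\tOscillatoria limnetica bin re-assembly (V2)\t1\n"
    "Ga0115370_100111\tPmbA protein\t2663544139\t"
    "2660238729\tOscillatoria limnetica bin re-assembly (V2)\t1\n"
)
table = read_annotation_table(io.StringIO(table_text))
print(product_for("Ga0115370_100111", {}, table))
```

With this precedence a CDS from the type 1 gff, which carries no product attribute, would still pick up its name from the table, and only loci absent from both sources would fall back to "hypothetical protein".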

Can you give me an example of the gene product file the way you would get it from IMG?

Sunit Jain
Metagenomics Scientist
Second Genome, Inc.
web: www.secondgenome.com
home: www.sunitjain.com