EddyRivasLab/infernal

Converting cmscan output to GFF3

Closed this issue · 14 comments

Hello,

I came across the infernal_tab2gff.pl script that converts cmsearch tabbed output to gff2 format. Do you have a similar conversion script for cmscan to gff3? If so can you please add the same to infernal?

Thank you
Abhijit

It's possible that script was written for an older version of infernal, can you please attach it (by dragging it onto the comment box), so that I can check? If it is for the current version, a simple small change should make it work for cmscan instead of cmsearch.

I found the code in the folder called "demotic" under "easel". This was in the latest relaese. Anyway here it is. Please also mention if I should be exporting the results as --tblout format (--fmt 2) in order for this to work.

#!/usr/bin/perl -w -I/groups/eddy/home/jonest/Demotic

# TAJ 6/23/08 last mod 7/10/08
# Purpose: flexibly convert "cmsearch --tabfile TAB.out" output to GFF format
# Based upon my "blastn2gff.pl" script. 
#
# OUTPUT GFF2 format:
# ------------------
# CONTIG   METHOD      TYPE            START      END      SCORE    ORI   FRAME  GENE "genename"; note "free text"
# $contig  $method     $type           $GFFstart  $GFFend  $score   $ori  "."    $gene_name       $note 
# na       "Infernal"  "Infernal_hit"  na         na       na       na    "."    ::$cm_name[i]    ""             # Defaults
# na       -m          -t              na         na       na       na    na     -g               -n             # Options
#
# Todo List
# -----------------------------------------------------
# (1) Allow for GFF3 format output
# (2) Possibly all switch to produce separate GFF files for each report
# -----------------------------------------------------

use strict;
use demotic_infernal_tab; 
use Getopt::Std;
getopts ('E:s:G:l:m:t:g:n:u:d:b');
our ($opt_E, $opt_s, $opt_G, $opt_l, $opt_m, $opt_t, $opt_g, $opt_n, $opt_b, $opt_u, $opt_d);

(my $script = $0) =~ s/^.*\///;
my $USAGE = "
Parse cmsearch tabfile output, filter hits on user-supplied cutoffs, and output hits
in GFF2 format. 
======================================================================================
   USAGE: $script <options> tabfile.out > foo.gff
======================================================================================
     tabfile.out ==> Output file created by using cmsearch switch '--tabfile tab.out' 
                     from Infernal rc1.0
OPTIONS
--------------------------------------------------------------------------------------
 -E  Eval_cutoff       (E)-value cutoff -- reject hits with E-value > Eval_cutoff
 -s  score_cutoff      (s)score cutoff -- reject hits with bitscore < score_cutoff
 -G  GC_cutoff         (G)C percent cutoff (0..100) -- reject hits with GC < GC_cutoff
 -l  len_cutoff        (l)ength cutoff -- reject hits with length < len_cutoff
 -m  method            (m)ethod. Default is 'Infernal'        # for GFF output
 -t  type              (t)ype. Default is 'Infernal_hit'      # for GFF output
 -g  gene_name         (g)ene name. Default is CM query name  # for GFF output  (1)
 -n  \"a short note\"  (n)ote. No default                     # for GFF output
 -u  X                 (u)pstream pad   -- add X NTs to BEGINNING of all hits   (2)
 -d  Y                 (d)ownstream pad -- add Y Nts to END       of all hits   (2)
--------------------------------------------------------------------------------------
NOTES:
 (1) == Default behavior obtains a (non-unique) 'gene name' from the CM name. For 
        multiple CM queries having the same name, Infernal differentiates the models
        by adding a '.N' version (eg CM.1, CM.2, etc). Specifying the gene name by '-g' 
        results in the same gene name being used for all hits in all reports contained 
        in that tabfile. 
 (2) == The values always add to the length of the feature. X and Y cannot be negative!
        It's highly recommended that you use the '-n NOTE' feature to annotate the 
        changes made to the GFF start and end sites due to these flags. Script warns
        when padding beyond the 5' end of the contig (and sets it to 1), but cannot edge
        detect for the 3' end of the contig. 
*************      While this script can parse 'tabfile' files with output from 
*  WARNING  *      multiple queries (possibly containing very different CMs), only one
*************      GENE name, METHOD and TYPE will be added to all GFF lines created!
";

my $CMs          = 0;     # number of CMs in tabfile
my $hits         = 0;     # number of hits within current CM
my $Eval_cutoff  = 10000; # arbitrarily large E-value default cutoff
my $score_cutoff = -1000; # arbitrarily low bitscore default cutoff
my $GC_cutoff    = 0;     # minimum %GC cutoff 
my $len_cutoff   = 0;     # minimum length cutoff
my $up_pad       = 0;     # add X NTs to upstream/start of hit   NB: applies to all hits!
my $down_pad     = 0;     # add Y NTs to downstream/end of hit   NB: applies to all hits!
my $pass_filter  = 0;
 
# get cutoff & coord padding options (NB: opt's m,t,g,n  obtained during sub print_GFF2)
$Eval_cutoff  = $opt_E              if ($opt_E);
$score_cutoff = $opt_s              if ($opt_s); 
$GC_cutoff    = $opt_G              if ($opt_G); 
$len_cutoff   = $opt_l              if ($opt_l);
$up_pad       = $opt_u              if ($opt_u);
$down_pad     = $opt_d              if ($opt_d);
if ($up_pad   !~ /^\+?\d+$/) { die "Illegal pad: '$up_pad'; must be a whole positive number.";   }
if ($down_pad !~ /^\+?\d+$/) { die "Illegal pad: '$down_pad'; must be a whole positive number."; }
if (($up_pad > 100000) || ($down_pad > 100000)) {
    die "Whoa, whoa! Slow down their Feyman. You're being a bit excessive with your pads. --mgmt";
}


#  =========================================================
#  |   demotic_infernal_tab -- Parameters                  |
#  =========================================================
# 
#  Returns SINGLE STRING; applies to entire tabfile (regardless of no. of CMs)
#  ----------------------------------------------------------------------------
#    $model_num     # number of CM's used in the search                               # check
#    $command       # command line used to run search                                 # check
#    $date          # date search was run                                             # check
#    $db_recs       # number of records in target DB                                  # check
#    $db_size       # DB size (in MB)                                                 # check
#
#  Returns LIST; 0th position in list corresponds to 1st CM in the tabfile 
#        EX:  $cm_name[2]      # name of the CM query used in 3rd report
#  ----------------------------------------------------------------------------
#    @cm_name       # Covariation Model(s) name; 1st model is 0th in array            # check
#    @time_expect   # expected run time (quoted, not converting into hr:min)          # check
#    @time_actual   # actual run time   (quoted, not converting into hr:min)          # check
#    @num_hits      # tracks number of hits for each model                            # check
#
#  Returns 2D ARRAY [i][j];
#     "i" = list of CMs in report (as above); frequently only 1 report, or i=0
#     "j" = list of _hits_ for each CM (0th -> 1st hit), in order appearing in report
#        EX:  ${$GC[1]}->[3]   # GC value of 4th (3+1) hit in the 2nd (1+1) search 
#  ----------------------------------------------------------------------------
#    @t_name        # (2D); target fasta record name                                  # check
#    @t_start       # (2D); start location in fasta target \___   For "-" ori hits,   # check
#    @t_stop        # (2D); stop  location in fasta target /      t_start > t_stop    # check
#    @q_start       # (2D); start location in CM query     \___   Regardless of ori,  # check
#    @q_stop        # (2D); stop  location in CM query     /      q_start < q_stop    # check
#    @bitscore      # (2D); bit score                                                 # check
#    @Eval          # (2D); E-value  // can be in scientific notation                 # check
#    @GC            # (2D); %GC                                                       # check
#  =========================================================


# Parse infernal 'tabfile'; possibly involving multiple query CMs
die $USAGE unless (@ARGV == 1);
my $tabfile = shift;
open (TABFILE, "$tabfile") || die "Can't open $tabfile. You fuckin' wif me?"; 
&demotic_infernal_tab::parse(\*TABFILE); 
close TABFILE;

$CMs  = $demotic_infernal_tab::model_num;                  # number of query CMs

# For all hits, for all CMs in tabfile report -> print in GFF2 formt if passes all cutoffs
foreach my $rep_num (0..$CMs-1) {                          # looping over i CMs
    $hits = $demotic_infernal_tab::num_hits[$rep_num]; 
    foreach my $hit_num (0..$hits-1) {                     # looping over j hits from ith CM 
	$pass_filter = &filter_hit($rep_num, $hit_num); 
	if ($pass_filter) {
	    if ($pass_filter) {
		&print_GFF2 ($rep_num, $hit_num);
	    }
	    else  {  next;  } 
	}
    }
}


############# Validation -- Make sure I'm parsing hit list for all CMs properly  #############
if ($opt_b) {      # Thouroughly embarassing parse dump, for manual validation // not complete
    foreach my $rep (0..($CMs-1)) {
	my $name = $demotic_infernal_tab::cm_name[$rep]; 
	$hits = $demotic_infernal_tab::num_hits[$rep];
	print "------------------------------------------------\n";
	print "Model \#", ($rep+1), " ==> cm_name: [$name]  containing $hits hits\n"; 
	print "------------------------------------------------\n";
	foreach my $j (0..$hits-1) { # looping over all hits for that CM
	    my $contig  = $demotic_infernal_tab::t_name[$rep]->[$j];
	    my $start_t = $demotic_infernal_tab::t_start[$rep]->[$j];
	    my $stop_t  = $demotic_infernal_tab::t_stop[$rep]->[$j];
	    my $start_q = $demotic_infernal_tab::q_start[$rep]->[$j];
	    my $stop_q  = $demotic_infernal_tab::q_stop[$rep]->[$j]; 
	    my $score   = $demotic_infernal_tab::bitscore[$rep]->[$j]; 
	    my $e_val   = $demotic_infernal_tab::Eval[$rep]->[$j]; 
	    my $gc      = $demotic_infernal_tab::GC[$rep]->[$j]; 
	    print "#", ($j+1), ": ",  $contig, 
	          "  Tar:S-E: ",      $start_t, "-", $stop_t, 
	          "  Que:S-E: ",      $start_q, "-", $stop_q, 
	          "  Score: ",        $score, 
	          "  E-val: ",        $e_val, 
	          "  %GC: ",          $gc, 
	          "\n";
	}
    }
}

################# Subroutines  ######################

sub filter_hit {
# INPUT:  Index to _i_th CM in tabfile, _j_th hit for given CM
# OUTPUT: 1 if hit passes command line criteria, 0 if it fails the filter
    my $CM_i   = shift;
    my $hit_j  = shift; 
    my $evalue = $demotic_infernal_tab::Eval[$CM_i]->[$hit_j];
    my $score  = $demotic_infernal_tab::bitscore[$CM_i]->[$hit_j];
    my $gc     = $demotic_infernal_tab::GC[$CM_i]->[$hit_j];
    my $begin  = $demotic_infernal_tab::t_start[$CM_i]->[$hit_j];
    my $end    = $demotic_infernal_tab::t_stop[$CM_i]->[$hit_j];
    my $len    = 0;
    if ($begin < $end ) {  $len = $end   - $begin + 1;  }
    else                {  $len = $begin - $end   + 1;  }
    if (($evalue <= $Eval_cutoff) && 
        ($score  >= $score_cutoff) && 
        ($gc     >= $GC_cutoff) && 
        ($len    >= $len_cutoff)) {
	return 1;   # passes all cutoff criteria
    }
    else {
	return 0;   # does not pass all cutoff criteria
    }
}

sub print_GFF2 {
# INPUT:  Index to _i_th CM in tabfile, _j_th hit for given CM
#         e.g. $demotic_infernal_tab::foo[$i]->[$j]
# OUTPUT: Infernal hit in GFF2 format (see following)
# contig    method   type     start     end      score    ori    frame   gene "genename"; note "blah blah"
# $ctg      $method  $type    $GFFstart $GFFstop $score   $ori   $frame  $gene            $note

    my $CM_i   = shift;
    my $hit_j  = shift; 
    my $ctg    = $demotic_infernal_tab::t_name[$CM_i]->[$hit_j];
    my $start  = $demotic_infernal_tab::t_start[$CM_i]->[$hit_j];
    my $stop   = $demotic_infernal_tab::t_stop[$CM_i]->[$hit_j];
    my $score  = $demotic_infernal_tab::bitscore[$CM_i]->[$hit_j];
    my $frame  = ".";  # used only for CDS
    my $ori    = "";
    my $gene   = "gene \"$demotic_infernal_tab::cm_name[$CM_i]\"";     # default
    my $method = "Infernal";                                           # default
    my $type   = "Infernal_hit";                                       # default
    my $note   = "";                                                   # default
    my $GFFstart = 0;
    my $GFFstop  = 0;
    my $over     = 0;
    if ($opt_g) {  $gene = "gene \"$opt_g\"";  }                       # change default?
    if ($opt_m) {  $method = "$opt_m";  }                              # change default?
    if ($opt_t) {  $type = "$opt_t";   }                               # change default?
    if ($opt_n) {  $note = "; note \"$opt_n\"";  }                     # change default?
# Note: Unlike blastn, target start/stop dictates ori; while query_start is always <= query_stop.
###########################################
#      ========>>>>>>>>>>>>>=========
#              |           |
#     (target) start       stop  
###########################################
    if ($start <= $stop ) {             
	$ori = "+";                     
	$GFFstart = $start - $up_pad;   
	if ($GFFstart < 1) {
	    $over = ($GFFstart * -1) + 1;  # distance beyond 5' end of contig
	    $GFFstart = 1;
	    warn "(${ctg}:${start}-${stop}) couldn't be padded the full $up_pad NTs! START set to 1.";
	    print "\# For following hit; pad overshot $over NTs beyond 5' end of hit; START set to 1\n";
	}
	$GFFstop  = $stop  + $down_pad; # Can't warn against overpadding; I don't know end of ctg!
	print ("$ctg\t$method\t$type\t$GFFstart\t$GFFstop\t$score\t$ori\t$frame\t${gene}$note\n");
    }  
###########################################
#     ========<<<<<<<<<<<<<=========
#             |           |      
#    (target) stop        start
###########################################
    elsif ($start > $stop) {            
	$ori = "-";                     
	$GFFstart = $stop  - $down_pad; 
	if ($GFFstart < 1) {
	    $over = ($GFFstart * -1) + 1;  # distance beyond 5' end of contig
	    $GFFstart = 1;
	    warn "(${ctg}:${start}-${stop}) couldn't be padded the full $up_pad NTs! START set to 1.";
	    print "\# For following hit; pad overshot $over NTs beyond 3' end of hit; START set to 1\n";
	}
	$GFFstop  = $start + $up_pad;   # Can't warn against overpadding; I don't know end of ctg!
	print ("$ctg\t$method\t$type\t$GFFstart\t$GFFstop\t$score\t$ori\t$frame\t${gene}$note\n");
    }
    if ($GFFstart > $GFFstop) {     # meager error checking...
	die "Illegal GFF coords: GFFstart ($GFFstart) > GFFstop ($GFFstop)\n!"; 
    }
}

That script was written for an old version of Infernal. Don't use it. I've written a new one for version 1.1.2. You can get it here:
https://raw.githubusercontent.com/nawrockie/jiffy-infernal-hmmer-scripts/master/infernal-tblout2gff.pl

Or by downloading or cloning the repository it is in:
https://github.com/nawrockie/jiffy-infernal-hmmer-scripts

If you do 'infernal-tblout2gff.pl' you will see a list of options. To run with 'cmscan' output --tblout files, use the --cmscan option. To specify that '--fmt 2' was used with cmscan, use the --fmt2 option.

Let me know if you have any questions or further issues.

Hello Eric,

I am following the protocol in the manuscript "Non-Coding RNA Analysis Using the Rfam Database" to analyze my sRNA seq data. Since my sRNA seq data are short reads, so I followed the suggestion to annotate my reference genome to Rfam first.
I used cmscan, and got the my.cmscan.tblout. In the manuscript it suggested to convert this .tblout file to gff file, the script is
grep -v ˆ# my.cmscan.tblout | awk ' {printf("%s\tinfernal\t%s\t%s\t%s\t%s\t%s\t.\n", $4, $2, $10, $11, $17, $12);}' > my.cmscan.gff. Then I also mapped my sRNA seq data to my reference genome, and got sam file, I converted the sam file to bam file, since the bedtools only accept bam file. The I tried to use the bedtools intersect program to get the overlap of the my.cmscan.gff file and bam file, but the machnice was keeping telling me the my.cmscan.gff is invalid. Is ist because the script for producing the gff file is for the old version of Infernal? I am using the version 1.1.4. Could you write a new one for this version. Thank you.

@WeiXu63: My suggestion is to follow my instructions in this issue from June 7 2019 (#16 (comment))
to use the script here:
https://raw.githubusercontent.com/nawrockie/jiffy-infernal-hmmer-scripts/master/infernal-tblout2gff.pl
To convert between tblout and gff for infernal 1.1.4.

The problem is I am new in this field, I am in the horticulture field, don't know much about perl or programming.

Could you revise this script "grep -v ˆ# my.cmscan.tblout | awk ' {printf("%s\tinfernal\t%s\t%s\t%s\t%s\t%s\t.\n", $4, $2, $10, $11, $17, $12);}' > my.cmscan.gff." in the manuscript? Then I can use it directly. Thank you.

No, the above 'grep' command will not work and I cannot revise it so that it will. However, I wrote the infernal-tblout2gff.pl script specifically for the purpose of converting tblout to gff.

You can use the perl script infernal-tblout2gff.pl by doing the following:

  1. open this page:
    https://raw.githubusercontent.com/nawrockie/jiffy-infernal-hmmer-scripts/master/infernal-tblout2gff.pl
    and choose 'Save as' and save it as infernal-tblout2gff.pl.
  2. Copy your .tblout file into the directory you downloaded the .pl script into and run this command:
    perl infernal-tblout2gff.pl --cmscan my.cmscan.tblout > my.cmscan.gff

Thank you very much, Eric. It worked for me.

Hello,

I need to convert infernal cmsearch output to GFF. I downloaded perl script from https://raw.githubusercontent.com/nawrockie/jiffy-infernal-hmmer-scripts/master/infernal-tblout2gff.pl. It worked perfectly well for an individual tblout file. However, something strange happened when I tried to loop over a bunch of files. I got empty GFFs for ~ 1/2 of files. For the other half, I got an error:

ERROR expected at least 18 space delimited fields in tblout line (fmt 1, default) but got 1 on line:
GCA_000008925.1_ASM892v1_genomic.fna_cmsearch_out.txt 

GCA...txt. is a name of a tblout file. I checked several files invoking the error, they look OK and have 18 fields.

My script for looping looks like that:

#!/bin/bash
#SBATCH --job-name=inf2gff
#SBATCH --error=inf2gff.err
#SBATCH --output=inf2gff.log
#SBATCH --time=10:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=all
#SBATCH --mem=500M

while read LST; do
    echo ${LST}
    perl infernal-tblout2gff.pl --all cmsearch_output/${LST} > tboxes_gffs/${LST}.gff
done<cms.out

I can't see any obvious mistake in my script and will appreciate any help or advice! Thanks in advance.

@octolis : can you please provide a simple example script and input files I can use to reproduce it locally so that I can look into the problem for you?

Sure, thank you!
data.tar.gz

@octolis - the problem had to do with how the script determines if the input is a tblout file or coming from stdin, and how that logic fails to decide it's a tblout file in the context of your specific shell script. I made a change to the script here:
https://github.com/nawrockie/jiffy-infernal-hmmer-scripts and it should work with your script now.

Dear Eric, thank you for your help (sorry for not replying, have been out of office)! I tested the new script and it worked perfectly. Thank you.