This pipeline will allow the command-line submission of two FASTA datasets, one containing a collection of ESTs/unigenes and the other genomic sequences. Options are also listed on the command-line.
- RepeatMasker: Masking of the query sequences.
- Geneseqer or BLAT: spliced alignment of query sequences with genomic sequences to identify putative introns.
- Pipeline parses the Geneseqer/BLAT output to compile the Primer3 input file.
- Primer3: Designing of primers around putative splice sites based on the EST sequence.
- Pipeline generates combined output.
- Optional Auto-curation or filtering of results.
IMP [ options ] <templateFASTA> <queryFASTA>
Where:
- <templateFASTA> and <queryFASTA> are replaced with the paths to your two input files.
- The options are optional, but when present each option must be preceded by a - and take the form specified below.
Example: perl IMP.pl -a=g -NoMask -p ../Genomes/Medicago/Mt2.0_FrozenBACs.fas ../BFT_ESTdatasets/subsets/BFT.unigenes.subset
The above example will run IMP with a template of "../Genomes/Medicago/Mt2.0_FrozenBACs.fas" and a query of "../BFT_ESTdatasets/subsets/BFT.unigenes.subset". The option -a=g specifies the use of Geneseqer for the spliced-alignment step, -NoMask skips the RepeatMasker step, and -p ensures the query is preprocessed before Geneseqer is run.
There are two input files for this pipeline: query (ESTs/unigenes) and template (genomic). The query should be a single batch FASTA file containing the ESTs/unigenes for which the primers will be designed. The template should also be a single batch FASTA file containing the genomic information, which will be used to predict the location of splice sites/introns.
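
To make the "single batch FASTA file" requirement concrete, the following is a minimal Python sketch of a reader for such a file. It is only an illustration and is not part of IMP; the function name `read_fasta` and the example headers are assumptions made for this example.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-record (batch) FASTA text into {header: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate FASTA header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # sequence lines of one record are concatenated
            records[header] += line
    return records


# Hypothetical two-record query file, given here as a list of lines.
example = """>EST_001 hypothetical unigene
ACGTACGTAC
GGTTAACC
>EST_002
TTGACCA
""".splitlines()

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

A file that passes this kind of check (every sequence preceded by a unique `>` header) is in the general shape both inputs are expected to have.
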
There are a number of files generated by the helper programs within this pipeline. For documentation of the information within these files please see the original program documentation.
Program | Files generated | Location |
---|---|---|
RepeatMasker | .cat, .log, .masked, .out, .ref and .tbl | Same as query sequence |
Geneseqer | .GeneSeqer.out | Current directory |
Primer3 | .primer3.out.boulderIO | Current directory |
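
The Primer3 files use the Boulder-IO format: each record is a series of TAG=VALUE lines terminated by a line containing only "=". The following Python sketch shows one way such a file could be read. It is not part of IMP, and the tag names in the sample are illustrative only, since the exact tags depend on the Primer3 version in use.

```python
def read_boulder_io(text):
    """Split Boulder-IO text into a list of {tag: value} dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line == "=":
            # a lone "=" closes the current record
            records.append(current)
            current = {}
        elif "=" in line:
            # split on the first "=" only, so values may contain "="
            tag, value = line.split("=", 1)
            current[tag] = value
    # an unterminated trailing record is deliberately ignored
    return records


# Illustrative sample; real files contain many more tags per record.
sample = (
    "PRIMER_SEQUENCE_ID=EST_001\n"
    "PRIMER_PAIR_PENALTY=0.4213\n"
    "=\n"
    "PRIMER_SEQUENCE_ID=EST_002\n"
    "PRIMER_PAIR_PENALTY=1.0870\n"
    "=\n"
)

boulder_records = read_boulder_io(sample)
for rec in boulder_records:
    print(rec["PRIMER_SEQUENCE_ID"], rec["PRIMER_PAIR_PENALTY"])
```
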
Files generated by this program include:
File generated | Description | Location |
---|---|---|
<query>.primer3.in.boulderIO | The input file for Primer3. This file was generated through parsing of the Geneseqer output and calculation of information pertaining to introns. | Current directory |
<query>.primerorder.IMP | A simplified output of the primers generated for the purposes of ordering. Contains the name of the primer, Tm, % GC content, length and sequence for each primer. | Current directory |
<query>.out.IMP | Complete output for this pipeline. Includes information about the query, template, intron and primer for each set (forward and reverse) of primers. | Current directory |
<query>.sorted.IMP | Contains all of the information in out.IMP sorted by EST, Gene Model, Intron. | Current directory |
<query>.curated.IMP | Contains only the selected records as indicated by the autocurate option. All entries are sorted by EST, Gene Model, and Intron. | Current directory |
<query>.curated.details | Contains a detailed account of the curation process: outlines which statistic was used to compare records, the value for each record and which record was chosen. | Current directory |
Each option must be preceded by a -. Either the short or long name of the option may be used. If an option is not specified, its default value is used when the program is run. Option names are case insensitive. Options listed for use with BLAT or Geneseqer have no effect when the other alignment program is selected.
Use with | Short | Long | Values | Description | Default |
---|---|---|---|---|---|
General | a=s | Alignment=s | s = b(BLAT) or g(GeneSeqer) | Choose the program to be used to align the ESTs to the Genome where s = b for blat and g for geneseqer. | g(Geneseqer) |
General | d=s | rmDash | s = q (process query only), t (process template only) or b (process both) | Removes any dashes ( - ) from the query sequence and replaces them with a space (" "). Also ensures that the query is the proper sequence for input into the pipeline. | q |
General | 5extensionFwd=s | | s = a string composed of GACTgact | Allows the user to enter s, which will be added to the 5' end of the forward primer | FALSE |
General | 5extensionRvs=s | | s = a string composed of GACTgact | Allows the user to enter s, which will be added to the 5' end of the reverse primer | FALSE |
General | c=s | autocuration=s | s= any combination of G (only output the best gene model per EST), I (only output the best intron per Gene Model), or P (only output the best primer pair per intron). | Allows specification of auto-curation to be done. | No Auto-curation |
RepeatMasker | r | noMask | | Does not run RepeatMasker. | FALSE |
BLAT | o | ooc | | Generates the ooc file required for BLAT (5.ooc for cross-species DNA to DNA) | FALSE |
BLAT | q | useQuality | | Adds quality information for designing primers. The file containing the quality information must be of the form queryfilename.qual | FALSE |
BLAT | Bt=s | | s = dna (DNA sequence), prot (protein sequence) or dnax (DNA sequence translated in six frames to protein) | Type of sequences in the template input file | dnax |
BLAT | Bq=s | | s = dna (DNA sequence), prot (protein sequence) or dnax (DNA sequence translated in six frames to protein) | Type of sequences in the query input file | dnax |
GeneSeqer | p | preProcess | | Allows preprocessing of the query by running MakeArray distributed with Geneseqer | FALSE |
GeneSeqer | s=sp | species | sp = "human", "mouse", "rat", "chicken", "Drosophila", "Daphnia", "nematode", "yeast", "Aspergillus", "Arabidopsis", "maize", "rice", "Medicago", or "generic" | Splice site model used by Geneseqer | "Medicago" |
Primer3 | PoptSize=i | | i = integer | Optimum length (in bases) of a primer oligo. Primer3 will attempt to pick primers close to this length. | 20 |
Primer3 | PminSize=i | | i = integer | Minimum acceptable length (in bases) of a primer. Must be greater than 0 and less than or equal to PmaxSize. | 18 |
Primer3 | PmaxSize=i | | i = integer | Maximum acceptable length (in bases) of a primer. Currently this parameter cannot be larger than 35. This limit is governed by the maximum oligo size for which Primer3's melting-temperature calculation is valid. | 27 |
Primer3 | PoptTm=f | | f = decimal | Optimum melting temperature (Celsius) for a primer oligo. Primer3 will try to pick primers with melting temperatures close to this temperature. | 60.0 C |
Primer3 | PminTm=f | | f = decimal | Minimum acceptable melting temperature (Celsius) for a primer oligo. | 57.0 C |
Primer3 | PmaxTm=f | | f = decimal | Maximum acceptable melting temperature (Celsius) for a primer oligo. | 63.0 C |
Primer3 | PMask=i | | i = 0 (ignore) or 1 (reject primers overlapping lowercase bases exactly at the 3' end) | This option allows for intelligent design of primers in sequence in which masked regions (for example repeat-masked regions) are lower-cased. | 1 |
Primer3 | PminGC=f | | f = decimal | Minimum allowable percentage of Gs and Cs in any primer. | 20.0% |
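
The option rules above (a leading -, an optional =value, interchangeable short and long names, case-insensitive names) can be sketched in Python as follows. This is an illustration only and does not reproduce how IMP.pl itself parses its arguments; the `ALIASES` map covers just the short/long pairs listed in the table, and `parse_args` is a hypothetical helper.

```python
# Short-to-long name pairs taken from the options table (lower-cased).
ALIASES = {
    "a": "alignment", "d": "rmdash", "c": "autocuration", "r": "nomask",
    "o": "ooc", "q": "usequality", "p": "preprocess", "s": "species",
}


def parse_args(argv):
    """Return (options, template_path, query_path) from a list of arguments."""
    options, positional = {}, []
    for arg in argv:
        if arg.startswith("-"):
            name, sep, value = arg[1:].partition("=")
            name = name.lower()               # option names are case insensitive
            name = ALIASES.get(name, name)    # map short names to long names
            options[name] = value if sep else True  # flags without "=" are booleans
        else:
            positional.append(arg)
    if len(positional) != 2:
        raise ValueError("expected <templateFASTA> <queryFASTA>")
    template, query = positional
    return options, template, query


options, template, query = parse_args([
    "-a=g", "-NoMask", "-p",
    "../Genomes/Medicago/Mt2.0_FrozenBACs.fas",
    "../BFT_ESTdatasets/subsets/BFT.unigenes.subset",
])
print(options)
```

With the arguments from the earlier example, both `-NoMask` and `-r` would resolve to the same `nomask` key in this sketch.
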
Auto-curation was added in version 2 and allows filtering of the results based on a number of criteria. These criteria are specified using the -autocuration=s option.
- G: only output the best gene model (highest similarity) per EST,
- I: only output the best intron (highest average donor/acceptor similarity) per Gene Model,
- P: only output the best primer pair (lowest primer pair penalty) per intron.
These options are independent of one another. The selected records will be compiled in the .curated.IMP file and will be sorted by EST, Gene Model, and Intron.
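
The G, I and P criteria can be read as "keep the best-scoring records within each group". The Python sketch below illustrates that reading under stated assumptions: the record keys (`est`, `gene_model`, `intron`, `similarity`, `splice_similarity`, `pair_penalty`) are hypothetical names chosen for this example, ties are all kept, and the actual IMP implementation may differ.

```python
def keep_best(records, group_keys, score_key, highest=True):
    """Keep the records whose score is the best within their group (ties kept)."""
    best = {}
    for rec in records:
        group = tuple(rec[k] for k in group_keys)
        score = rec[score_key]
        if group not in best:
            best[group] = score
        elif (score > best[group]) if highest else (score < best[group]):
            best[group] = score
    return [r for r in records
            if r[score_key] == best[tuple(r[k] for k in group_keys)]]


# flag -> (grouping keys, score field, True if higher is better)
RULES = {
    "G": (("est",), "similarity", True),
    "I": (("est", "gene_model"), "splice_similarity", True),
    "P": (("est", "gene_model", "intron"), "pair_penalty", False),
}


def autocurate(records, flags):
    """Apply each requested filter, then sort by EST, Gene Model, Intron."""
    for flag in flags.upper():
        group_keys, score_key, highest = RULES[flag]
        records = keep_best(records, group_keys, score_key, highest)
    return sorted(records, key=lambda r: (r["est"], r["gene_model"], r["intron"]))


# Hypothetical records: one EST, two gene models, several primer pairs.
data = [
    {"est": "E1", "gene_model": 1, "intron": 1, "similarity": 0.98,
     "splice_similarity": 0.90, "pair_penalty": 0.5},
    {"est": "E1", "gene_model": 1, "intron": 1, "similarity": 0.98,
     "splice_similarity": 0.90, "pair_penalty": 0.2},
    {"est": "E1", "gene_model": 1, "intron": 2, "similarity": 0.98,
     "splice_similarity": 0.95, "pair_penalty": 0.7},
    {"est": "E1", "gene_model": 2, "intron": 1, "similarity": 0.91,
     "splice_similarity": 0.99, "pair_penalty": 0.1},
]

curated = autocurate(data, "GIP")
print(curated)
```

In this example, G keeps gene model 1, I then keeps intron 2 of that model, and P keeps its only primer pair, so a single record remains.
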
The following lists further describe the headers found in the various output files:
- Name: This is the FASTA header of the query sequence.
- No. Repeats: The number of repeats masked in the query sequence.
- Strand: The strand of the query used for the gapped alignment.
- Size: The length of this particular query sequence.
- Start/End: The start or end position of the gapped alignment.
- Name: This is the FASTA header of the genome sequence
- Size: The length of this particular genome sequence
- Strand: The strand of the template used for the gapped alignment
- Start/End: The start or end position of the gapped alignment
- Match/Mismatch: The total number of matches and mismatches in the gapped alignment
- No. Introns: The total number of introns predicted by the gapped alignment
- Curr. Intron: The intron used to design the primer set
- Start/End: The start or end position on the template sequence of the current intron
- Length: The predicted length of the current intron
- Trg Length: The length of the splice site on the query
- Included Reg: The start and length, on the query, of the region to be used to design the primers. Only primers within this region are valid.
Pertains to Forward/Reverse Primers
- Start: This is the 0-based index of the first or last base of the primer for forward or reverse primer respectively.
- Length: The number of bases in the primer.
- Sequence: The actual sequence of the oligo. The sequence of left primer and internal oligo is presented 5' -> 3' on the same strand as the query sequence. The sequence of the right primer is presented 5' -> 3' on the opposite strand from the query sequence.
- Tm: The melting temperature for the selected oligo.
- GC: The percent GC for the selected oligo (denominator is the number of non-ambiguous bases).
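
As a worked illustration of the Start, Length, Sequence and GC fields, the Python sketch below recovers each primer from the query sequence and computes its percent GC. It assumes the conventions described in the list (0-based Start; for the reverse primer, Start is the last base on the query strand and the oligo is the reverse complement). The function names and the example sequence are hypothetical and not part of IMP.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_primer(query, start, length):
    # start is the 0-based index of the primer's first base on the query
    return query[start:start + length]


def reverse_primer(query, start, length):
    # start is the 0-based index of the primer's last base on the query strand,
    # which is the 5' end of the oligo once it is reverse-complemented
    first = start - length + 1
    return reverse_complement(query[first:start + 1])


def gc_percent(oligo):
    # the denominator counts only non-ambiguous bases (A, C, G, T)
    bases = [b for b in oligo.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


query_seq = "ATGCCGTAAGCTTGGA"               # hypothetical 16-base query
fwd = forward_primer(query_seq, 0, 4)        # bases 0..3
rvs = reverse_primer(query_seq, 15, 4)       # bases 12..15, reverse-complemented
print(fwd, rvs, gc_percent(fwd), gc_percent(rvs))
```
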
Pertains to primer set Diagnostics:
- EST Product Size: The size of the PCR product if the EST sequence was used as a template
- Genomic Product Size: The predicted length of the PCR product if the genomic sequence was used as a template
- Pair Penalty: The value of the objective function for this pair (lower is better). This value includes all of the penalty weights that were previously set. All defaults were kept for these values (for more information see the Primer3 Documentation)
- Compl-Any & Compl-End: The inter-pair complementarity measures for the selected forward and reverse primer
- F/R Self-Any & F/R Self-End: Two floats delimited by a forward slash which indicate the self-complementarity measures for the forward and reverse primer (for more information see the Primer3 Documentation).
- F/R End Stability: Two floats delimited by a forward slash which indicate the delta G of disruption of the five 3' bases of the forward and reverse primer (for more information see the Primer3 Documentation).
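
The two product sizes can be related with simple arithmetic. The Python sketch below is an assumption about how they fit together, not a description of IMP's own calculation: it takes the forward and reverse primer Start values as defined above, and assumes that the only difference between the EST and genomic products is the intron (or introns) lying between the primers.

```python
def est_product_size(fwd_start, rvs_start):
    # both values are 0-based query positions: first base of the forward
    # primer and last base of the reverse primer, so the span is inclusive
    return rvs_start - fwd_start + 1


def genomic_product_size(est_size, intron_lengths):
    # assumes the genomic product is the EST product plus every intron
    # located between the two primers, with no other insertions or deletions
    return est_size + sum(intron_lengths)


est_size = est_product_size(100, 299)                 # 200 bases on the EST
genomic_size = genomic_product_size(est_size, [450])  # one 450-base intron
print(est_size, genomic_size)
```

A genomic product that is larger than the EST product by roughly the predicted intron length is what distinguishes amplification across the intron.
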
- Smit AFA, Hubley R & Green P, unpublished data. RepeatMasker version open 3.2.7 available from: http://repeatmasker.org. RepeatMasker incorporates the following programs: cross_match 1a, tandem repeat finder 1b and RepBase 1c.
- Green P, unpublished data. Crossmatch available from http://www.phrap.org/phredphrapconsed.html.
- Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27(2): 573-580.
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 110: 462-467.
- Usuka J, Zhu W, Brendel V (2000) Optimal spliced alignment of homologous cDNA to a genomic template. Bioinformatics 16(3): 203-211. Source code available at http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi.
- Kent WJ (2002) BLAT – The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664.
- Rozen S and Skaletsky HJ (2000) Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386. Source code available at http://sourceforge.net/projects/primer3/.
- Schuler GD (1997) Sequence mapping by electronic PCR. Genome Research 7(5): 541-550.