
	     pIRS (profile based Illumina pair-end Reads Simulator)

Contents
========
1. Introduction
2. Program framework
3. Usage
4. Examples
5. Output file format
6. Notes

1 Introduction
==============

pIRS is a program for simulating paired-end reads from a reference genome.  It
is optimized for simulating reads similar to those generated from the Illumina
platform.

See `INSTALL' for installation instructions.

There are two subcommands: `pirs simulate' and `pirs diploid'.  See section 3
for more details, or run `pirs simulate -h' or `pirs diploid -h'.

2 Program framework
===================
  2.1 Profile Generator
    Six tools are supplied.  SOAP2 or BWA, and soap.coverage
    (http://soap.genomics.org.cn/soapaligner.html), are required.  The full
    process is shown in getprofile.sh.example.
    2.1.1 GC%-Depth Profile Stat.
      a). Run soap and soap.coverage to get the .depth single file(s).  These may be gzip-compressed.
      b). Run gc_coverage_bias on all the depth single files.  This produces GC-depth statistics in 1% GC bins, plus other files.
      c). Run gc_coverage_bias_plot on the GC-depth stat file.  This produces a PNG plot and a .gc file in 5% GC bins.
      d). Manually check the .gc file for abnormal levels caused by low depth in certain GC% windows.
    2.1.2 Base-Calling Profile Stat:
      a). Run soap or bwa to get .{soap,single} or .sam file(s).
      b). Run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix .
      c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files.
         However, it is up to you to make sure that the read lengths match.
    2.1.3 InDel Profile Stat:
      a). Choose samples with NO polymorphic InDels, such as the Coliphage samples that ship with Illumina sequencers.
      b). Run bwa to get a .sam/.bam file.
      c). Run indelstat_sam_bam to get the profile.
    2.1.4 Insert size & mapping ratio stat:
      a). Run soap or bwa to get .{soap,single} or .sam file(s).
      b). Run alignment_stator *.
         * alignment_stator cannot currently compute the mapping ratio for SAM files.
  2.2 Simulator
    Two commands:
    pirs diploid: generates a diploid genome sequence.  It reads the input genome sequence,
                  simulates SNPs, InDels and SVs (structural variations) on it, and then
                  outputs the resulting genome sequence.
    pirs simulate: simulates Illumina data and outputs paired-end read files.
    Note:
      a) If you only want to simulate the diploid genome sequence, "pirs diploid" is enough.
      b) If you want to simulate sequencing data from a haploid genome, you only need "pirs simulate".
      c) If you want to simulate sequencing data from a diploid genome, you first need to run "pirs diploid"
         to get the other diploid genome sequence, and then run "pirs simulate" using both the original
         genome sequence and the previous output sequence as the input.

3 Usage
=======

 pirs <command> [option]
    diploid     generate diploid genome.
    simulate    simulate Illumina reads.

  3.1 pirs diploid
	Usage: pirs diploid [OPTIONS...] REFERENCE

	Simulate a diploid genome by creating a copy of a haploid genome with
	heterozygosity introduced.  REFERENCE specifies a FASTA file containing
	the reference genome.  It may be compressed (gzip).  It may contain multiple
	sequences (scaffolds or chromosomes), each marked with a separate FASTA tag
	line.  The introduced heterozygosity takes the form of SNPs, indels, and
	large-scale structural variation (insertions, deletions and inversions).
	If REFERENCE is '-', the reference sequence is read from stdin, but it must be
	uncompressed.

	The probabilities of SNPs, indels, and large-scale structural variation can be
	specified with the '-s', '-d', and '-v' options, respectively.  You can also
	set the ratio of transitions to transversions (for SNPs) with the '-R' option.
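
The effect of a transition-to-transversion ratio can be sketched as follows.
This is an illustration only, not code from pIRS.  It assumes that the ratio R
means P(transition) / P(transversion) = R, so a SNP is a transition with
probability R / (R + 1), and that the two possible transversions are equally
likely.  The name `pick_snp_base` is hypothetical.

```python
import random

# A transition stays within purines (A, G) or within pyrimidines (C, T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
PURINES = "AG"
PYRIMIDINES = "CT"

def pick_snp_base(ref_base, ratio, rng):
    """Return a base different from ref_base, honoring the Ti/Tv ratio."""
    p_transition = ratio / (ratio + 1.0)  # assumption: ratio = P(ti) / P(tv)
    if rng.random() < p_transition:
        return TRANSITION[ref_base]
    # Transversion: switch between the purine and pyrimidine classes.
    targets = PYRIMIDINES if ref_base in PURINES else PURINES
    return rng.choice(targets)

rng = random.Random(1)
counts = {"ti": 0, "tv": 0}
for _ in range(30000):
    alt = pick_snp_base("A", 2.0, rng)
    counts["ti" if alt == "G" else "tv"] += 1
observed_ratio = counts["ti"] / counts["tv"]
print(round(observed_ratio, 2))
```

With the default ratio of 2.0, about two thirds of the simulated SNPs in this
sketch are transitions.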

	Indels are split evenly between insertions and deletions. The length
	distribution of the indels is as follows and is derived from panda
	re-sequencing data:
		1bp	64.82%
		2bp	17.17%
		3bp	7.20%
		4bp	7.29%
		5bp	2.18%
		6bp	1.34%

	Large-scale structural variation is split evenly among large-scale insertions,
	deletions, and inversions.  By default, the length distribution of these
	large-scale features is as follows:
		100bp	70%
		200bp	20%
		500bp	7%
		1000bp	2%
		2000bp	1%

	`pirs diploid' does not use multiple threads, even if pIRS was configured with
	--enable-multiple threads.

	OPTIONS:
	  -s, --snp-rate=RATE    A floating-point number in the interval [0, 1] that
				   specifies the heterozygous SNP rate.  Default: 0.001

	  -d, --indel-rate=RATE  A floating-point number in the interval [0, 1] that
				   specifies the heterozygous indel rate.
				   Default: 0.0001

	  -v, --sv-rate=RATE     A floating-point number in the interval [0, 1] that
				 specifies the large-scale structural variation
				 (insertion, deletion, inversion) rate in the diploid
				   genome. Default: 0.000001

	  -R, --transition-to-transversion-ratio=RATIO
				 In a SNP, a transition is when a purine or pyrimidine
				   is changed to another of the same type (A <=> G, C <=> T)
				   while a transversion is when a purine is changed
				   into a pyrimidine or vice-versa.  This option
				   specifies a floating-point number RATIO that gives
				   the ratio of the transition probability to the
				   transversion probability for simulated SNPs.
				   Default: 2.0

	  -o, --output-prefix=PREFIX
				 Use PREFIX as the prefix of the output file and logs.
				   Default: "pirs_diploid"

	  -O, --output-file=FILE
				Use FILE as the name of the output file. Use '-'
				   for standard output; this also moves the
				   informational messages from stdout to stderr.

	  -c, --output-file-type=TYPE
				 The string "text" or "gzip" to specify the type of
				   the output FASTA file containing the diploid copy
				   of the genome, as well as the log files.
				   Default: "text"

	  -n, --no-logs          Do not write the log files.

	  -S, --random-seed=SEED Use SEED as the random seed. Default:
				   time(NULL) * getpid()

	  -q, --quiet            Do not print informational messages.

	  -h, --help             Show this help and exit.

	  -V, --version          Show version information and exit.

	EXAMPLE:
	  ./pirs diploid ref_sequence.fa -s 0.001 -d 0.0001 -v 0.000001\
			 -o ref_sequence >pirs.out 2>pirs.err
      
  3.2 pirs simulate

	Usage: ./pirs simulate [OPTION]... REFERENCE.FASTA...

	pIRS is a program for simulating paired-end reads from a genome.  It is
	optimized for simulating reads from the Illumina platform.  The input to
	pIRS is any number of reference sequences. Typically you would just provide
	one FASTA file containing your reference sequence, but you may provide two
	if you have generated a diploid sequence with `pirs diploid', or if you have
	chromosomes split up into multiple FASTA files.  The output of pIRS is two
	FASTQ files containing the simulated paired-end reads, as well as several log
	files.

	Input sequences are assumed to be haploid.  If you instead want to simulate
	reads from a diploid genome, you must give the --diploid option so that
	the diploidy is taken into account when computing coverage.  If you do
	not do this, you will get twice as many reads as you wanted.
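
The doubling can be checked with a little arithmetic.  The sketch below
assumes the usual definition of coverage (total sequenced bases divided by
genome length), so that the number of read pairs is
coverage * length / (2 * read length).  The function name is hypothetical and
the exact rounding used by pIRS may differ.

```python
def read_pairs_per_sequence(seq_len, read_len, coverage, diploid=False):
    """Read pairs to draw from one input sequence of seq_len bases."""
    if diploid:
        coverage = coverage / 2.0  # two input sequences represent one genome
    return int(seq_len * coverage / (2 * read_len))

# One haploid reference of 1 Mbp, 100 bp reads, 5x coverage.
hap = read_pairs_per_sequence(1_000_000, 100, 5)
# Two references (original plus `pirs diploid' output) with --diploid.
dip_total = 2 * read_pairs_per_sequence(1_000_000, 100, 5, diploid=True)
# The same two references without --diploid.
no_flag_total = 2 * read_pairs_per_sequence(1_000_000, 100, 5)
print(hap, dip_total, no_flag_total)
```

Under these assumptions, the run without --diploid produces twice as many
read pairs as the run with it.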

	pIRS simulates a normally-distributed insert (fragment) length using the
	Box-Muller method.  Usually you want the insert length standard deviation to
	be 1/20 or 1/10 of the insert length mean (see the -m and -v options).
	This program also simulates Illumina sequencing errors, quality scores and
	GC bias based on empirical distribution profiles.  Users may use the default
	profiles in this package, which were generated from large sets of real
	sequencing data, or they may generate their own profiles.
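
For readers unfamiliar with the method used for insert lengths, here is a
minimal sketch of the basic Box-Muller transform.  It turns two uniform random
numbers into one standard normal deviate, which is then scaled by the standard
deviation and shifted by the mean.  The function names and the rounding to an
integer are assumptions made for this example, not the pIRS implementation.

```python
import math
import random

def box_muller(rng):
    """Return one standard normal deviate from two uniform deviates."""
    u1 = 1.0 - rng.random()  # in (0, 1], which avoids log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def sample_insert_len(mean, sd, rng):
    """Draw one insert length, rounded to a whole number of bases."""
    return int(round(mean + sd * box_muller(rng)))

rng = random.Random(7)
# Mean 180 with a standard deviation of 10% of the mean.
lengths = [sample_insert_len(180, 18, rng) for _ in range(20000)]
avg = sum(lengths) / len(lengths)
sd = math.sqrt(sum((x - avg) ** 2 for x in lengths) / len(lengths))
print(round(avg, 1), round(sd, 1))
```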

	OPTIONS:
	  -l LEN, --read-len=LEN
			Generate reads having a length of LEN.  Default: 100

	  -x VAL, --coverage=VAL
			 Set the average sequencing coverage (sometimes called depth).
			 It may be either a floating-point number or an integer.

	  -m LEN, --insert-len-mean=LEN
			 Generate inserts (fragments) having an average length of LEN.
			 Default: 180

	  -v LEN, --insert-len-sd=LEN
			 Set the standard deviation of the insert (fragment) length.
			 Default: 10% of insert length mean.

	  -j, --jumping, --cyclicize
			 Make the paired-end reads face away from each other, as
			 in a jumping library.  Default: the reads face towards each
			 other.

	  -d, --diploid
			 This option asserts that reads are being simulated from a
			 diploid genome.  It causes the program to abort if there
			 are not exactly two reference sequences; in addition, the
			 coverage is divided in half, since the two reference
			 sequences are in reality the same genome.  This option
			 is not required to simulate diploid reads, but without
			 it you must halve the coverage yourself, because the
			 value given to -x is then applied to each reference
			 sequence.

	  -B FILE, --base-calling-profile=FILE, --subst-error-profile=FILE
			 Use FILE as the base-calling profile.  This profile will be
			 used to simulate substitution errors.  Default:
			 PREFIX/share/pirs/Base-Calling_Profiles/humNew.PE100.matrix.gz

	  -I FILE, --indel-error-profile=FILE, --indel-profile=FILE
			 Use FILE as the indel-error profile.  This profile will be
			 used to simulate insertions and deletions in the reads that
			 are artifacts of the sequencing process.  Default:
			 PREFIX/share/pirs/InDel_Profiles/phixv2.InDel.matrix

	  -G FILE, --gc-bias-profile=FILE, --gc-content-bias-profile=FILE
			 Use FILE as the GC content bias profile.  This profile will
			 adjust the read coverage based on the GC content of
			 fragments.  Defaults:
			 PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_100.dat,
			 PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_150.dat,
			 PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_200.dat,
			 depending on the mean insert length.

	  -e RATE, --error-rate=RATE, --subst-error-rate=RATE
			 Set the substitution error rate.  The base-calling profile
			 will still be used, but the average frequency of errors will
			 be changed to RATE.  Set to 0 to disable substitution errors
			 completely.  In that case, the base-calling profile will not
			 be used.  Default: default error rate of base-calling
			 profile.

			 Note: since pIRS parameterizes the error rate by
			 several parameters, it is very difficult to determine exactly
			 what needs to be done to make the error rate be a given
			 value.  We try to adjust the probabilities of getting each
			 quality score in order to accommodate the user-supplied error
			 rate.  However, depending on your input sequences, the actual
			 error rate simulated by pIRS could be off by 20% or more.
			 Please check the informational output to see the final error
			 rate that was actually simulated.

	  -A ALGO, --substitution-error-algorithm=ALGO, --subst-error-algo=ALGO
			 Set the algorithm used for simulating substitution errors.
			 It may be set to the string "dist" or "qtrans".  The
			 default is "qtrans".

			 The "dist" algorithm looks up the substitution error rate
			 for each base pair based on the current cycle and the true
			 base.  This lookup produces a quality score and a called base
			 that may or may not be the same as the true base.  In the
			 base-calling profile, the matrix we use is marked as the
			 [DistMatrix].

			 The "qtrans" algorithm is a Markov-chain model based on the
			 previous quality score and current cycle.  The next quality
			 score is looked up with a certain probability based on these
			 parameters.  The matrix used for this is marked as
			 [QTransMatrix] in the base-calling profile.  Then, the
			 DistMatrix is used to find a called base for the quality score.
			 The DistMatrix is also used to call the base in the first
			 cycle.

	  -M MODE, --mask=MODE, --eamss=MODE
			 Use the EAMSS algorithm for masking read quality.  MODE may be
			 the string "quality" or "lowercase".  The EAMSS algorithm
			 identifies low-quality, GC-rich regions near the ends of reads.
			 "quality" mode will change the quality scores on these
			 regions to (2 + quality_shift), while "lowercase" mode
			 will change the base pairs to lower case, but not change
			 the quality values.  Default: Do not use EAMSS.

	  -Q VAL, --quality-shift=VAL, --phred-offset=VAL
			 Set the ASCII shift of the quality value (usually 64 or 33 for
			 Illumina data).  Default: 33

	  --no-quality-values
	  --fasta
			 Do not simulate quality values.  The simulated reads will be
			 written as a FASTA file rather than a FASTQ file.
			 Substitution errors may still be done; if you do not want
			 to simulate any substitution errors, provide --error-rate=0 or
			 --no-substitution-errors.

	  --no-subst-errors
	  --no-substitution-errors
			 Do not simulate substitution errors.  Equivalent to
			 --error-rate=0.

	  --no-indels
	  --no-indel-errors
			 Do not simulate indels.  The indel error profile will not be
			 used.

	  --no-gc-bias
	  --no-gc-content-bias
			 Do not simulate GC bias.  The GC bias profile will not be
			 used.

	  -o PREFIX, --output-prefix=PREFIX
			 Use PREFIX as the prefix of the output files.  Default:
			 "pirs_reads"

	  -c TYPE, --output-file-type=TYPE
			 The string "text" or "gzip" to specify the type of
			 the output FASTQ files containing the simulated reads
			 of the genome, as well as the log files.  Default: "text"

	  -z, --compress
			 Equivalent to -c gzip.

	  -n, --no-logs, --no-log-files
			 Do not write the log files.

	  -S SEED, --random-seed=SEED
			 Use SEED as the random seed. Default:
			 time(NULL) * getpid().  Note: If pIRS was not compiled with
			 --disable-threads, each thread actually uses its own random
			 number generator that is seeded by this base seed added to
			 the thread number; also, if you need pIRS's output to be
			 exactly reproducible, you must specify the random seed as well
			 as use only 1 simulator thread (--threads=1, or configure
			 with --disable-threads, or run on a system with 4 or fewer
			 processors).

	  -t NUM_THREADS, --threads=NUM_THREADS
			 Use NUM_THREADS threads to simulate reads.  This option is
			 not available if pIRS was compiled with the --disable-threads
			 option.  Default: number of processors minus 2 if writing
			 uncompressed output, or number of processors minus 3 if
			 writing compressed output, or 1 if there are not this many
			 processors.

	  -q, --quiet    Do not print informational messages.

	  -h, --help     Show this help and exit.

	  -V, --version  Show version information and exit.
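
To make the "qtrans" description above more concrete, here is a toy sketch of
a first-order Markov chain over quality scores, where the next quality depends
on the previous quality and the current cycle.  Everything numeric in it is
invented for illustration: the three-value quality alphabet, the weights, and
the fixed starting quality (pIRS instead takes the first cycle from the
DistMatrix and the transitions from the [QTransMatrix] in the profile).

```python
import random

QUALS = [2, 20, 40]  # toy alphabet of quality scores (invented)

def toy_weights(cycle, prev_q):
    """Invented weights for the next quality given cycle and previous quality."""
    weights = []
    for q in QUALS:
        if q == prev_q:
            weights.append(8.0)                 # tend to stay at the same quality
        elif q < prev_q:
            weights.append(1.0 + 0.05 * cycle)  # drops get likelier in late cycles
        else:
            weights.append(1.0)
    return weights

def simulate_qualities(read_len, rng, first_q=40):
    """Walk the chain: each quality depends on the previous one and the cycle."""
    quals = [first_q]
    for cycle in range(1, read_len):
        weights = toy_weights(cycle, quals[-1])
        quals.append(rng.choices(QUALS, weights=weights)[0])
    return quals

quals = simulate_qualities(100, random.Random(3))
print(quals[:10], quals[-10:])
```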


4 Examples
==========
  4.1 Simulating a diploid genome sequence.
    Example command line:
    pirs diploid Human_ref.fa -s 0.001 -R 2 -d 0.00001 -v 0.000001 -o Human >simulate_seq.o 2>simulate_seq.e
    Output files:
      a) Human.snp.indel.invertion.fa:   another diploid genome sequence.
      b) Human_indel.lst:                InDel information list.
      c) Human_snp.lst:                  SNP information list.
      d) Human_invertion.lst:            inversion information list.
      e) simulate_seq.o, simulate_seq.e: the program's standard output and standard error.
  4.2 Simulate Illumina paired-end reads from a haploid genome.
    Example command line:
    pirs simulate Human_ref.fa -m 170 -l 90 -x 5 -v 10 -o Human >simulate_170.o 2>simulate_170.e
    Output files:
      a) Human_90_170_1.fq, Human_90_170_2.fq:    the paired-end read files.
      b) Human_90_170.error_rate.distr:           the error distribution file.
      c) Human_90_170.insert_len.distr:           the insert length distribution file.
      d) Human_90_170.read.info:                  information about every simulated read.
      e) simulate_170.o, simulate_170.e:          the program's standard output and standard error.
  4.3 Simulate Illumina paired-end reads from a diploid genome.
    Example command line:
    pirs simulate --diploid Human_ref.fa Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10 \
                            -o Human >simulate_800.o 2>simulate_800.e
    Output files:
      a) Human_70_800_1.fq, Human_70_800_2.fq:   the paired-end read files.
      b) Human_70_800.error_rate.distr:          the error distribution file.
      c) Human_70_800.insertsize.distr:          the insert size distribution file.
      d) Human_70_800.read.info.gz:              information about every simulated read.
      e) simulate_800.e, simulate_800.o:         the program's standard error and standard output.

5 Output file format
====================
  a) *.fq/*.fq.gz

    FASTQ files containing the simulated reads.  The files are given names
    similar to PREFIX_70_800_1.fq, where PREFIX is the prefix provided by the -o
    option (default: "pirs_reads"), 70 is the read length, 800 is the mean
    insert length, and 1 means the file for read 1 of the read pairs.  These
    files will be written as GZIP files with the .gz suffix if you provide the
    `-c gzip' option.

      @read_800_21/1
      ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
      +
      egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
      @read_800_22/1
      CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
      +
      gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg
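
    A minimal sketch of reading such records and turning the quality string
    back into numeric scores is shown below.  The helper names are
    hypothetical, and the records are assumed to use exactly four lines each,
    as in the sample above.  The shift of 64 is a guess based on the
    characters in that sample; use whatever value was given to the -Q option.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual  # drop the leading '@'

def phred_scores(qual, shift):
    """Convert quality characters to integer scores by removing the ASCII shift."""
    return [ord(c) - shift for c in qual]

record = [
    "@read_800_21/1",
    "ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG",
    "+",
    "egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb",
]
name, seq, qual = next(parse_fastq(record))
scores = phred_scores(qual, 64)  # assumed shift, see the note above
print(name, len(seq), min(scores), max(scores))
```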
    
  b) *.read.info

    Information about each simulated read.
    
    Column 1: read ID.
    Column 2: Name of the reference file; this is mainly provided so that you
              can tell which chromosome set a read came from if you simulate
              reads from a diploid genome produced with the `pirs diploid'
              command.
    Column 3: FASTA/FASTQ sequence ID of the contig/scaffold/chromosome.
    Column 4: position(1-based) in chromosome.
    Column 5: "+" forward direction; "-" reverse direction.
    Column 6: real insert size.
    Column 7: length of the read end masked by the EAMSS algorithm.
    Column 8: read-position(1-based) of substitution error and raw base(->)error base.
    Column 9: read-position(1-based) of insertion error and the base of insertion.
    Column 10: read-position(1-based) of deletion error and the base of deletion

  c) *_snp.lst
    For example:
    
      I	3	3	G	A 
      I	45342	45355	C	T 
      I	104775	104680	C	T
      .....
    Column 1: chromosome sequence ID.
    Column 2: position(1-based) of SNP in reference chromosome.
    Column 3: position(1-based) of SNP in simulated diploid chromosome.
    Column 4: base of reference chromosome sequence.
    Column 5: the base of SNP.
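
    The following sketch reads a list in this layout and counts how many of
    the SNPs are transitions.  It assumes whitespace-separated fields with
    exactly five columns per record, as in the example above; the function
    name `parse_snp_list` is hypothetical.

```python
import io

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def parse_snp_list(handle):
    """Yield (seq_id, ref_pos, diploid_pos, ref_base, snp_base) tuples."""
    for line in handle:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        seq_id, ref_pos, dip_pos, ref_base, snp_base = fields
        yield seq_id, int(ref_pos), int(dip_pos), ref_base, snp_base

# The three example records shown above.
example = io.StringIO(
    "I\t3\t3\tG\tA\n"
    "I\t45342\t45355\tC\tT\n"
    "I\t104775\t104680\tC\tT\n"
)
snps = list(parse_snp_list(example))
n_ti = sum(1 for s in snps if (s[3], s[4]) in TRANSITIONS)
print(len(snps), n_ti)
```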

  d) *_indel.lst
    For example:
      I  - 3 3 1 C
      IV + 5 5 2 AC
      .....
    Column 1: chromosome sequence ID.
    Column 2: "-" Deletion; "+" Insertion.
    Column 3: position(1-based) of InDel in the reference chromosome sequence.
              For deletions, it is the position of the first deleted base.  For
              insertions, it is the position of the reference base corresponding
              to the base directly before the insertion. 
    Column 4: position(1-based) of InDel in the diploid sequence.  For
              deletions, it is the position of the diploid sequence base
              corresponding to the reference base directly before the deletion.
              For insertions, it is the position of the first inserted base.
    Column 5: length of InDel (always positive).
    Column 6: sequence of the InDel.

      I - 3 3 1 C
      Ref: a t G c a // 3 is the position of the G base in the chromosome sequence.
         : a t G - a // following the G base is a deletion of the "C" base.
      IV + 5 5 2 AC
      Ref: t t g c A - - g t t // 5 is the position of the A base in the chromosome sequence.
         : t t g c A A C g t t // following the A base is an insertion of the "AC" bases.

  e) *_inversion.lst
    For example:
      I	50191	50195	100 
      I	948984	948903	200
    Column 1: chromosome sequence ID.
    Column 2: position(1-based) of beginning of inversion in the reference
              chromosome sequence.
    Column 3: position(1-based) of beginning of inversion of the diploid
              chromosome sequence.
    Column 4: length of inversion.

6 Notes
=======
  a) pIRS does not simulate reads containing the "N" character.  If your input
     reference contains "N" characters, reads generated in those regions will
     be discarded.
  b) When running a simulation without the --no-subst-errors option, the maximum
     length of the simulated reads depends on the number of cycles recorded in
     the base-calling profile.  The user must set the read length to no more
     than half the number of cycles recorded in the base-calling profile.
  c) When masking quality values, the program uses the same EAMSS algorithm
     as CASAVA v1.8.0.  This is only done if the --eamss option is supplied.
  d) The program parses one chromosome at a time.  Reads are evenly distributed
     among the chromosomes.
  e) To re-do a simulation and get the exact same results, you must set the
     random seed with the -S parameter.  In addition, you must use the
     single-threaded version of pIRS, or else set --threads=1 (for only one
     simulator thread).
  f) pIRS will shift quality values when the substitution-error rate set by
     the user differs from that in the profile.  The quality scores in the
     output data range from 2 to the maximum score in the profile.  You can
     find the real substitution-error rate of the output data in the file
     *.error_rate.distr.
  
  For update & support, please refer to http://code.google.com/p/pirs/ .