pIRS (profile based Illumina pair-end Reads Simulator)

Contents
========
1. Introduction
2. Program framework
3. Usage
4. Examples
5. Output file format
6. Notes

1 Introduction
==============
pIRS is a program for simulating paired-end reads from a reference genome.
It is optimized for simulating reads similar to those generated from the
Illumina platform.

See `INSTALL' for installation instructions.

There are two subcommands: `pirs simulate' and `pirs diploid'. See section 3
for more details, or run `pirs simulate -h' or `pirs diploid -h'.

2 Program framework
===================

2.1 Profile Generator

Six tools are supplied. SOAP2 or BWA, and soap.coverage
(http://soap.genomics.org.cn/soapaligner.html), are required.
The full process is shown in getprofile.sh.example.

2.1.1 GC%-Depth Profile Stat:
  a). Run soap and soap.coverage to get .depth single file(s). These files
      may be gzip-compressed.
  b). Run gc_coverage_bias on all depth single files. You will get a GC-depth
      stat file binned by 1 GC%, plus other files.
  c). Run gc_coverage_bias_plot on the GC-depth stat file. You will get a PNG
      plot and a .gc file binned by 5 GC%.
  d). Manually check the .gc file for any abnormal levels caused by low depth
      in certain GC% windows.

2.1.2 Base-Calling Profile Stat:
  a). Run soap or bwa to get .{soap,single} or .sam file(s).
  b). Run error_matrix_calculator on those file(s). You will get
      *.{count,ratio}.matrix .
  c). You can use error_matrix_merger to merge several .{count,ratio}.matrix
      files. However, it is up to you to make sure the read lengths match.

2.1.3 InDel Profile Stat:
  a). Choose samples with NO polymorphic indels, such as the Coliphage
      samples that are shipped with Illumina sequencers.
  b). Run bwa to get a .sam/.bam file.
  c). Run indelstat_sam_bam to get the profile.

2.1.4 Insert size & mapping ratio stat:
  a). Run soap or bwa to get .{soap,single} or .sam file(s).
  b). Run alignment_stator *.

  * alignment_stator cannot currently compute the mapping ratio for SAM
    files.
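
As a rough illustration of what a GC%-depth profile captures (section 2.1.1),
the Python sketch below bins fixed-size windows of a reference by GC
percentage and averages the depth observed in each bin. It is not the
gc_coverage_bias tool and does not read or write its file formats; the
function name, the window size, and the in-memory inputs are all assumptions
made for illustration.

```python
from collections import defaultdict


def gc_depth_profile(sequence, depths, window=100, bin_width=5):
    """Return {gc_bin_start: mean depth} over non-overlapping windows.

    sequence  -- reference bases as a string
    depths    -- per-base depth values, same length as sequence
    window    -- window size in bp (hypothetical default)
    bin_width -- width of each GC% bin (5 mirrors the .gc file granularity)
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        if "N" in chunk:
            continue  # skip windows that contain unknown bases
        gc_pct = 100.0 * (chunk.count("G") + chunk.count("C")) / window
        # Clamp 100% GC into the last bin so the bins cover [0, 100].
        gc_bin = min(int(gc_pct // bin_width) * bin_width, 100 - bin_width)
        mean_depth = sum(depths[start:start + window]) / window
        totals[gc_bin] += mean_depth
        counts[gc_bin] += 1
    return {b: totals[b] / counts[b] for b in sorted(totals)}
```

Bins that receive only a handful of windows give noisy averages, which is
one reason step d) of 2.1.1 asks for a manual check of the .gc file.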

2.2 Simulator

Two commands:

  pirs diploid:  used for generating a diploid genome sequence. It reads the
                 input genome sequence and simulates SNPs, InDels, and SVs
                 (structural variation) on it. Finally, it outputs the
                 resulting genome sequence.

  pirs simulate: used for simulating Illumina data. It outputs paired-end
                 read files.

Note:
  a) If you only want to simulate the diploid genome sequence, "pirs diploid"
     is enough.
  b) If you want to simulate sequencing data of a haploid genome, all you
     need is "pirs simulate".
  c) If you want to simulate sequencing data of a diploid genome, you first
     need to run "pirs diploid" to get the other diploid genome sequence,
     and then run "pirs simulate" using both the original genome sequence
     and the previous output sequence as the input.

3 Usage
=======
  pirs <command> [option]
      diploid     generate diploid genome.
      simulate    simulate Illumina reads.

3.1 pirs diploid

Usage: pirs diploid [OPTIONS...] REFERENCE

Simulate a diploid genome by creating a copy of a haploid genome with
heterozygosity introduced.

REFERENCE specifies a FASTA file containing the reference genome. It may be
compressed (gzip). It may contain multiple sequences (scaffolds or
chromosomes), each marked with a separate FASTA tag line. The introduced
heterozygosity takes the form of SNPs, indels, and large-scale structural
variation (insertions, deletions and inversions). If REFERENCE is '-', the
reference sequence is read from stdin, but it must be uncompressed.

The probabilities of SNPs, indels, and large-scale structural variation can
be specified with the '-s', '-d', and '-v' options, respectively. You can
also set the ratio of transitions to transversions (for SNPs) with the '-R'
option.

Indels are split evenly between insertions and deletions. The length
distribution of the indels is as follows and is derived from panda
re-sequencing data:
        1bp     64.82%
        2bp     17.17%
        3bp      7.20%
        4bp      7.29%
        5bp      2.18%
        6bp      1.34%

Large-scale structural variation is split evenly among large-scale
insertions, deletions, and inversions.
By default, the length distribution of these large-scale features is as
follows:
        100bp    70%
        200bp    20%
        500bp     7%
        1000bp    2%
        2000bp    1%

`pirs diploid' does not use multiple threads, even if pIRS was configured
with --enable-multiple threads.

OPTIONS:
  -s, --snp-rate=RATE
        A floating-point number in the interval [0, 1] that specifies the
        heterozygous SNP rate. Default: 0.001

  -d, --indel-rate=RATE
        A floating-point number in the interval [0, 1] that specifies the
        heterozygous indel rate. Default: 0.0001

  -v, --sv-rate=RATE
        A floating-point number in the interval [0, 1] that specifies the
        large-scale structural variation (insertion, deletion, inversion)
        rate in the diploid genome. Default: 0.000001

  -R, --transition-to-transversion-ratio=RATIO
        In a SNP, a transition is when a purine or pyrimidine is changed to
        one of the same (A <=> G, C <=> T) while a transversion is when a
        purine is changed into a pyrimidine or vice-versa. This option
        specifies a floating-point number RATIO that gives the ratio of the
        transition probability to the transversion probability for simulated
        SNPs. Default: 2.0

  -o, --output-prefix=PREFIX
        Use PREFIX as the prefix of the output file and logs.
        Default: "pirs_diploid"

  -O, --output-file=FILE
        Use FILE as the name of the output file. Use '-' for standard
        output; this also moves the informational messages from stdout to
        stderr.

  -c, --output-file-type=TYPE
        The string "text" or "gzip" to specify the type of the output FASTA
        file containing the diploid copy of the genome, as well as the log
        files. Default: "text"

  -n, --no-logs
        Do not write the log files.

  -S, --random-seed=SEED
        Use SEED as the random seed. Default: time(NULL) * getpid()

  -q, --quiet
        Do not print informational messages.

  -h, --help
        Show this help and exit.

  -V, --version
        Show version information and exit.

EXAMPLE:
  ./pirs diploid ref_sequence.fa -s 0.001 -d 0.0001 -v 0.000001 \
        -o ref_sequence >pirs.out 2>pirs.err

3.2 pirs simulate

Usage: ./pirs simulate [OPTION]... REFERENCE.FASTA...

pIRS is a program for simulating paired-end reads from a genome. It is
optimized for simulating reads from the Illumina platform. The input to pIRS
is any number of reference sequences. Typically you would just provide one
FASTA file containing your reference sequence, but you may provide two if
you have generated a diploid sequence with `pirs diploid', or if you have
chromosomes split up into multiple FASTA files.

The output of pIRS is two FASTQ files containing the simulated paired-end
reads, as well as several log files.

Input sequences are assumed to be haploid. If you instead want to simulate
reads from a diploid genome, you must give the --diploid option so that the
diploidy is taken into account when computing coverage. If you do not do
this, you will get twice as many reads as you wanted.

pIRS simulates a normally-distributed insert (fragment) length using the
Box-Muller method. Usually you want the insert length standard deviation to
be 1/20 or 1/10 of the insert length mean (see the -m and -v options). This
program also simulates Illumina sequencing error, quality score and GC bias
based on empirical distribution profiles. Users may use the default profiles
in this package, which were generated from a large amount of real sequencing
data, or they may generate their own profiles.

OPTIONS:
  -l LEN, --read-len=LEN
        Generate reads having a length of LEN. Default: 100

  -x VAL, --coverage=VAL
        Set the average sequencing coverage (sometimes called depth). It may
        be either a floating-point number or an integer.

  -m LEN, --insert-len-mean=LEN
        Generate inserts (fragments) having an average length of LEN.
        Default: 180

  -v LEN, --insert-len-sd=LEN
        Set the standard deviation of the insert (fragment) length.
        Default: 10% of insert length mean.

  -j, --jumping, --cyclicize
        Make the paired-end reads face away from each other, as in a jumping
        library. Default: the reads face towards each other.

  -d, --diploid
        This option asserts that reads are being simulated from a diploid
        genome.
        It causes the program to abort if there are not exactly two
        reference sequences; in addition, the coverage is divided in half,
        since the two reference sequences are in reality the same genome.
        This option is not required to simulate diploid reads, but you must
        set the coverage correctly otherwise (it will be half as much as you
        think).

  -B FILE, --base-calling-profile=FILE, --subst-error-profile=FILE
        Use FILE as the base-calling profile. This profile will be used to
        simulate substitution errors. Default:
        PREFIX/share/pirs/Base-Calling_Profiles/humNew.PE100.matrix.gz

  -I FILE, --indel-error-profile=FILE, --indel-profile=FILE
        Use FILE as the indel-error profile. This profile will be used to
        simulate insertions and deletions in the reads that are artifacts of
        the sequencing process. Default:
        PREFIX/share/pirs/InDel_Profiles/phixv2.InDel.matrix

  -G FILE, --gc-bias-profile=FILE, --gc-content-bias-profile=FILE
        Use FILE as the GC content bias profile. This profile will adjust
        the read coverage based on the GC content of fragments. Defaults:
        PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_100.dat,
        PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_150.dat,
        PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_200.dat,
        depending on the mean insert length.

  -e RATE, --error-rate=RATE, --subst-error-rate=RATE
        Set the substitution error rate. The base-calling profile will still
        be used, but the average frequency of errors will be changed to
        RATE. Set to 0 to disable substitution errors completely. In that
        case, the base-calling profile will not be used.
        Default: default error rate of base-calling profile.

        Note: since pIRS parameterizes the error rate by several parameters,
        it is very difficult to determine exactly what needs to be done to
        make the error rate be a given value. We try to adjust the
        probabilities of getting each quality score in order to accommodate
        the user-supplied error rate.
        However, depending on your input sequences, the actual error rate
        simulated by pIRS could be off by 20% or more. Please check the
        informational output to see the final error rate that was actually
        simulated.

  -A ALGO, --substitution-error-algorithm=ALGO, --subst-error-algo=ALGO
        Set the algorithm used for simulating substitution errors. It may be
        set to the string "dist" or "qtrans". The default is "qtrans".

        The "dist" algorithm looks up the substitution error rate for each
        base pair based on the current cycle and the true base. This lookup
        produces a quality score and a called base that may or may not be
        the same as the true base. In the base-calling profile, the matrix
        we use is marked as the [DistMatrix].

        The "qtrans" algorithm is a Markov-chain model based on the previous
        quality score and current cycle. The next quality score is looked up
        with a certain probability based on these parameters. The matrix
        used for this is marked as [QTransMatrix] in the base-calling
        profile. Then, the DistMatrix is used to find a called base for the
        quality score. The DistMatrix is also used to call the base in the
        first cycle.

  -M MODE, --mask=MODE, --eamss=MODE
        Use the EAMSS algorithm for masking read quality. MODE may be the
        string "quality" or "lowercase". The EAMSS algorithm identifies
        low-quality, GC-rich regions near the ends of reads. "quality" mode
        will change the quality scores on these regions to
        (2 + quality_shift), while "lowercase" mode will change the base
        pairs to lower case, but not change the quality values.
        Default: Do not use EAMSS.

  -Q VAL, --quality-shift=VAL, --phred-offset=VAL
        Set the ASCII shift of the quality value (usually 64 or 33 for
        Illumina data). Default: 33

  --no-quality-values
  --fasta
        Do not simulate quality values. The simulated reads will be written
        as a FASTA file rather than a FASTQ file. Substitution errors may
        still be simulated; if you do not want to simulate any substitution
        errors, provide --error-rate=0 or --no-substitution-errors.

  --no-subst-errors
  --no-substitution-errors
        Do not simulate substitution errors. Equivalent to --error-rate=0.

  --no-indels
  --no-indel-errors
        Do not simulate indels. The indel error profile will not be used.

  --no-gc-bias
  --no-gc-content-bias
        Do not simulate GC bias. The GC bias profile will not be used.

  -o PREFIX, --output-prefix=PREFIX
        Use PREFIX as the prefix of the output files.
        Default: "pirs_reads"

  -c TYPE, --output-file-type=TYPE
        The string "text" or "gzip" to specify the type of the output FASTQ
        files containing the simulated reads of the genome, as well as the
        log files. Default: "text"

  -z, --compress
        Equivalent to -c gzip.

  -n, --no-logs, --no-log-files
        Do not write the log files.

  -S SEED, --random-seed=SEED
        Use SEED as the random seed. Default: time(NULL) * getpid().
        Note: If pIRS was not compiled with --disable-threads, each thread
        actually uses its own random number generator that is seeded by this
        base seed added to the thread number; also, if you need pIRS's
        output to be exactly reproducible, you must specify the random seed
        as well as use only 1 simulator thread (--threads=1, or configure
        with --disable-threads, or run on a system with 4 or fewer
        processors).

  -t, --threads=NUM_THREADS
        Use NUM_THREADS threads to simulate reads. This option is not
        available if pIRS was compiled with the --disable-threads option.
        Default: number of processors minus 2 if writing uncompressed
        output, or number of processors minus 3 if writing compressed
        output, or 1 if there are not this many processors.

  -q, --quiet
        Do not print informational messages.

  -h, --help
        Show this help and exit.

  -V, --version
        Show version information and exit.

4 Examples
==========

4.1 Simulating a diploid genome sequence.

Example command line:
  pirs diploid Human_ref.fa -s 0.001 -R 2 -d 0.00001 -v 0.000001 -o Human >simulate_seq.o 2>simulate_seq.e

Output files:
  a) Human.snp.indel.invertion.fa: another diploid genome sequence.
  b) Human_indel.lst: InDel information list.
  c) Human_snp.lst: SNP information list.
  d) Human_invertion.lst: inversion information list.
  e) simulate_seq.o, simulate_seq.e: the program's captured standard output
     and standard error.

4.2 Simulate Illumina paired-end reads from a haploid genome.

Example command line:
  pirs simulate Human_ref.fa -m 170 -l 90 -x 5 -v 10 -o Human >simulate_170.o 2>simulate_170.e

Output files:
  a) Human_90_170_1.fq, Human_90_170_2.fq: the paired-end read files.
  b) Human_90_170.error_rate.distr: the error distribution file.
  c) Human_90_170.insert_len.distr: the insert length distribution file.
  d) Human_90_170.read.info: information about each simulated read.
  e) simulate_170.o, simulate_170.e: the program's captured standard output
     and standard error.

4.3 Simulate Illumina paired-end reads from a diploid genome.

Example command line:
  pirs simulate --diploid Human_ref.fa Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10 -o Human >simulate_800.o 2>simulate_800.e

Output files:
  a) Human_70_800_1.fq, Human_70_800_2.fq: the paired-end read files.
  b) Human_70_800.error_rate.distr: the error distribution file.
  c) Human_70_800.insertsize.distr: the insert size distribution file.
  d) Human_70_800.read.info.gz: information about each simulated read.
  e) simulate_800.e, simulate_800.o: the program's captured standard output
     and standard error.

5 Output file format
====================

a) *.fq/*.fq.gz
  FASTQ files containing the simulated reads. The files are given names
  similar to PREFIX_70_800_1.fq, where PREFIX is the prefix provided by the
  -o option (default: "pirs_reads"), 70 is the read length, 800 is the mean
  insert length, and 1 means the file for read 1 of the read pairs. These
  files will be written as GZIP files with the .gz suffix if you provide the
  `-c gzip' option.

  @read_800_21/1
  ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
  +
  egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
  @read_800_22/1
  CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
  +
  gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg

b) *.read.info
  Information about each simulated read.
  Column 1: read ID.
  Column 2: name of the reference file; this is mainly provided so that you
            can tell which chromosome set a read came from if you simulate
            reads from a diploid genome produced with the `pirs diploid'
            command.
  Column 3: FASTA/FASTQ sequence ID of the contig/scaffold/chromosome.
  Column 4: position (1-based) in the chromosome.
  Column 5: "+" forward direction; "-" reverse direction.
  Column 6: real insert size.
  Column 7: length of read-end masking by the EAMSS algorithm.
  Column 8: read position (1-based) of each substitution error, given as
            raw base(->)error base.
  Column 9: read position (1-based) of each insertion error and the
            inserted base.
  Column 10: read position (1-based) of each deletion error and the deleted
            base.

c) *_snp.lst
  For example:
  I       3       3       G       A
  I       45342   45355   C       T
  I       104775  104680  C       T
  .....
  Column 1: chromosome sequence ID.
  Column 2: position (1-based) of the SNP in the reference chromosome.
  Column 3: position (1-based) of the SNP in the simulated diploid
            chromosome.
  Column 4: base of the reference chromosome sequence.
  Column 5: base of the SNP.

d) *_indel.lst
  For example:
  I       -       3       1       C
  IV      +       5       2       AC
  .....
  Column 1: chromosome sequence ID.
  Column 2: "-" Deletion; "+" Insertion.
  Column 3: position (1-based) of the InDel in the reference chromosome
            sequence. For deletions, it is the position of the first
            deleted base. For insertions, it is the position of the
            reference base corresponding to the base directly before the
            insertion.
  Column 4: position (1-based) of the InDel in the diploid sequence. For
            deletions, it is the position of the diploid sequence base
            corresponding to the reference base directly before the
            deletion.
            For insertions, it is the position of the first inserted base.
  Column 5: length of the InDel (always positive).
  Column 6: sequence of the InDel.

  I       -       3       3       1       C
  Ref: a t G c a         // 3 is the position of the G base in the chromosome.
     : a t G - a         // following the G base is a deletion of the "C" base.

  IV      +       5       5       2       A C
  Ref: t t g c A - - g t t   // 5 is the position of the A base in the chromosome.
     : t t g c A A C g t t   // following the A base is an insertion of the "AC" bases.

e) *_inversion.lst
  For example:
  I       50191   50195   100
  I       948984  948903  200
  Column 1: chromosome sequence ID.
  Column 2: position (1-based) of the beginning of the inversion in the
            reference chromosome sequence.
  Column 3: position (1-based) of the beginning of the inversion in the
            diploid chromosome sequence.
  Column 4: length of the inversion.

6 Notes
=======
a) pIRS does not simulate reads containing the "N" character. If your input
   reference contains the "N" character, reads generated in this region will
   be discarded.

b) When running a simulation without the --no-subst-errors option, the
   maximum length of the simulated reads depends on the number of cycles
   recorded in the base-calling profile. The user must set the read length
   to no more than half the number of cycles recorded in the base-calling
   profile.

c) When masking quality values, the program uses the same EAMSS algorithm
   as CASAVA v1.8.0. This is only done if the --eamss option is supplied.

d) The program parses one chromosome at a time. Reads are evenly
   distributed among the chromosomes.

e) To re-do a simulation and get the exact same results, you must set the
   random seed with the -S parameter. In addition, you must use the
   single-threaded version of pIRS, or else set --threads=4 (for only one
   simulator thread).

f) pIRS will shift quality values when the substitution-error rate set by
   the user is different from that in the profile. The quality scores in the
   output data range from 2 to the maximum score in the profile.
   You can find the real substitution-error rate of the output data in the
   file *.error_rate.distr.

For update & support, please refer to http://code.google.com/p/pirs/ .
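
Note f) and the -Q option both concern how quality scores are encoded in the
output FASTQ files. As a minimal sketch, assuming the standard Phred
definition P = 10^(-Q/10), the Python helpers below decode a quality string
with a given ASCII shift and report the mean expected error probability. The
function names are hypothetical and are not part of pIRS.

```python
def decode_qualities(quality_string, shift=33):
    """Convert an ASCII-encoded FASTQ quality string to integer Phred scores.

    shift corresponds to the value given to -Q (usually 33 or 64).
    """
    scores = [ord(ch) - shift for ch in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError("quality character below the given shift")
    return scores


def mean_error_probability(scores):
    """Mean per-base error probability under the Phred model P = 10^(-Q/10)."""
    if not scores:
        raise ValueError("no quality scores given")
    return sum(10 ** (-q / 10) for q in scores) / len(scores)
```

For instance, the character 'g' (ASCII 103) decodes to 39 with a shift of 64
but to 70 with a shift of 33, which suggests that the example reads in
section 5 were written with a shift of 64. The value computed here is only
an expectation derived from the quality scores; the error rate that was
actually simulated is the one recorded in *.error_rate.distr.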