simNGS -- software for simulating next-generation sequencing data ** Introduction simNGS is software for simulating "perfect" observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software. Observations are perfect in the sense that they are generated according the model and do not incorporate any additional effects that may be present in real data ("dust", bubbles, merged clusters, sequence-heterogeneous clusters, etc). simNGS takes fasta format sequences and a file describing the covariance of noise between bases and cycles observed in an actual run of the machine, randomly generates noisy intensities representing the signals for the sequence at each cycle and calculates likelihoods for all possible base calls. ** Compilation simNGS is written for BSD-style Unices according to the C99 standard and should be relatively portable. Make files for both generic Linux and Mac Os-X in the the src/ directory. The BLAS and LAPack libraries for matrix operations and linear algebra must be installed and in the standard library path. To compile from source, the BLAS and LAPack header files (provided by the blas-devel and lapack-dev packages on most Linux systems) must also be installed. Compiling: cd src make -f Makefile.linux which should produce a binary "bin/simNGS". ** Usage Fasta format sequence are read from stdin and log-likelihoods for the sequences, after the addition of noise, are written to stdout in a format described in "Output format" below. Messages, progress indicators and a summary of errors in the generated data are written to stderr. simNGS accepts several command-line arguments to change its behaviour, brief descriptions of which are available by running 'simNGS --help'. More detailed descriptions are available in doc/parameters.txt cat sequences.fa | simNGS runfile > sequences.like Ambiguous characters are interpretated as "no base present" and treated as a null call: the intensities generated are just noise and do not have an additional component from the sequence, leading to a less bright cluster. Artifact such as bubbles and registration errors could be simulated by placing ambiguity characters in the sequence. Where the sequence to be called is shorter than the expected read length, either because the input was too short or deletion from the mutation model have shortened it, the sequence is padded with ambiguity characters for the remaining cycles. ** Example Produces likelihoods for sequences in test100.fa and outputs to test.like, treating the (single-ended) runfile "s_2_0005.runfile" as paired-end. cat test100.fa | bin/simNGS -o likelihood -p paired data/s_2_0005.runfile > test.like Description of runfile: > # 76 cycle PhiX data from Sanger, single-ended. Very good run > Treating single-ended model as paired-end. > Using seed 1264523667 > Finished generating 100 sequences > Summary of errors, calling by maximum likelihood > Cycle Count Phred lower, upper Count Phred lower, upper > 1: 8 10.97 ( 8.24, 13.86) 6 12.22 ( 9.04, 15.56) > 2: 4 13.98 ( 10.07, 18.05) 4 13.98 ( 10.07, 18.05) The final few lines are a per-cycle description of the raw rate if the bases are called by maximum likelihood. "Count" is the raw number of errors generated, so the proportion of bases at a cycle with an error is Count / Number of Sequences. Phred is the Phred-like "quality" score for the bases, and lower and upper are an 95% confidence for the Phred score generated by transforming a Wilson score interval. Output, trimmed and artificially split over several lines: > 2 5 1165 560 > 1.441031e+00 9.830551e-01 2.178242e+00 2.034257e+00 > 4.003852e+00 4.413710e+00 2.539974e+00 3.105548e+00 > ... more intensities > 2 5 1668 739 > ... more The lane was 2, the tile was 5. The coordinates of the first cluster are 1165,560 (randomly generated as if from a GAII machine [1], uniformly over the tile and not thinned to take into account merged clusters). The likelihoods are encoded as -log(likelihood), so smaller values are the more likely, and the difference between these values (a function of the likelihood-ratio statistic) is a measure of the relative confidence between two calls. No intensities are given for clusters which fail to pass the purity filter. ** Output format The output from simNGS is a simple format based on the "_int.txt" from the Illumina platform. Informally, the file consists of several lines, one line per sequence, with tab separated fields. The first four fields contain the tile, lane number and x & y coordinates of the cluster. There are a number of additional field, equal to the number of cycles, containing (minus) the log-likelihoods for the base-calls at cycle: four space-separated non-negative real numbers in C-style "%e" form. More formally, the grammar in Extended Backus-Naur form is: FILE ::= LINE + LINE ::= LANE , "\t" , TILE , "\t" , X , "\t" , Y , LOGLIKES * , EOL TILE ::= digit + LANE ::= digit + X ::= digit + Y ::= digit + LOGLIKES ::= "\t" , LOGLIKE , " " , LOGLIKE , " " , LOGLIKE , " " , LOGLIKE LOGLIKE ::= non-zero digit , "." , digit + , EXPO EXPO ::= "e" , ( "+" | "-" ) , digit + Paired-end runs are concatenated, so appear as a longer single-ended run with twice as many cycles. ** Runfile format The runfile describes how noise and cluster intensites was distributed in a real run of an Illumina machine, as estimated by AYB. It consists of: 1) One or more lines of free-text describing the run (each prefixed by a '#'). 2) One line containing the number of cyles and the shape and scale of the distribution of the brightness of clusters. 3) Several lines, one per row of the covariance matrix, containing the elements of that row. If the run was paired-ended, (3) is repeated for a second matrix, which should have the same dimensions as the first. The matrices in runfile are covariance matrices for the noise in the run and so must be symmetric positive-definite. [1] X \in {0, ... , 1794} Y \in {0, ... , 2047}