/SimSeq

An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads, and non-biotinylated fragment pull down in mate-pair libraries . Additionally the program provides the ability to model both site and base specific error, and scripts are provided to train this error model on real datasets. My hope in creating this program is to generate as realistic data as possible to assist in assessing the accuracy of genome assembly tools.

Primary LanguageCOtherNOASSERTION

SimSeqNBProject currently contains the net beans project with the initial source

getErrorProfile contains the c source for the program that generates the
error model for the sequence simulator. This program utilizes a subset
of the kent source library (http://hgwdev.cse.ucsc.edu/~kent/src/) which
is freely available for both commercial and non-commercial use 
(see README in "ThirdParty/kent" for more information)


examples contains some things to run this on. Right now there is
an example mitochondrial sequence, and an example error profile
from a 100 bp illumina GAIIx dataset. You can simulate any read
less than or equal to 100bp in length using that error profile.



usage: java -jar -Xmx2048m SimSeq.jar [required options] [options]
            Last Updated: 4.12.2011
 -1,--read1_length <arg>            Integer length of first read. Default:
                                    100
 -2,--read2_length <arg>            Integer length of second read.
                                    Default: 100
    --adapter1 <arg>                The first read illumina adapter
                                    sequence to use when the insert size
                                    is less than the read length.
                                    Currently only works on paired end
                                    simulation.
                                    Defalt:AGATCGGAAGAGCGGTTCAGCAGGAATGCCG
                                    AGACCG
    --adapter2 <arg>                The second read illumina adapter
                                    sequence to use when the insert size
                                    is less than the read length.
                                    Currently only works on paired end
                                    simulation.
                                    Defalt:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
                                    GT
 -d,--dip <arg>                     If diploid data desired, path to
                                    diploid file. (format: chrom pos(0
                                    based) altChar
    --debug                         Write debug info to stderr.
 -e,--error <arg>                   If simulated read error desired, path
                                    to read error file.
    --error2 <arg>                  If you desire a seperate error
                                    distribution to be applied to the
                                    second read, then provide a path to
                                    that error profile with this option
 -h,--help                          Print this usage message.
    --inf_id                        Output location information in the
                                    read ID. Currently this is only
                                    informative for paired-end data and
                                    note mate paired.
 -l,--insert_size <arg>             mean library insert size for either
                                    mate-paired or paired-end. Default:
                                    200
 -m,--mate_pair                     Perform mate-pair rather than paired
                                    end run.
    --mate_frag <arg>               If using a mate-pair library, what is
                                    your desired loop fragmentation size?
                                    Default: 500
    --mate_frag_stdev <arg>         If using a mate-pair library, what is
                                    your desired loop fragmentation size
                                    standard deviation? Default: 50
    --mate_pulldown_error_p <arg>   If using a mate-pair library, what is
                                    the probability that a read does not
                                    include the biotin marker? Default:
                                    0.3
 -n,--read_number <arg>             Integer number of reads you would like
                                    to sample. Default: 1000000
 -o,--out <arg>                     Filename for output sam file
                                    (REQUIRED)
 -p,--read_prefix <arg>             Prefix for simulated reads. Default:
                                    SimSeq_
    --phred64                       Output phred+64 quality string rather
                                    than phred+33
 -r,--reference <arg>               Reference genome sequence in
                                    uncompressed fasta format (REQUIRED)
 -s,--insert_stdev <arg>            mean library insert stdev for either
                                    mate-paired or paired-end. Default: 20
 -u,--duplicate_probability <arg>   probability of generating a duplicate.
                                    Default: 0.0

===Quick Start Example:===
==Prerequisits==
0.0. Download and install samtools: http://samtools.sourceforge.net/ and add executable to your $PATH
0.1. Download latest Picard jar file: http://sourceforge.net/projects/picard/files/, and extract .jar files somewhere that makes sense like $HOME/jars/, for the rest of this I will assume that all jar files are located in $HOME/jars/
==Generate and check simulated reads==
1. Download SimSeq.jar, error_profile.txt, and AlligatorMito.fa
2. put SImSeq.jar in your jar directory ($HOME/jars/ or something) and put error_profile.txt and AlligatorMito.fa in a folder where you want to work
3. Run `java -jar -Xmx2048m $HOME/jars/SimSeq.jar -1 100 -2 100 \
    --error hiseq_mito_default_bwa_mapping_mq10_1.txt \
    --error2 hiseq_mito_default_bwa_mapping_mq10_2.txt \
    --insert_size 3000 --insert_stdev 300 \
    --mate_pair --mate_frag 500 --mate_frag_stdev 50 \
    --mate_pulldown_error_p 0.3 --read_number 10000 \
    --read_prefix AMito_ --reference AlligatorMito.fa \
    --duplicate_probability 0.01 --out out.sam`
4. Convert sam output to bam: `samtools view -bS -T AlligatorMito.fa -t AlligatorMito.size  -o out.bam out.sam`
5. Sort the bam file: `samtools sort out.bam out.sorted`
6. Index the sorted bam file: `samtools index out.sorted.bam`
7. Check the alignment in tview: `samtools tview out.sorted.bam AlligatorMito.fa`
==Convert Sam output to Fastq==
run `java -jar -Xmx2048 $HOME/jars/SamToFastq.jar INPUT=out.sorted.bam FASTQ=library.1.fastq SECOND_END_FASTQ=library.2.fastq INCLUDE_NON_PF_READS=true VALIDATION_STRINGENCY=SILENT`

Note: setting the VALIDATION_STRINGENCY=SILENT is probably not ideal, but for now it is necessary because samtools isn't putting the length of the reference sequence into the header, which is expected by Picard. You get an error about the first read having a coordinate outside of the range of the reference sequence which has a length of 0. If you use Picard to somehow fix the header given the reference sequence this should work without telling it not to validate your sequence.

Also Note: The current sam alignment for a chimeric read shows the fragment starting at the left-most coordinate. Since the rest of the sequence occurs before the left most coordinate I use a soft clipping for the remainder of the sequence. This does not effect the output fastq file generated by the above method, but when you are looking at the alignment in samtools tview you will see some sequences as short as 1 nucleotide. This is normal, and doesn't mean that the actual sam records are that short.


When producing chimeric reads, samtools doesn't currently allow you to view the backwards jumps that occur. Doing so would involve a '-xN' operation on a cigar string. I added a custom tag ('YC:Z:[CIGAR]') at the end of each sam string that shows either a normal cigar string if the read can be represented with one, or a custom cigar string that allows for negative jumps needed to represent chimeric reads if the read is chimeric. Here is an example of what a chimeric read's tag would look like `YC:Z:97M-4996N3M` while a non-chimeric read would have a normal cigar string in its tag like: `YC:Z:100M`. To reconstruct a chimeric read, you start at the start coordinate for the read which corresponds to the left most coordinate, then you proceed for the number of matches 'M', then you jump the specified number of positions from your current location, in this case -4996, and finally proceed with the remaining CIGAR string. Right now since I don't add indels to the data, this string will either be a match or one of these chimeric jumps.