Sondovač 1.2 Basic help

Sondovač (pronounced "Sondovach" in English; a Czech neologism meaning something like "The Prober" or "The Probe Maker") is a script to create orthologous low-copy nuclear probes from transcriptome and genome skim data for target enrichment.

Script summary
================================================================================

Phylogenetics benefits from using a large number of putatively independent nuclear loci and their combination with other sources of information, such as the plastid and mitochondrial genome. Selecting such orthologous low-copy nuclear (LCN) loci is still a challenge for non-model organisms. In recently published phylogenies based on target enrichment of several hundred LCN genes, these loci were selected from transcriptomes, genomes, gene expression studies, the literature, or a combination of these sources. Automated bioinformatic pipelines for the selection of LCN genes are, however, largely absent. We created a user-friendly, automated and interactive script named Sondovač to design LCN loci by a comparison between transcriptome and genome skim data. The script is licensed under the open-source GNU GPL v.3, which allows further modifications. It runs on major Linux distributions and Mac OS X. Strong bioinformatics skills and access to high-performance computer clusters are not required; Sondovač runs on a standard desktop computer equipped with a modern CPU such as an Intel i5 or i7.

Pipeline - how the data are processed
================================================================================

A transcriptome assembly and paired-end genome skim raw data are combined to get hundreds of orthologous LCN loci. Enrichment of multi-copy loci is minimized by using unique transcripts only, which are obtained by comparing all transcripts and removing those sharing ≥90% sequence similarity using BLAT. Before matching the genome skim data against those unique transcripts, reads of plastid (and mitochondrial) origin are removed with Bowtie 2, SAMtools and bam2fastq utilizing reference sequences. Paired-end reads are subsequently combined with FLASH. These processed reads are matched against the unique transcripts with BLAT, retaining matches sharing ≥85% sequence similarity. Transcripts with >1000 BLAT hits (indicating repetitive elements) and BLAT hits containing masked nucleotides are removed before de novo assembly of the BLAT hits to larger contigs with Geneious, using the medium sensitivity / fast setting.

After assembly, only those contigs are retained that comprise exons of a minimum bait length (usually ≥120 bp in case of probe design for phylogenies) and have a certain minimum total locus length (a multiple of the bait length, which should not be too short in order to obtain sufficient phylogenetically informative signal; we recommend at least 600 bp). To ensure that probes do not target multiple similar loci, any probe sequences sharing ≥90% sequence similarity are removed using cd-hit-est, followed by a second filtering step for contigs containing exons of a minimum bait length and totaling the minimum locus length (see comments above). To ensure that plastid sequences are absent from the probes, the probe sequences are matched against the plastome reference sharing ≥90% sequence similarity with BLAT and the hits are removed from the probe set.

The workflow of Sondovač is summarized in Figure 1 of the PDF manual. The steps of Sondovač are consecutively numbered to aid comprehension. Sondovač has three parts: two script parts and an intermediate part using Geneious. The workflow is as follows:

A) sondovac_part_a.sh: Covers steps 1 to 6.

    1) Removal of transcripts sharing ≥90% sequence similarity.
    2) Removal of reads of plastid origin.
    3) Removal of reads of mitochondrial origin (optional).
    4) Combination of paired-end reads.
    5) Matching of the unique transcripts and the filtered, combined genome skim reads sharing ≥85% sequence similarity.
    6) Filtering of BLAT output:
        6.1) Choice of transcript or genome skim sequences for further processing.
        6.2) Removal of transcripts with >1000 BLAT hits.
        6.3) Removal of transcript or genome skim BLAT hits [depending on the selection in (6.1)] containing masked nucleotides.

Input files for sondovac_part_a.sh are FASTA transcriptome data, FASTQ paired-end genome skim reads and a plastome (and possibly also mitochondriome) reference. The input file for Geneious is the output of sondovac_part_a.sh.

B) Geneious: Covers step 7 (see below and PDF manual).

    7) De novo assembly of the transcript or genome skim BLAT hits [depending on the selection in (6.1)] to larger contigs.

Note that you need a copy of Geneious for this step. The output files of Geneious are input files for sondovac_part_b.sh.

C) sondovac_part_b.sh: Covers steps 8 to 11.

    8) Retention of those contigs that comprise exons ≥ bait length and have a certain total locus length.
    9) Removal of probe sequences sharing ≥90% sequence similarity.
    10) Retention of those contigs that comprise exons ≥ bait length and have a certain total locus length.
    11) Removal of probe sequences sharing ≥90% sequence similarity with the plastome reference.

The output file of sondovac_part_b.sh is the final list of probes, from which the user needs to remove the putative plastid sequences indicated in step 11 of the pipeline.

Software dependencies and installation
================================================================================

Requirements to run Sondovač
--------------------------------------------------------------------------------

Sondovač uses several scientific software packages (namely bam2fastq, BLAT, Bowtie2, CD-HIT, FASTX toolkit, FLASH, Geneious, htsjdk, libgtextutils, Picard, and SAMtools - see required versions and links below), and basic UNIX tools (see below).
Sondovač will check whether those programs are installed and available in the PATH (i.e. whether the shell can locate and launch the respective binaries). If you have those packages installed (in current versions), ensure that their binaries are in PATH. This should not be a problem for basic tools available in any UNIX-based operating system, as a basic installation usually contains all needed tools. If you lack some of these basic tools, the script will notify you, and you will have to install them manually. If this is needed, check the documentation for your operating system.

If required programs are not installed, Sondovač will offer to install them. You can use precompiled binaries available together with the script (this is the recommended option) or (sometimes) from the web. In case you would like to compile required software yourself, the script will guide you through this process. This is recommended only for advanced users, as compilation might sometimes be very tricky. Users of Mac OS X can also install those applications using Homebrew (see http://brew.sh/).

For compilation you need Apache Ant, GNU G++, GNU GCC, GIT, Java/OpenJDK, libpng development files, and zlib development files. Ensure you have those tools available - they should be readily available for any UNIX-based operating system.

The following UNIX tools are required to run Sondovač. They are usually readily available in UNIX systems (but see note for Mac OS X below), so there is usually no need to install them manually. The tools are awk, bc, bunzip2, cat, cp, curl or wget, cut, dirname, dos2unix, echo, egrep, cd, g++, gcc, grep, gunzip, join, less, lsb_release or python (for Linux), make, mkdir, paste, perl, pkg-config, pwd, sed, sort, tar, tr, uname, uniq, unzip, wc.
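
The PATH check described above can also be done by hand before launching Sondovač. The following is a minimal sketch and not the script's own implementation: the function name missing_tools is hypothetical, and it relies only on the POSIX "command -v" builtin to decide whether the shell can locate a tool.

```shell
# Hypothetical helper (not part of Sondovač): print every tool from the
# argument list that the shell cannot locate in PATH.
missing_tools() {
  for tool in "$@"; do
    # "command -v" succeeds when the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check a few of the basic UNIX tools listed above
missing_tools awk bc cut grep perl sed sort tar
```

An empty output means that all listed tools were found; any printed name has to be installed or added to PATH first.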

sondovac_part_a.sh requires (and will install) the following software packages:

    * BLAT
    * Bowtie2
    * SAMtools
    * bam2fastq (will be replaced by Picard in a future release)
    * FLASH
    * FASTX-toolkit

sondovac_part_b.sh requires (and will install) the following software packages:

    * CD-HIT
    * BLAT

For Mac OS X users, Homebrew (http://brew.sh/ and https://github.com/Homebrew/) will be installed by the script, and it will install (new software or newer versions) Apache Ant, BASH (the shell interpreter), GNU AWK, GNU coreutils, GNU GCC, git, GNU grep, GNU make, pkg-config, GNU sed, and wget. Mac OS X is missing some tools and for others (typically sed, grep or awk) contains outdated BSD versions. The script will guide the user through the process, and it is possible to safely and easily remove these tools afterwards if the user wishes to do so.

See the PDF manual for details about tools required by Sondovač and their manual installation. For most users it should be sufficient to be guided by the script to install needed tools automatically.

When Sondovač starts, a directory "bin" is created in the current working directory. Sondovač saves binaries of required software packages in this directory (if they are not already available). The user can then add this directory to PATH, move it or delete it afterwards. See the PDF manual for details.

Geneious
--------------------------------------------------------------------------------

Geneious is a DNA alignment, assembly, and analysis software and one of the most common software platforms used in genomics. It is utilized for de novo assembly in Sondovač. We plan to replace it with a free open-source command line tool in a future release of Sondovač. Visit http://www.geneious.com/ for download, purchase, installation and usage of Geneious.

After the input data are processed (interactively or not) by sondovac_part_a.sh, the user must process its output manually by Geneious according to the instructions given below.
The output of Geneious is then processed by sondovac_part_b.sh, which produces the final probe set. Geneious was tested with versions 6, 7 and 8.

Import the output file of part A of the script (sondovac_part_a.sh): go to menu File | Import | From File... This file is named:

    *_blat_unique_transcripts_versus_genome_skim_data-no_missing_fin.fsa

Select the file and go to menu Tools | Align / Assemble | De Novo Assemble. In "Data" frame select "Assemble by 1st (...) Underscore". In "Method" frame select Geneious Assembler (if you don't have other assemblers, this option might be missing) and "Medium Sensitivity / Fast" Sensitivity. In "Results" frame check "Save assembly report", "Save list of unused reads", "Save in sub-folder", "Save contigs" (do not check "Maximum") and "Save consensus sequences". Do not trim. Otherwise keep defaults. Run it. Geneious may warn about possible hanging because of big file size. Do not use Geneious for other tasks during the assembly. Running Geneious may take a long time.

Select all resulting contigs (typically named "* Contig #") and export them (go to menu File | Export | Selected Documents...) as "Tab-separated table values (*.tsv)". Save the following columns (hold the Ctrl key to mark more fields): "# Sequences", "% Pairwise Identity", "Description", "Mean Coverage", "Name" and "Sequence Length". If this option is not accessible, export all columns.

Warning! For this TSV export, do not select and export "* Consensus Sequences", "* Unused Reads" or "* Assembly Report" - only the individual "* Contig #" files.

Select items "Consensus Sequences" and "Unused Reads" and export them as one FASTA. Go to menu File | Export | Selected Documents... and choose FASTA file type.

Use the exported files from Geneious as input for part B of the script (sondovac_part_b.sh).
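
Before handing the two exported files to sondovac_part_b.sh (parameters "-x" for the TSV and "-z" for the FASTA), a quick sanity check can catch swapped or empty exports. The sketch below is only an illustration under stated assumptions: the function name check_geneious_exports and the file names are hypothetical, and it assumes that the TSV export starts with a tab-separated header line and that the FASTA export starts with a ">" label line.

```shell
# Hypothetical pre-flight check (not part of Sondovač).
# Usage: check_geneious_exports CONTIGS.tsv SEQUENCES.fasta
check_geneious_exports() {
  tsv=$1
  fasta=$2
  # Both files must exist and be non-empty
  [ -s "$tsv" ] && [ -s "$fasta" ] || return 1
  # Assumption: first line of the TSV export contains at least one tab
  head -n 1 "$tsv" | grep -q "$(printf '\t')" || return 1
  # Assumption: first line of the FASTA export is a ">" label line
  head -n 1 "$fasta" | grep -q '^>' || return 1
}
```

If the check passes, part B can be started, for example with "./sondovac_part_b.sh -n -c referencecp.fa -x contigs.tsv -z sequences.fasta" (file names are placeholders).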

Command-line parameters to run Sondovač
================================================================================

General parameters: Shared by sondovac_part_a.sh as well as sondovac_part_b.sh.

    -h, -v
        Print help message and exit.
    -u
        Check for updates. If there is a newer version of Sondovač available on https://github.com/V-Z/sondovac/releases/, download of the newer version will be offered to the user.
    -l
        Display LICENSE for license information (this script is licensed under GNU GPL v.3, other software under variable licenses). Exit viewing by pressing the "Q" key.
    -r
        Display README (this file) for detailed usage instructions. Exit viewing by pressing the "Q" key. More information is available in the PDF manual.
    -p
        Display INSTALL for detailed installation instructions. Exit viewing by pressing the "Q" key. More information is available in the PDF manual.
    -e
        Display detailed citation information and exit. See the PDF manual for more information.
    -o
        Set name of output files. Output files will start with that name. Do not use spaces or special characters - some software cannot handle them correctly. Default value (if the user does not provide another name) is "output". See below for the list of produced output files.
    -i
        Running in interactive mode - the script will ask on demand for the required input files, installation of missing software etc. This is the recommended default: the script runs interactively unless option "-n" is given.
    -n
        Running in non-interactive mode. The user must provide at least the required input files (see below). You can use only one of the parameters "-i" or "-n" (not both of them). If the script fails to find some of the required software packages, it will exit. This is recommended for batch or repeated analysis, on remote servers and for more advanced users. The user must be sure that all required software is installed (see INSTALL and PDF manual for details).
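
Because names passed with "-o" must not contain spaces or special characters, it can help to validate a name before starting a long run, especially in non-interactive mode. This is a minimal sketch, not Sondovač's own check: the function name is_safe_name is hypothetical, and the allowed character set (ASCII letters, digits, dot, underscore, hyphen and the path separator) is an assumption, since this README does not define "special characters" precisely.

```shell
# Hypothetical helper (not part of Sondovač): succeed only if the name is
# non-empty and consists of characters assumed to be safe for all tools.
is_safe_name() {
  case $1 in
    # Empty string, or any character outside the assumed safe set
    ''|*[!A-Za-z0-9._/-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example: refuse to continue with an unsafe output name
is_safe_name "Desktop/MyFile" || echo "Unsafe output name" >&2
```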

Input files: Those parameters are required when running in non-interactive mode. The parameters are optional in default interactive mode. Please use file names without spaces and without special characters.

    -f FILE
        Transcriptome input file in FASTA format.
        Used by: sondovac_part_a.sh
    -c FILE
        Plastome reference sequence input file in FASTA format.
        Used by: sondovac_part_a.sh, sondovac_part_b.sh
        Plastome reference sequences from taxa up to the same order of the studied plant group are suitable. See Shannon C K. Straub, Matthew Parks, Kevin Weitemier, Mark Fishbein, Richard C. Cronn and Aaron Liston; American Journal of Botany (2012) 99(2): 349-364, http://www.amjbot.org/content/99/2/349.short
    -m FILE
        Mitochondriome reference sequence input file in FASTA format (optional).
        Used by: sondovac_part_a.sh
        This step is optional, as plant mitochondrial genomes have largely variable sizes and high rearrangement rates.
    -t FILE
        Paired-end genome skim input file in FASTQ format (first file).
        Used by: sondovac_part_a.sh
    -q FILE
        Paired-end genome skim input file in FASTQ format (second file).
        Used by: sondovac_part_a.sh
    -x FILE
        Input file in TSV format (output of Geneious assembly).
        Used by: sondovac_part_b.sh
    -z FILE
        Input file in FASTA format (output of Geneious assembly).
        Used by: sondovac_part_b.sh

Optional parameters: See chapter "Pipeline" for steps referred here. If those parameters are not provided, the default values are used, and it is not possible to change them at a later point (not even in interactive mode).

    -a ###
        Maximum overlap length expected in approximately 90% of read pairs (parameter -M of FLASH, see its manual for details). FLASH cannot combine paired-end reads that do not overlap by at least 10 bp (default minimum overlap length).
        Step 4 of Sondovač, sondovac_part_a.sh.
        DEFAULT: 65
        OPTIONS: Integer ranging from 10 to 300
    -y ##
        Sequence similarity between unique transcripts and the filtered, combined genome skim reads (parameter -minIdentity of BLAT, see its manual for details). Filtering for orthologs, using sequence similarity as criterion.
        Step 5 of Sondovač, sondovac_part_a.sh.
        DEFAULT: 85 (highly recommended)
        OPTIONS: Integer ranging from 70 to 100
    -g
        Choice of transcript or genome skim sequences for further processing. Depending on the phylogenetic depth that should be obtained, the probe sequences need to be designed from either the transcript or genome skim sequences, or it might not matter (if the taxa, from which the transcriptome and genome skim data were generated, are closely related).
        Step 6.1 of Sondovač, sondovac_part_a.sh.
        DEFAULT: no usage of -g (genome skim sequences)
        OPTIONS: usage of -g (transcript sequences)
    -s ####
        Number of BLAT hits per transcript when matching unique transcripts and the filtered, combined genome skim reads. Transcripts with a high number of BLAT hits, indicating repetitive elements, need to be removed from the putative probe sequences.
        Step 6.2 of Sondovač, sondovac_part_a.sh.
        DEFAULT: 1000
        OPTIONS: Integer ranging from 100 to 10000
    -b ###
        Minimum exon (bait) length. The minimum exon length should not fall below the bait length in order to account for specific binding between genomic libraries and baits during hybridization.
        Steps 8 and 10 of Sondovač, sondovac_part_b.sh.
        DEFAULT: 120 (preferred length for phylogeny)
        OPTIONS: 80, 100, 120
    -k ###
        Minimum total locus length. When running the script in interactive mode, the user will be asked which value to use. To facilitate this choice, a table will be displayed that summarizes the total number of LCN loci resulting from the probe design for each minimum total locus length that the user can select (600 bp, 720 bp, 840 bp, 960 bp, 1080 bp, 1200 bp).
        Steps 8 and 10 of Sondovač, sondovac_part_b.sh.
        DEFAULT: 600
        OPTIONS: 600, 720, 840, 960, 1080, 1200
    -d 0.##
        Sequence similarity between probe sequences (parameter -c of cd-hit-est, see its manual for details). Probes that target multiple similar loci need to be removed.
        Step 9 of Sondovač, sondovac_part_b.sh.
        DEFAULT: 0.9 (highly recommended)
        OPTIONS: Decimal ranging from 0.85 to 0.95
    -y ##
        Sequence similarity between probe sequences and plastome reference (parameter -minIdentity of BLAT, see its manual for details). Some plastid reads might not have been removed in step 2; they should be removed by this step.
        Step 11 of Sondovač, sondovac_part_b.sh.
        DEFAULT: 90 (highly recommended)
        OPTIONS: Integer ranging from 70 to 100

Examples:

The simplest usage (running in interactive mode):

    ./sondovac_part_a.sh -i

Specify some of the required input files, otherwise run interactively:

    ./sondovac_part_a.sh -i -f input.fa -t reads1.fastq -q reads2.fastq

Running in non-interactive, automated mode:

    ./sondovac_part_a.sh -n -f input.fa -c referencecp.fa -m referencemt.fa \
      -t reads1.fastq -q reads2.fastq

Modify parameter "-a", otherwise run interactively:

    ./sondovac_part_a.sh -i -a 300

Run in non-interactive mode (parameter "-n") - in such a case the user must specify all required input files (parameters "-f", "-c", "-t" and "-q"; the mitochondriome reference "-m" is optional). Moreover, parameter "-y" is modified:

    ./sondovac_part_a.sh -n -f input.fa -c referencecp.fa -m referencemt.fa \
      -t reads1.fastq -q reads2.fastq -y 90

Modifying parameter "-s". Note that the interactive mode -i is implicit and does not need to be specified explicitly:

    ./sondovac_part_a.sh -s 950

Input and output files
================================================================================

All names of input files and paths to them must be without spaces and without special characters (some software has difficulties handling these).

Script sondovac_part_a.sh requires as input files:

    1) Transcriptome input file in FASTA format. Note: For technical reasons, the labels of FASTA sequences must be unique numbers (no other characters). Sondovač will check the labels, and if they are not in an appropriate form, a copy of this input file with correct labels will be created.
    2) Plastome reference sequence input file in FASTA format.
    3) Paired-end genome skim input file in FASTQ format (two files - forward and reverse reads).
    4) OPTIONAL: Mitochondriome reference sequence input file in FASTA format. This file is not required.

Script sondovac_part_a.sh creates the following files:

    1) *_renamed.fasta - If needed, a copy of the transcriptome input file with the changed labels of the FASTA sequences (unique numbers corresponding to the line numbers in the original file) will be created. The accompanying file *_old_and_new_names.tsv then contains two columns: 1) the original sequence labels as in the user-provided transcriptome input file and 2) new sequence labels. This might be useful to trace back certain sequences/probes.
    2) *_blat_unique_transcripts.psl - Output of BLAT (removal of transcripts sharing ≥90% sequence similarity).
    3) *_unique_transcripts.fasta - Unique transcripts in FASTA format.
    4) *_genome_skim_data_no_cp_reads* - Genome skim data without cpDNA reads.
    5) *_genome_skim_data_no_cp_no_mt_reads* - Genome skim data without mtDNA reads - only if mitochondriome reference sequence was used.
    6) *_combined_reads_co_cp_no_mt_reads* - Combined paired-end genome skim reads.
    7) *_blat_unique_transcripts_versus_genome_skim_data.pslx - Output of BLAT (matching of the unique transcripts and the filtered, combined genome skim reads sharing ≥85% sequence similarity).
    8) *_blat_unique_transcripts_versus_genome_skim_data.fasta - Matching sequences in FASTA.
    9) *_blat_unique_transcripts_versus_genome_skim_data-no_missing_fin.fsa - Final FASTA sequences for usage in Geneious.

Files 1-8 are not necessary for further processing by this pipeline, but may be useful for the user. The last file (9) is used as input file for Geneious in the next step.

An asterisk (*) denotes the beginning of the output files' names specified by the user with parameter "-o". If the user does not select a custom name, the default value ("output") will be used.
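
The relabelling behind *_renamed.fasta and *_old_and_new_names.tsv can be pictured with a short awk sketch. It is an illustration only and not the code used by Sondovač: the function name relabel_fasta is hypothetical, and it assumes that the new label of each sequence is simply the line number of its header in the original file, written next to the original label in a two-column TSV.

```shell
# Hypothetical illustration (not Sondovač's own code).
# Usage: relabel_fasta INPUT.fasta RENAMED.fasta OLD_AND_NEW_NAMES.tsv
relabel_fasta() {
  in_fasta=$1
  out_fasta=$2
  out_names=$3
  : > "$out_names"   # start with an empty names table
  awk -v names="$out_names" '
    /^>/ {
      # Header line: record "old label <TAB> new label", then print the
      # new numeric label (assumed to be the line number of the header)
      print substr($0, 2) "\t" NR >> names
      print ">" NR
      next
    }
    { print }        # sequence lines are copied unchanged
  ' "$in_fasta" > "$out_fasta"
}
```

A probe can then be traced back to its original transcript by looking up its numeric label in the second column of the names table.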

Geneious requires as input the last output file of sondovac_part_a.sh (file 9: *_blat_unique_transcripts_versus_genome_skim_data-no_missing_fin.fsa).

The output of Geneious consists of two exported files (see Geneious chapter):

    1) Final assembled sequences exported as TSV.
    2) Final assembled sequences exported as FASTA.

Script sondovac_part_b.sh requires as input files:

    1) Plastome reference sequence input file in FASTA format.
    2) Assembled sequences exported from Geneious as TSV.
    3) Assembled sequences exported from Geneious as FASTA.

Script sondovac_part_b.sh creates the following files:

    1) *_prelim_probe_seq.fasta - Preliminary probe sequences.
    2) *_prelim_probe_seq_cluster_100.fasta - Unclustered exons and clustered exons with 100% sequence identity.
    3) *_prelim_probe_seq_cluster_90.clstr - Unclustered exons and clustered exons with more than a certain sequence similarity (CLSTR file).
    4) *_unique_prelim_probe_seq.fasta - Unclustered exons / exons with less than a certain sequence similarity.
    5) *_similarity_test.fasta - Contigs that comprise exons ≥ bait length and have a certain total locus length.
    6) *_target_enrichment_probe_sequences.fasta - Probes in FASTA.
    7) *_possible_cp_dna_genes_in_probe_set.pslx - In case of any BLAT hits, the user might need to manually remove these plastid probe sequences from *_target_enrichment_probe_sequences.fasta (the previous output file); the remaining ones are the final probe sequences in FASTA.

An asterisk (*) denotes the beginning of the output files' names specified by the user with parameter "-o". If the user does not select a custom name, the default value ("output") will be used.

By default, output files are created in the same directory from which Sondovač was launched. Output files can be saved in a custom directory by specifying an output directory together with parameter "-o":

    # Find current directory (e.g. /home/user):
    pwd
    # Launch Sondovač located in directory /home/user/sondovac and save output
    # to e.g. desktop (/home/user/Desktop):
    ./sondovac/sondovac_part_a.sh -o Desktop/MyFile
    # Sondovač will save software (if needed) in "bin" directory located in
    # directory from which it was launched, see it:
    ls bin/*
    # Output files are in desired directory, see them e.g. by:
    ls -lh Desktop/MyFile*

Sample data
================================================================================

Together with the script, we provide a ZIP archive (1.8 GB) that contains example input files for running the script: Oxalis genome skim data as well as the Ricinus cpDNA and mtDNA reference sequences. See https://github.com/V-Z/sondovac/wiki/Sample-data for download of sample data.

The package contains:

    1) input2_Ricinus_communis_reference_plastid_genome.fsa - cpDNA reference (parameter -c), GenBank reference number NC_016736.
    2) input3_J12_Oxalis_obtusa_J12_genome_skim_data_R1.fastq - paired-end genome skim data, file 1 (parameter -t).
    3) input4_J12_Oxalis_obtusa_J12_genome_skim_data_R2.fastq - paired-end genome skim data, file 2 (parameter -q).
    4) input5_Ricinus_communis_reference_mitochondrial_genome.fasta - mtDNA reference (parameter -m), GenBank reference number NC_015141.

The transcriptome input file is unpublished data from G. K.-S. Wong et al. As soon as the data are published, we will post them on GitHub. Data can now be found under

    * http://www.onekp.com/
    * http://www.onekp.com/samples/list.php
    * http://www.onekp.com/samples/single.php?id=JHCN

The transcriptome FASTA file used for the probe design is named JHCN-SOAPdenovo-Trans-assembly.dnas.out and can be found under JHCN/Assembly/JHCN-SOAPdenovo-Trans-translated/.
Information on access to data download is given in Matasci N, Hung L-H, Yan Z, Carpenter EJ, Wickett NJ, Mirarab S, Nguyen N, Warnow T, Ayyampalayam S, Barker M, Burleigh JG, Gitzendanner MA, Wafula E, Der JP, dePamphilis CW, Roure B, Philippe H, Ruhfel BR, Miles NW, Graham SW, Mathews S, Surek B, Melkonian M, Soltis DE, Soltis PS, Rothfels C, Pokorny L, Shaw JA, DeGironimo L, Stevenson DW, Villarreal JC, Chen T, Kutchan TM, Rolf M, Baucom RS, Deyholos MK, Samudrala R, Tian Z, Wu X, Sun X, Zhang Y, Wang J, Leebens-Mack J and Wong G K-S (2014) Data access for the 1,000 Plants (1KP) project. GigaScience 3:17, http://www.gigasciencejournal.com/content/3/1/17/

Record output of Sondovač
================================================================================

To record the whole output of Sondovač (regardless of used parameters), use the utility "tee". This will produce a plain text output with everything printed to the screen. It can be useful for reference or for exploration if something went wrong. Use it as follows:

    ./sondovac_part_a.sh | tee records.log

You can use any command line arguments; the script will behave as usual. The plain text file "records.log" will then contain all its output.

License
================================================================================

The script is licensed under the open-source GNU GPL v.3, which allows redistribution and modifications. Check file LICENSE for details and feel free to enhance the script.

Citation
================================================================================

When using Sondovač please cite:

    Roswitha Schmickl, Aaron Liston, Vojtěch Zeisek, Kenneth Oberlander, Kevin Weitemier, Shannon C.K. Straub, Richard C. Cronn, Léanne L. Dreyer and Jan Suda
    Phylogenetic marker development for target enrichment from transcriptome and genome skim data: the pipeline and its application in southern African Oxalis (Oxalidaceae)
    Molecular Ecology Resources (2016) - early view available on-line
    http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12487/abstract

Other questions not covered here and reporting problems
================================================================================

If you have a question or you encounter a problem, please see https://github.com/V-Z/sondovac/issues and feel free to ask any question and/or express any wish. The authors will do their best to help you.

Software references
================================================================================

bam2fastq:
    * http://gsl.hudsonalpha.org/information/software/bam2fastq

BLAT:
    * W. James Kent
      BLAT – the BLAST-like alignment tool
      Genome Research (2002) 12:656-664
      http://genome.cshlp.org/content/12/4/656.short

Bowtie2:
    * Ben Langmead and Steven L. Salzberg
      Fast gapped-read alignment with Bowtie 2
      Nature Methods (2012) 9:357-359
      http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html

CD-HIT:
    * Weizhong Li, Lukasz Jaroszewski and Adam Godzik
      Clustering of highly homologous sequences to reduce the size of large protein databases
      Bioinformatics (2001) 17:282-283
      http://bioinformatics.oxfordjournals.org/content/17/3/282.short
    * Weizhong Li, Lukasz Jaroszewski and Adam Godzik
      Tolerating some redundancy significantly speeds up clustering of large protein databases
      Bioinformatics (2002) 18:77-82
      http://bioinformatics.oxfordjournals.org/content/18/1/77.short
    * Weizhong Li and Adam Godzik
      Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
      Bioinformatics (2006) 22:1658-1659
      http://bioinformatics.oxfordjournals.org/content/22/13/1658.short
    * Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li
      CD-HIT: accelerated for clustering the next generation sequencing data
      Bioinformatics (2012) 28:3150-3152
      http://bioinformatics.oxfordjournals.org/content/28/23/3150.short
    * Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li
      CD-HIT Suite: a web server for clustering and comparing biological sequences
      Bioinformatics (2010) 26:680
      http://bioinformatics.oxfordjournals.org/content/26/5/680.short
    * Beifang Niu, Limin Fu, Shulei Sun and Weizhong Li
      Artificial and natural duplicates in pyrosequencing reads of metagenomic data
      BMC Bioinformatics (2010) 11:187
      http://www.biomedcentral.com/1471-2105/11/187
    * Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu and John Wooley
      Ultrafast clustering algorithms for metagenomic sequence analysis
      Briefings in Bioinformatics (2012) 13(6):656-668
      http://bib.oxfordjournals.org/content/13/6/656.abstract

FASTX toolkit:
    * A. Gordon and G. J. Hannon
      FASTX-Toolkit. FASTQ/A short-reads pre-processing tools
      2010
      http://hannonlab.cshl.edu/fastx_toolkit/

FLASH:
    * Tanja Magoč and Steven L. Salzberg
      FLASH: fast length adjustment of short reads to improve genome assemblies
      Bioinformatics (2011) 27(21):2957-2963
      http://bioinformatics.oxfordjournals.org/content/27/21/2957.abstract

Geneious:
    * Matthew Kearse, Richard Moir, Amy Wilson, Steven Stones-Havas, Matthew Cheung, Shane Sturrock, Simon Buxton, Alex Cooper, Sidney Markowitz, Chris Duran, Tobias Thierer, Bruce Ashton, Peter Meintjes, and Alexei Drummond
      Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data
      Bioinformatics (2012) 28(12):1647-1649
      http://bioinformatics.oxfordjournals.org/content/28/12/1647

grab_syngleton_clusters.py:
    * Kevin Weitemier, Shannon C K Straub, Richard C Cronn, Mark Fishbein, Roswitha E. Schmickl, Angela McDonnell, Aaron Liston
      Hyb-Seq: Combining target enrichment and genome skimming for plant phylogenomics
      Applications in Plant Sciences (2014) 2(9):1-7
      http://www.bioone.org/doi/abs/10.3732/apps.1400042

Picard:
    * http://broadinstitute.github.io/picard

SAMtools:
    * Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin and 1000 Genome Project Data Processing Subgroup
      The Sequence Alignment/Map format and SAMtools
      Bioinformatics (2009) 25(16): 2078-2079
      http://bioinformatics.oxfordjournals.org/content/25/16/2078.abstract
    * Heng Li
      A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
      Bioinformatics (2011) 27(21): 2987-2993
      http://bioinformatics.oxfordjournals.org/content/27/21/2987.abstract
    * Heng Li
      Improving SNP discovery by base alignment quality
      Bioinformatics (2011) 27(8): 1157-1158
      http://bioinformatics.oxfordjournals.org/content/27/8/1157.short