README file
===========

This repository contains code developed by the Chang lab at the
University of Texas Health Science Center at Houston.  The code is
experimental and not thoroughly tested.  Changes to one part of the
code frequently break other parts, so there will certainly be bugs and
incompatibilities.  However, we do try to code defensively so that
problems lead to tracebacks and errors, rather than incorrect results.
If you get an error, or something doesn't seem to be working, PLEASE
LET US KNOW.

Most of the code here is written in Python, although there is some C
code and some R scripts as well.

Contact:
Jeffrey Chang
jeffrey.t.chang@uth.tmc.edu
http://changlab.uth.tmc.edu/


CONTENTS
========

README             This file.
USAGE              Instructions for using BETSY.
LICENSE            How this code is licensed.
Betsy/             BETSY expert system for Bioinformatics pipelines.
samples/           Example files for BETSY file formats.
scripts/           Python scripts for use from the command line.
Rlib/              Library of R code.
arrayio/           Python library for reading gene expression matrices.
genomicode/        Miscellaneous Python modules.
queue/             EXPERIMENTAL.  Not used.
web2py/            EXPERIMENTAL.  Not used.
conda_install.sh   Installation with Miniconda.


INSTALL
=======

1.  To build and install this package, type:

    python setup.py build
    # If there are no errors:
    sudo python setup.py install

    This setup script will install the following Python packages (in
    the default location for your setup,
    e.g. /usr/lib/python2.7/site-packages/):

    genomicode/
      bin/
    arrayio/
    Betsy/

2.  The genomicode/bin/ path (described in step 1) contains executable
    scripts, including those that run the BETSY system, such as:

    betsy_run.py
    betsy_manage_cache.py
    pybinreg.py

    Either symlink these files into a path that your shell will search
    for executable files, e.g.:

    ln -s /usr/lib/python2.7/site-packages/genomicode/bin/betsy_run.py \
      /usr/local/bin/

    Or add the directory to your search path, e.g.:

    export PATH="${PATH}:/usr/lib/python2.7/site-packages/genomicode/bin/"

3.  The setup script will also install two files in the user's home
    directory that will need to be configured:

    ${HOME}/.betsyrc
    ${HOME}/.genomicoderc

    Open these files in your favorite text editor and follow the
    instructions in the files to configure the system.

    The setup.py installation script will try to configure these files
    automatically as much as possible.  If it cannot find a program,
    it will leave the option commented out.  If you install the
    program later, please uncomment the line and set it to the
    location of the program.

    These files will direct you to more external dependencies to
    install.  You can decide whether to install them depending on your
    needs (see SYSTEM REQUIREMENTS below).


SYSTEM REQUIREMENTS
===================

Because the packages here call many other tools, there are many
external dependencies, listed below.  It is likely that not all of
them will need to be installed for your use.  I recommend reviewing
these dependencies and installing just the ones that you may use.  You
can install additional ones later if you run into problems during your
analysis (e.g. you get an error message saying that something is
missing).

* Brad Chapman (https://github.com/chapmanb) contributed a script
  (conda_install.sh) that will install these dependencies using the
  Miniconda system (http://conda.pydata.org/docs/).  You can try using
  this to simplify the installation process.


- LINUX or OS X (or macOS) operating system.
  We regularly use the code on these operating systems and have not
  tried running it on any other.


- Python 2.7+ (and libraries)
  Python 3 is not supported.

  REQUIRED python libraries:
  zlib          Part of the Python Standard Library.
  numpy         http://www.numpy.org
                Needed for math functions.
  matplotlib    http://matplotlib.org
                Used for some plots.
  pygraphviz    https://pygraphviz.github.io
                For interfacing with GraphViz.
  rpy2          http://rpy2.bitbucket.org
                To interface to R.
  psutil        https://github.com/giampaolo/psutil

  REQUIRED libraries for reading and writing Excel files.
  xlwt          https://pypi.python.org/pypi/xlwt
  openpyxl      http://openpyxl.readthedocs.io/en/default/
  xlrd          https://pypi.python.org/pypi/xlrd
  arial10       https://github.com/juanpex/django-model-report/blob/master/model_report/arial10.py

  OPTIONAL python libraries.
  PIL           http://www.pythonware.com/products/pil/
                To generate plots, such as heatmaps.
                Configure with: freetype, jpeg, zlib, lcms.
  pandas        http://pandas.pydata.org
                Will speed up some file I/O if available, but is
                otherwise not necessary.
  pysam         Some scripts use it for reading SAM/BAM files.


- R (+ libraries)
  R is used for almost everything, so please install it.
  a.  Please install the Python library rpy2, as described above.
  b.  Be sure R is compiled with support for generating png, tiff, and
      pdf plots.
  c.  Also, make sure that the "Rscript" program is installed.

  REQUIRED R libraries.
  Bioconductor  https://www.bioconductor.org/

  OPTIONAL R libraries.
  biomaRt       Part of Bioconductor.
                For annotating genes.
  GenePattern   http://software.broadinstitute.org/cancer/software/genepattern/programmers-guide#_Using_GenePattern_from_R
                Some of the analyses are done in GenePattern.
  VennDiagram   https://cran.r-project.org/web/packages/VennDiagram/
                To generate Venn diagram plots.

  OPTIONAL R libraries for gene expression analysis.
  affy          Part of Bioconductor.
                For analyzing Affymetrix microarrays.
  oligo         Part of Bioconductor.
                For analyzing Illumina microarrays.
  limma         Part of Bioconductor.
  DESeq2        Part of Bioconductor.
  edgeR         Part of Bioconductor.

  OPTIONAL R libraries for machine learning.
  e1071         https://cran.r-project.org/web/packages/e1071/index.html
  randomForest  https://cran.r-project.org/web/packages/randomForest/index.html


- MATLAB
  OPTIONAL.  Used mostly for microarray analysis.  If you intend to
  analyze microarray data, I highly recommend that you install this.
  Otherwise, you may not need it.


- Plotting tools
  GraphViz      http://www.graphviz.org
                REQUIRED if using BETSY.  Generates network graphs.
                Be sure to install the pygraphviz library (above).
  POV-Ray       http://www.povray.org
                OPTIONAL.  Makes plots in BinReg pathway analysis.


- SIGNATURE package.
  OPTIONAL.  This is used for gene expression pathway analysis.  To
  install it:
  1.  Go to http://www.bioinformatics.org/signature/
  2.  Download the files:
      SIGDB_111025.tgz
      SIGNATURE_161028.tgz
  3.  Uncompress both files.
      * Do not follow the INSTALL instructions in the SIGNATURE
        package unless you would like to use it from a GenePattern
        interface.
  4.  Configure the variables in the [SIGNATURE] section in your
      ~/.genomicoderc file.


- Miscellaneous genomics tools.
  Cluster 3.0   http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
                OPTIONAL.  For clustering gene expression data.
  ComBat        http://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html
                OPTIONAL.  For gene expression normalization.
  DWD           http://marron.web.unc.edu/sample-page/marrons-matlab-software/
                OPTIONAL.  For gene expression normalization.
                Also called "BatchAdjust".


- Utilities for NGS analysis.
  These are only needed if processing next-generation sequencing
  data.
  samtools      http://www.htslib.org
  bcftools      http://www.htslib.org
  Picard        https://broadinstitute.github.io/picard/
  vcftools      https://vcftools.github.io/index.html
  bedtools      http://bedtools.readthedocs.io/en/latest/
  gtfutils      http://ngsutils.org/modules/gtfutils/
  bam2fastx     From Tophat package.


- NGS alignment and processing algorithms.
  This section includes software for processing and aligning reads
  from next-generation sequencing data.
  trimmomatic   http://www.usadellab.org/cms/?page=trimmomatic
                Not alignment, but used to trim reads before alignment.
  FastQC        http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
                For checking quality of reads before alignment.
  bowtie        http://bowtie-bio.sourceforge.net/index.shtml
                Alignment algorithm.
  bowtie2       http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
                Alignment algorithm.
  BWA           http://bio-bwa.sourceforge.net
                Alignment algorithm.
  STAR          https://github.com/alexdobin/STAR
                RNA-specific alignment algorithm.


- RNA-Seq Software.
  Please install if analyzing RNA-Seq data.
  Tophat        https://ccb.jhu.edu/software/tophat/index.shtml
                Estimates gene expression values.
  RSEM          http://deweylab.github.io/RSEM/
                Estimates gene expression values.
  HTSeq         http://www-huber.embl.de/HTSeq/doc/overview.html
                Counts reads.
  RSeQC         http://rseqc.sourceforge.net
                Quality control of RNA-Seq.
                Also be sure to download the gene models and
                ribosomal RNA BED files.


- NGS Variant callers.
  Please install if doing variant calling from NGS data.
  GATK          https://software.broadinstitute.org/gatk/
                Variant caller.  Includes other miscellaneous tools
                for processing the alignments.
  Platypus      http://www.well.ox.ac.uk/platypus

  Somatic variant callers.  Install these for calling mutations in
  cancer genomes, compared against a background of germline
  mutations.
  Mutect        http://archive.broadinstitute.org/cancer/cga/mutect
  Mutect2       Distributed with recent versions of GATK.
  Varscan       http://varscan.sourceforge.net
  SomaticSniper http://gmt.genome.wustl.edu/packages/somatic-sniper/
  MutationSeq   http://compbio.bccrc.ca/software/mutationseq/
  MuSE          http://bioinformatics.mdanderson.org/main/MuSE
  Strelka       https://sites.google.com/site/strelkasomaticvariantcaller/home
  Pindel
  Indelocator   Distributed with recent versions of GATK.
  Radia         https://github.com/aradenbaugh/radia
                For calling mutations from RNA and DNA simultaneously.
  blat          https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
                Needed for Radia caller.
  snpEff        http://snpeff.sourceforge.net
                Needed for Radia caller.
  Annovar       http://annovar.openbioinformatics.org/en/latest/
                For annotating variants.


- NGS ChIP-Seq programs.
  Please install if doing peak finding from ChIP-Seq data.
  MACS 1.4      http://liulab.dfci.harvard.edu/MACS/
  MACS 2        https://github.com/taoliu/MACS
  PeakSeq       http://info.gersteinlab.org/PeakSeq
  SPP           http://compbio.med.harvard.edu/Supplements/ChIP-seq/
  Homer         http://homer.salk.edu/homer/


TESTING
=======

There is not yet an automated system for doing regression testing.
However, there are some tests in Betsy/regr_test.sh that can be run.
You can check your output against that in Betsy/test_output.log.
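
Separately from the regression tests, a quick way to see which of the
Python libraries listed under SYSTEM REQUIREMENTS can be imported is a
short script along the following lines.  This is only a sketch and is
not part of the repository: the function name check_modules is made up
for illustration, and the import names are assumed to match the
library names listed above (for example, arial10 is assumed to be
importable as a module named arial10).

```python
from __future__ import print_function

import importlib

# ASSUMPTION: import names match the library names listed under
# SYSTEM REQUIREMENTS.  Adjust the lists if your installation differs.
REQUIRED = ["zlib", "numpy", "matplotlib", "pygraphviz", "rpy2",
            "psutil", "xlwt", "openpyxl", "xlrd", "arial10"]
OPTIONAL = ["PIL", "pandas", "pysam"]


def check_modules(names):
    """Return the names (in order) that raise ImportError on import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for label, names in [("REQUIRED", REQUIRED), ("OPTIONAL", OPTIONAL)]:
        missing = check_modules(names)
        if missing:
            print("%s libraries not found: %s" % (label, ", ".join(missing)))
        else:
            print("All %s libraries found." % label)
```

A library reported as missing from the OPTIONAL list only matters if
you use the features that depend on it.  The script checks Python
libraries only; it does not look for R, Rscript, or any of the
external programs.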