/changlab

Primary LanguagePythonOtherNOASSERTION

README file
===========

This repository contains code developed by the Chang lab at the
University of Texas Health Science Center at Houston.

The code is experimental and not thoroughly tested.  Very frequently,
there are changes to one part of the code that breaks other parts.
There are certainly going to be bugs and incompatibilities.  However,
we do try to code defensively so that problems lead to tracebacks and
errors, rather than incorrect results.  If you get an error, or
something doesn't seem to be working, PLEASE LET US KNOW.

Most of the code here is written in Python, although there are some C
code and R scripts as well.

Contact:
Jeffrey Chang
jeffrey.t.chang@uth.tmc.edu
http://changlab.uth.tmc.edu/


CONTENTS
========

README             This file.
USAGE              Instructions for using BETSY.
LICENSE            How this code is licensed.
Betsy/             BETSY expert system for Bioinformatics pipelines.
  samples/         Example files for BETSY file formats.
scripts/           Python scripts for use from the command line.
Rlib/              Library of R code.
arrayio/           Python library for reading gene expression matrices.
genomicode/        Miscellaneous Python modules.
queue/             EXPERIMENTAL.  Not used.
web2py/            EXPERIMENTAL.  Not used.
conda_install.sh   Installation with Miniconda.



INSTALL
=======

1.  To build and install this package, type:

        python setup.py build
        # If there are no errors:
        sudo python setup.py install

    This setup script will install the following python packages (in
    the default location for your setup,
    e.g. /usr/lib/python2.7/site-packages/):

        genomicode/
          bin/
        arrayio/
        Betsy/


2.  The genomicode/bin/ path (described in step 1) contains executable
    scripts, including those that run the BETSY system, such as:

        betsy_run.py
        betsy_manage_cache.py
        pybinreg.py

    Either symlink these files into a path that your shell will search
    for executable files, e.g:
        ln -s /usr/lib/python2.7/site-packages/genomicode/bin/betsy_run.py \
          /usr/local/bin/

    Or, add this to your search path, e.g.:
        export PATH="${PATH}:/usr/lib/python2.7/site-packages/genomicode/bin/"


3.  The setup script will also install two files in the user's home
    directory that will need to be configured:

        ${HOME}/.betsyrc
        ${HOME}/.genomicoderc

    Open these files in your favorite text editor and follow the
    instructions in the files to configure the system.  The setup.py
    installation script will try to configure this file automatically
    as much as possible.  If it cannot find a program, it will leave
    the option commented out.  If you install the program later,
    please uncomment the line and configure it according to the
    location of the program.

    This file will direct you to more external dependencies to
    install.  You can decide whether to install them depending on your
    needs (see SYSTEM REQUIREMENTS below).



SYSTEM REQUIREMENTS
===================

Because the packages here call many other tools, there are many
external dependencies, listed below.  It is likely that not all of
them will need to be installed for your use.  I recommend reviewing
these dependencies and installing just the ones that you may use.  You
can install additional ones later if you run into problems during your
analysis (e.g. you get an error message saying that something is
missing).

* Brad Chapman (https://github.com/chapmanb) contributed a script
  (conda_install.sh) that will install these dependencies using the
  Miniconda system (http://conda.pydata.org/docs/).  You can try using
  this to simplify the installation process.




- LINUX or OS X (or macOS) operating system.
  We regularly use this on these OS's and have not tried running on
  any other operating system.


- Python 2.7+ (and libraries)
  Python 3 is not supported.

  REQUIRED python libraries:
  zlib        Part of the Python Standard Library.
  numpy       http://www.numpy.org
              Needed for math functions.
  matplotlib  http://matplotlib.org
              Used for some plots.
  pygraphviz  https://pygraphviz.github.io
              For interfacing with GraphViz.
  rpy2        http://rpy2.bitbucket.org
              To interface to R.
  psutil      https://github.com/giampaolo/psutil

  REQUIRED libraries.  For reading and writing Excel files.
  xlwt       https://pypi.python.org/pypi/xlwt
  openpyxl   http://openpyxl.readthedocs.io/en/default/
  xlrd       https://pypi.python.org/pypi/xlrd
  arial10    https://github.com/
               juanpex/django-model-report/blob/master/model_report/arial10.py

  OPTIONAL python libraries.
  PIL        http://www.pythonware.com/products/pil/
             To generate plots, such as heatmaps.
             Configure with: freetype, jpeg, zlib, lcms.
  pandas     http://pandas.pydata.org
             Will speed up some file I/O if available.
             But otherwise not necessary.
  pysam      Some scripts use it for reading SAM/BAM files.


- R (+ libraries)
  R is used for almost everything, so please install.
  a.  Please instally the Python library rpy2, as described above.
  b.  Be sure R is compiled with support for generating png, tiff, and
      pdf plots.
  c.  Also, make sure that the "Rscript" program is installed.

  REQUIRED R libraries.
  Bioconductor  https://www.bioconductor.org/

  OPTIONAL R libraries.
  biomaRt       Part of bioconductor.  For annotating genes.
  GenePattern   http://software.broadinstitute.org/cancer/software/
                  genepattern/programmers-guide#_Using_GenePattern_from_R
                Some of the analyses are done in GenePattern.
  VennDiagram   https://cran.r-project.org/web/packages/VennDiagram/
                To generate Venn diagram plots.

  OPTIONAL R libraries for gene expression analysis.
  affy          Part of bioconductor.  For analyzing Affymetrix microarrays.
  oligo         Part of bioconductor.  For analyzing Illumina microarrays.
  limma         Part of bioconductor.
  DESeq2        Part of bioconductor.
  edgeR         Part of bioconductor.

  OPTIONAL R libraries for machine learning.
  e1071         https://cran.r-project.org/web/packages/e1071/index.html
  randomForest  https://cran.r-project.org/web/packages/randomForest/index.html


- MATLAB
  OPTIONAL.  Used mostly for microarray analysis.  If you intend to
  analyze microarray data, I highly recommend that you install this.
  Otherwise, you may not need it.


- Plotting tools
  GraphViz    http://www.graphviz.org
              REQUIRED if using BETSY.  Generates network graphs.
              Be sure to install the pygraphviz library (above).
  POV-Ray     http://www.povray.org
              OPTIONAL.  Makes plots in BinReg pathway analysis.


- SIGNATURE package.

  OPTIONAL.  This is used for gene expression pathway analysis.  To
  install it:
  1.  Go to http://www.bioinformatics.org/signature/
  2.  Download the files:
      SIGDB_111025.tgz
      SIGNATURE_161028.tgz
  3.  Uncompress both files.
      * Do not follow the INSTALL instructions in the SIGNATURE
        package unless you would like to use it from a GenePattern
        interface.
  4.  Configure the variables in the [SIGNATURE] section in your
      ~/.genomicoderc file.


- Miscellaneous genomics tools.
  Cluster 3.0  http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
               OPTIONAL.  For clustering gene expression data.
  ComBat       http://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html
               OPTIONAL.  For gene expression normalization.
  DWD          http://marron.web.unc.edu/sample-page/marrons-matlab-software/
               OPTIONAL.  For gene expression normalization.
               Also called "BatchAdjust".


- Utilities for NGS analysis.
  These are only needed if processing next generation sequencing data.

  samtools   http://www.htslib.org
  bcftools   http://www.htslib.org
  Picard     https://broadinstitute.github.io/picard/
  vcftools   https://vcftools.github.io/index.html
  bedtools   http://bedtools.readthedocs.io/en/latest/
  gtfutils   http://ngsutils.org/modules/gtfutils/
  bam2fastx  From Tophat package.


- NGS alignment and processing algorithms.
  This section includes software for processing and aligning reads
  from next-generation sequencing data.

  trimmomatic  http://www.usadellab.org/cms/?page=trimmomatic
               Not alignment, but used to trim reads before alignment.
  FastQC       http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
               For checking quality of reads before alignment.

  bowtie       http://bowtie-bio.sourceforge.net/index.shtml
               Alignment algorithm.
  bowtie2      http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
               Alignment algorithm.
  BWA          http://bio-bwa.sourceforge.net
               Alignment algorithm.
  STAR         https://github.com/alexdobin/STAR
               RNA specific alignment algorithm.


- RNA-Seq Software.
  Please install if analyzing RNA-Seq data.

  Tophat    https://ccb.jhu.edu/software/tophat/index.shtml
            Estimates gene expression values.
  RSEM      http://deweylab.github.io/RSEM/
            Estimates gene expression values.
  HTSeq     http://www-huber.embl.de/HTSeq/doc/overview.html
            Counts reads.

  RSeQC     http://rseqc.sourceforge.net
            Quality control of RNA-Seq.
            Also be sure to download the gene models and ribosome RNA
            bed files.


- NGS Variant callers.
  Please install if doing variant calling from NGS data.

  GATK           https://software.broadinstitute.org/gatk/
                 Variant caller.  Includes other miscellaneous tools for
                 processing the alignments.
  Platypus       http://www.well.ox.ac.uk/platypus


  Somatic variant callers.  Install these for calling mutations on
  cancer genomes, compared against a background germline mutations.

  Mutect         http://archive.broadinstitute.org/cancer/cga/mutect
  Mutect2        Distributed with recent versions of GATK.
  Varscan        http://varscan.sourceforge.net
  SomaticSniper  http://gmt.genome.wustl.edu/packages/somatic-sniper/
  MutationSeq    http://compbio.bccrc.ca/software/mutationseq/
  MuSE           http://bioinformatics.mdanderson.org/main/MuSE
  Strelka        https://sites.google.com/site/strelkasomaticvariantcaller/home
  Pindel
  Indelocator    Distributed with recent versions of GATK.

  Radia          https://github.com/aradenbaugh/radia
                 For calling mutations from RNA and DNA simultaneously.
  blat           https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
                 Needed for Radia caller.
  snpEff         http://snpeff.sourceforge.net
                 Needed for Radia caller.

  Annovar        http://annovar.openbioinformatics.org/en/latest/
                 For annotating variants.


- NGS ChIP-Seq programs.
  Please install if doing peak finding from ChIP-Seq data.

  MACS 1.4   http://liulab.dfci.harvard.edu/MACS/
  MACS 2     https://github.com/taoliu/MACS
  PeakSeq    http://info.gersteinlab.org/PeakSeq
  SPP        http://compbio.med.harvard.edu/Supplements/ChIP-seq/
  Homer      http://homer.salk.edu/homer/



TESTING
=======

There is not yet an automated system for doing regression testing.
However, there are some tests in Betsy/regr_test.sh that can be run.
You can check your output against that in Betsy/test_output.log.