/nuc_dynamics

Particle dynamics and simulated annealing for chromosome structure calculation

Primary LanguagePython

NucDynamics
-----------

NucDynamics is a Python/Cython program for the calculation of genome structures
from single-cell Hi-C chromosome contact data using a simulated annealing
particle dynamics protocol. This software takes NCC format contact data (as
output by NucProcess) and creates 3D coordinates, which are output as N3D (a
simple tab-separated format, see below) or PDB (ProteinDataBank) format files.
If the result of the calculation is output as PDB format the structures may be
viewed in molecular graphics software such as PyMol. Further output formats will
be supported in the near future.

To run NucDynamics issue the 'nuc_dynamics' command line followed by the
name/location of an input NCC format contact file. Various options may be
specfied on the command line using flags that are prefixed with '-'. A full
listing of these is given below, but commonly -o (the output coordinate file),
-m (the number of conformational models to generate), -f (output file format)
and -s (the particle size or sizes to use) will be specified.

Parameters relating to the restraint distances are not normally changed when
calculating interphase genome structures. However, the annealing stage
temperatures and the number of temperature and dynamics steps may be adjusted
according to the amount and/or quality of the single-cell Hi-C contacts. The
default parameters are generally suitable for good data sets with at least
40,000 contacts but better results may be obtained with longer and more gentle
annealing, i.e. more temperature steps and dynamics steps.

It should be noted that obtaining high-quality whole-genome structures (with a
tight conformational bundle) is dependant on both having a good number of
contacts and having a reasonable proportion of inter-chromosomal (trans)
contacts. As a rough guide, around 40,000 total contacts with 2000 trans
contacts is about the lower limit for calculating resonable structures
(intra-bundle RMSD < 2 radii) at a 100 kb particle size. Though, having a higher
proportion of trans contacts can make up for shortfalls in the total count.

The quality of the output structure bundles naturally depends on the final
sequence resolution that is chosen, as specified by the particle size (-s
option). Though the default paramaters calculate to a final size of 100 kb, for
more sparse datasets larger sizes often give good results. Using finer sizes can
provide more detail, but will increase the number of particles that are not
restrained by at least one Hi-C derived contact, and thus the precision (RMSD)
at the particle scale will decrease. Though, with >100,000 contacts particle
sizes of 25 kb are reasonable. However, finer resolutions will naturally
increase memory requirements and calculation time. 

Multiple particle sizes can be specified so that the genome structure
calculation uses a hierarchical protocol, calculating a low resolution structure
first and then basing the next, finer resolution stage on the output of the
previous stage. Hierarchical particle sizes need not be used, but they make the
calculation more robust. 


Python Module Requirements
---------------------------

This software uses Python version 2 or 3 and requires that the Numpy and Cython
packages are installed and available to the Python version that runs
NucDynamics.

These modules are available in bundled Python packages like Anaconda or Canopy,
in most Linux distributions' package managers or can be installed on most
UNIX-like systems using pip:

  pip install numpy
  pip install cython


Installation
------------

NucDynamics does not require installation as such and may be run directly from
its download location, though all the component files must reside in the same
directory.

When first run, NucDynamics will attempt to compile the modules written in
Cython. A re-compilation may be forced by deleting the .so and .c files that
result from the compilation. The Cython code may also be compiled indepenently
using the setup_cython.py script as follows:

  python setup_cython.py build_ext --inplace


Running NucDynamics
-------------------

Typical use, generating 10 conformational models in PDB format:

  nuc_dynamics example_chromo_data/Cell_1_contacts.ncc -m 10 -f pdb


Specifying the particle sizes (8 Mb, 2 Mb, 1 Mb, 500 kb) and an output file
name:

  nuc_dynamics example_chromo_data/Cell_1_contacts.ncc -m 10 -f pdb -o Cell_1.pdb -s 8 2 1 0.5


Example Data
------------

Example NCC format contact data to demonstrate NucDynamics is avaiable in the
example_chromo_data sub-directory, as a .tar.gz archive which must be extracted
before use. 

N3D coordinate format
---------------------

The default N3D output file format for genomic 3D coordinate positions is a
simple whitespace-separated format consisting of blocks of lines for separate
chromosomes (or other named sequence segments). Each block consists of a header
line, giving the chrosome name, number of coordinates (i.e. number of particle
positions) and the number of alternative coordinate models. The subsequent
particle data lines for the chromosome contain the basebair sequence position
followed by cartesian (X, Y, Z) coordinates for each alternative model.

i.e. each block is arranged like:

  chr_name      num_coords    num_models
  seq_pos_a     xa1     ya1     za1     xa2     ya2     za2     xa3     ya3     za3      ...
  seq_pos_b     xb1     yb1     zb1     xb2     yb2     zb2     xb3     yb3     zb3      ...
  ...

For example the lines for two chromosomes, each with 5 positions/coordinates and
1 model could be:

  chr7    5       1
  3000000 9.58282628      7.00573100      -1.83845778
  3100000 9.59673638      5.97364070      -2.01971472
  3200000 10.96127666     5.81146425      -1.89757439
  3300000 10.62732797     4.85757116      -1.96395929
  3400000 10.12147618     5.03441764      -1.01074847
  chrX    5       1
  3100000 -4.90378489     5.82575335      -2.68593345
  3200000 -3.82554225     5.68513329      -2.78682360
  3300000 -2.86136185     5.20402428      -2.68921585
  3400000 -3.37860658     4.53744574      -2.09833147
  3500000 -4.05993187     3.72324210      -2.17562361
 

Command line options for nuc_dynamics
-------------------------------------

usage: nuc_dynamics [-h] [-o OUT_FILE] [-m NUM_MODELS] [-f OUT_FORMAT]
                    [-s Mb_SIZE [Mb_SIZE ...]] [-iso Mb_SIZE] [-pow FLOAT]
                    [-lower DISTANCE] [-upper DISTANCE] [-bb_lower DISTANCE]
                    [-bb_upper DISTANCE] [-ran INT] [-rad DISTANCE]
                    [-hot TEMP_KELVIN] [-cold TEMP_KELVIN] [-temps NUM_STEPS]
                    [-dyns NUM_STEPS] [-time_step TIME_DELTA]
                    NCC_FILE

Single-cell Hi-C genome and chromosome structure calculation module for Nuc3D
and NucTools

positional arguments:
  NCC_FILE              Input NCC format file containing single-cell Hi-C
                        contact data, e.g. use the demo data at
                        example_chromo_data/Cell_1_contacts.ncc

optional arguments:
  -h, --help            show this help message and exit
  -o OUT_FILE           Optional name of output file for 3D coordinates in N3D
                        or PDB format (see -f option). If not set this will be
                        auto-generated from the input file name
  -m NUM_MODELS         Number of alternative conformations to generate from
                        repeat calculations with different random starting
                        coordinates: Default: 1
  -f OUT_FORMAT         File format for output 3D coordinate file. Default:
                        "n3d". Also available: "pdb"
  -s Mb_SIZE [Mb_SIZE ...]
                        One or more sizes (Mb) for the hierarchical structure
                        calculation protocol (will be used in descending
                        order). Default: 8.0 4.0 2.0 0.4 0.2 0.1
  -iso Mb_SIZE          Contacts must be near another, within this (Mb)
                        separation threshold (at both ends) to be considered
                        supported: Default 2.0
  -pow FLOAT            Distance power law for combining multiple Hi-C
                        contacts between the same particles. Default: -0.33
  -lower DISTANCE       Lower limit for a contact-derived distance restraint,
                        as a fraction of the ideal distance. Default: 0.8
  -upper DISTANCE       Upper limit for a contact-derived distance restraint,
                        as a fraction of the ideal distance. Default: 1.2
  -bb_lower DISTANCE    Lower limit for sequential particle backbone
                        restraints, as a fraction of the ideal distance.
                        Default: 0.1
  -bb_upper DISTANCE    Upper limit for sequential particle backbone
                        restraints, as a fraction of the ideal distance.
                        Default: 1.1
  -ran INT              Seed for psuedo-random number generator
  -rad DISTANCE         Radius of sphere for random starting coordinates.
                        Default: 10.0
  -hot TEMP_KELVIN      Start annealing temperature in pseudo-Kelvin units.
                        Default: 5000
  -cold TEMP_KELVIN     End annealing temperature in pseudo-Kelvin units.
                        Default: 10
  -temps NUM_STEPS      Number of temperature steps in annealing protocol
                        between start and end temperature. Default: 500
  -dyns NUM_STEPS       Number of particle dynamics steps to apply at each
                        temperature in the annealing protocol. Default: 100
  -time_step TIME_DELTA
                        Simulation time step between re-calculation of
                        particle velocities. Default: 0.001


Jupyter Script
--------------

In addition to the command-line tool a Jupyter notebook for Python 3 is provided
to illustrate how nuc_dynamics can be imported used within Python scripts.
Jupyter can be installed using:

  pip install jupyter

And from the directory containing nuc_dynamics the notebook can be started with:

  jupyter notebook

This notebook is has only been tested under Python version 3 and to run requires
the Cython code to be compiled (see Installation section above), to generate the
dyn_util.so file, and for all the modules to either be in the same directory or
on the PYTHONPATH.