
The software involved in the MetaPhase project, as described in G3 (http://dx.doi.org/10.1534/g3.114.011825)


MetaPhase User's Manual and Quick Start Guide

MetaPhase: A software tool for metagenomic deconvolution with Hi-C.

Created by Josh Burton (jnburton at uw.edu) in the Department of Genome Sciences at the University of Washington, Seattle, WA, USA

Publication in G3: Genes | Genomes | Genetics (please cite) is here: http://dx.doi.org/10.1534/g3.114.011825

Table of Contents

A. INTRODUCTION
  1. What is MetaPhase, and why do I care?
  2. What do I need to have in order to run MetaPhase?
B. INSTALLATION AND QUICK START GUIDE
  1. System requirements
  2. Downloading the MetaPhase package
  3. Compiling the MetaPhase package
  4. Walkthrough: Try out MetaPhase on a sample dataset
C. METAPHASE INPUT FILES
  1. Input file list
  2. Creating the draft metagenome assembly
  3. Aligning the Hi-C reads to the draft metagenome assembly
  4. Supplying reference genomes
  5. Creating TSV files
D. RUNNING METAPHASE
  1. Required command-line arguments
  2. Optional command-line arguments
  3. Output command-line arguments (all optional toggles)
E. METAPHASE OUTPUT
  1. Charts
  2. Images
  3. Files in the output directory
  4. Running LACHESIS
F. TROUBLESHOOTING
  1. MetaPhase won't run!
  2. MetaPhase is crashing!
  3. MetaPhase is producing a weird result!

A. Introduction

A1. What is MetaPhase, and why do I care?

MetaPhase is a software tool to perform metagenomic deconvolution. That is, it inputs a metagenome assembly - an assembly created from a mixed genomic sample, usually of many different microbial species - and it determines which contigs in that assembly belong together in the same genomes. A metagenome assembly does not contain the complete genomes of any one species in the mixed sample, but the deconvoluted assembly can contain nearly complete genomes of many individual species. MetaPhase relies on data generated by Hi-C, an established molecular technique for studying chromatin conformation (paper).

You want to use MetaPhase if you are studying a microbial community and you want to know the genomes of individual microbial species. MetaPhase works well on fairly complex communities, and it can study eukaryotes and prokaryotes equally well. MetaPhase cannot deconvolute closely related strains of the same species; it will put all of these strains into a single cluster. MetaPhase has not been tested on communities with thousands of species, such as the human gut microbiome, nor can it reliably detect species with abundances well below 1%. (Note that the limitation here is in the inability of standard de novo metagenome assembly software to generate a draft metagenome assembly containing contigs from rare species, rather than the ability of MetaPhase to deconvolute these contigs.)

You do not want to use MetaPhase if you are studying strain variation within a single species, or if you are studying genomic rearrangements in human cancer genomes. You also do not want to use MetaPhase unless you have a metagenomic Hi-C dataset or are willing to create one; Hi-C is not a trivial technique to learn. Purely computational (rather than molecular) metagenomic deconvolution is difficult, but not impossible: see the papers cited in the introduction to the MetaPhase paper.

A2. What do I need to have in order to run MetaPhase?

At a high level, you only need two things in order to run MetaPhase:

  1. A draft de novo metagenome assembly. This can be created from a shotgun metagenome sequencing library by any number of assembly tools, such as Velvet, IDBA-UD, ABySS, or SPAdes.
  2. A Hi-C sequencing library created from a metagenome sample - preferably the same sample that was used to create the de novo metagenome assembly, or a very similar one.

At a low level, MetaPhase requires several different input files, as well as some optional inputs. For more details about the input files, see section C1, "Input file list".


B. Installation and Quick Start Guide

B1. System requirements

To setup and run MetaPhase, you will need a computer running in a UNIX environment with at least 16GB of memory, with the following software installed:

You may also need the following software:

MetaPhase also requires the boost C++ libraries (http://www.boost.org/) and the SAMtools toolkit (http://samtools.sourceforge.net/), but these are included with the MetaPhase installation package.

B2. Downloading the MetaPhase package

Download the MetaPhase package from http://shendurelab.github.io/MetaPhase/ into a UNIX filesystem. If you download the tarball (MetaPhase.tar.gz), unpack it using the following UNIX commands:

    tar xzvf MetaPhase.tar.gz
    cd MetaPhase/

From here on, I am referring to the main MetaPhase directory as <MetaPhase>.

B3. Compiling the MetaPhase package

To compile MetaPhase, simply type make in the <MetaPhase> directory. In order to run MetaPhase, you may also need to add <MetaPhase>/include/boost_1_47_0/stage/lib to your $LD_LIBRARY_PATH (to avoid the error "cannot open shared object file"). Lastly, make sure to either run MetaPhase from the <MetaPhase> directory or add that directory to your $PATH. This is important because some MetaPhase modules need to access the executable scripts FastaSize, CountMotifsInFasta.pl, Fig2a.R, and MakeClusteringResultHeatmap.R, which are included in the MetaPhase package.

B4. Walkthrough: Try out MetaPhase on a sample dataset

The MetaPhase package includes a small test case that you can run to get a feel for how MetaPhase works. It's contained in the directory test_case, which has the following subdirectories:

  • <MetaPhase>/test_case/assembly/: Contains a draft de novo metagenome assembly, assembly.fasta. This assembly consists of the 20 contigs taken from a much larger assembly of a bacterial vaginosis sample. It serves here as a toy example of a metagenome assembly.
  • <MetaPhase>/test_case/HiC/: Contains 2 fastq files, BV.H3.head.bmt.1.fq and BV.H3.head.bmt.2.fq. These reads are a subset of a much larger Hi-C dataset sequenced from a bacterial vaginosis sample. They have already been filtered with bmtagger to remove human reads.
  • <MetaPhase>/test_case/refs/: Contains one publicly available reference genome, LI.fasta, for the bacterium Lactobacillus iners. This is an optional input that MetaPhase will use to see whether its clusters match the L. iners genome.
  • <MetaPhase>/test_case/tsvs/: Contains two TSV files that describe the location of other input files and are used by MetaPhase.
  • <MetaPhase>/test_case/out/: This directory does not exist initially. When MetaPhase runs on the test case, it will create this directory and put its output here.

The one (optional) input missing from the test_case is a BLAST database of nucleotide sequences. This database, which allows you to query the metagenome assembly's contigs against all known sequences, is far too large for a test package but can be downloaded from the BLAST website. To use this database, you will need to set the --blast_dir command-line argument.

To apply MetaPhase to the test_case, run the following commands:

  • Prepare the draft de novo metagenome assembly for alignment with bwa. Note that bwa must be in your $PATH.
      cd <MetaPhase>/test_case/assembly
      ../../FastaSize assembly.fasta
      bwa index -a bwtsw assembly.fasta
  • Align the Hi-C reads to the draft assembly. This uses align.sh, a script that is already provided, which runs bwa aln and bwa sampe, and creates a BAM file that MetaPhase will use. Note that you must use bwa aln and bwa sampe, not bwa mem.
      cd <MetaPhase>/test_case/HiC
      align.sh
  • Examine the TSV files to make sure you understand what they're doing.
      cd <MetaPhase>/test_case/tsvs
      cat test_case.refs.tsv
      cat test_case.HiC_libs.tsv
  • Now, run the MetaPhase test case with a basic set of command-line arguments. The purpose of each of these command-line arguments is explained below in section D, "Running MetaPhase".
      cd <MetaPhase>
      MetaPhase -s test_case -a test_case/assembly/assembly.fasta -i test_case/tsvs --refs_dir test_case/refs -o test_case/out -N 3 --isolated_component_size 2 --jarvis_patrick_K 2

The first thing MetaPhase will do is align the contigs of the draft assembly against the reference genome LI.fasta. This may take several minutes, but it's a one-time wait: the results will be cached in a special file. Next, MetaPhase will cluster the 20 contigs in the draft assembly by their Hi-C linkages, creating 3 clusters (because of -N 3.) Last, MetaPhase will report basic statistics about the clusters it has created.

Now try running MetaPhase again, adding one or more of the following command-line arguments: --report_unclustered, --output_cluster_fastas, --output_heatmaps, --output_network_image. Each of these options will result in more information being output in various forms: to the screen, to files, or to images. See section D3, "Output command-line arguments", for more information.

Now look in test_case/out/test_case. This is the output directory created by your MetaPhase run. It contains several output files. The files cluster.*.fasta are your cluster fastas (they exist only if you've run with --output_cluster_fastas.) The subdirectory cached_data contains cached data files, which include the results of BLAST runs and MetaPhase clusterings.

Note that the test_case is a very small dataset and its results are not biologically useful or typical. For example, there are so few Hi-C read pairs that 10 of 20 contigs are completely unlinked, and those that are linked are in three separate clusters (so it's impossible to produce fewer than 3 clusters.) This prevents us from illustrating another useful feature of MetaPhase, which is that we can use it to predict the number of clusters. On your sample (though, alas, not on the test_case) you can run MetaPhase with -N 1 and it will generate an E(N) enrichment curve, just like the one in Figure S4 of the MetaPhase paper. This will let you determine the rough number of species in your metagenome assembly, and thus the optimal number of clusters.


C. MetaPhase input files

C1. Input file list

MetaPhase uses the following input files directly. For an illustration of what all of these files look like, see the test_case.

Required files:

  • A draft metagenome assembly, in fasta format
  • One or more SAM/BAM files describing alignments of Hi-C reads to a draft metagenome assembly. Note that these SAM/BAM file(s) must have each read listed only once, which means they must be generated with bwa aln and bwa sampe, NOT bwa mem.
  • Two TSV files, <scenario>.HiC_libs.tsv and <scenario>.refs.tsv, which describe the set of BAM input files and the set of reference genomes, respectively.

Optional files:

  • A BLAST database describing all known nucleotide (nt) and/or protein (nr) sequences. An updated version of this database can be downloaded from the BLAST website, which also contains instructions for how to install the blastn and tblastx command-line utilities that you will need. Note that these databases are large (as of 2015, nt is ~25 Gb and nr is ~50 Gb.)
  • A set of reference genomes in fasta format, describing species that you believe to be in your sample, or related to things in your sample. If you don't know everything in your sample (and you probably don't) then you can wait until you've already aligned your metagenome assembly with a BLAST search and then take suggestions from those search results. To find a reference genome assembly for a species, search the NCBI Assembly database.
  • A SAM/BAM file describing alignments of shotgun reads to a draft metagenome assembly. The shotgun reads are the same reads used to create the assembly. MetaPhase can use this file to estimate the abundance in your sample of each contig, and thus of each cluster. Currently not available without hacking MetaPhase.cc a little (sorry.)

C2. Creating the draft metagenome assembly

One of the most important inputs to MetaPhase is the draft de novo metagenome assembly. You must create this assembly yourself using shotgun reads from your sample. There are many publicly available de novo metagenome assembly tools that work quite well, including Velvet, IDBA-UD, ABySS, and SPAdes. I used IDBA-UD while developing MetaPhase.

It is important to realize that MetaPhase does not produce any new sequence; it only clusters sequence that is already in the assembly. If some sequence from your sample does not make it into the draft assembly - because it's too infrequent, too GC-unbalanced, too repetitive, or for any other reason - then MetaPhase cannot possibly cluster it into a genome. It may be worth trying out many different options in your metagenome assembler, or many different metagenome assembly tools, in order to get an assembly with the greatest amount of sequence and the longest contig N50.
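When comparing candidate assemblies by contig N50, a small helper like the following can be handy. It is only a sketch: the function names n50 and fasta_lengths are made up for this example and are not part of the MetaPhase package. It uses the usual definition of N50 (the largest length L such that contigs of length L or more contain at least half of the assembled bases).

```python
def fasta_lengths(lines):
    """Yield the sequence length of each record in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

To apply it to a real assembly, open the fasta file and pass the file handle through fasta_lengths, for example n50(fasta_lengths(open("assembly.fasta"))).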

MetaPhase performs much better when its input contigs are longer, because there's a clearer signal of Hi-C linkage for it to use. In particular, MetaPhase cannot cluster a contig that does not contain any restriction enzyme sites, because a Hi-C read can't reliably align to it. Keep this fact in mind when choosing what restriction enzyme to use for your Hi-C experiment. If your metagenome assembly has a small N50, you might want to use a 4-cutter restriction enzyme instead of a 6-cutter. (For example, if your metagenome assembly has an N50 of only 4 Kb, then a Hi-C library made with a 6-cutter - which cuts roughly every 4 Kb - will be completely unable to cluster 50% of the assembly's sequence.)
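The arithmetic behind the 4-cutter versus 6-cutter choice can be checked on your own assembly before committing to an enzyme. The sketch below is illustrative only (the function names are hypothetical, and it is not the CountMotifsInFasta.pl script shipped with MetaPhase). It assumes equal base frequencies for the expected spacing and a palindromic recognition site, so that scanning one strand is enough.

```python
def expected_site_spacing(site_length):
    """Expected spacing in bp between occurrences of one specific site of this
    length, assuming all four bases are equally likely and independent."""
    return 4 ** site_length


def count_sites(sequence, site):
    """Count occurrences of a (palindromic) restriction site in a sequence."""
    sequence = sequence.upper()
    site = site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def fraction_without_sites(contigs, site):
    """Fraction of assembly bases that lie in contigs lacking the site.

    `contigs` maps contig name -> sequence string."""
    total = sum(len(seq) for seq in contigs.values())
    if total == 0:
        return 0.0
    missing = sum(len(seq) for seq in contigs.values()
                  if count_sites(seq, site) == 0)
    return missing / total
```

A 6-cutter gives expected_site_spacing(6), i.e. 4096 bp, and a 4-cutter gives 256 bp. Running fraction_without_sites on the draft assembly for each candidate enzyme gives a rough idea of how much sequence could never be clustered with that enzyme.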

C3. Aligning the Hi-C reads to the draft metagenome assembly

In addition to the metagenome assembly itself, MetaPhase inputs an alignment of the Hi-C reads to the metagenome assembly. This file must be in SAM or BAM format, and it must contain each Hi-C read only once. You can use any aligner that produces SAM/BAM files; I used bwa while developing MetaPhase. (If you use bwa, make sure to use bwa aln and bwa sampe, not bwa mem, which outputs each read multiple times!)
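If you are unsure whether an alignment file lists each read only once, a quick check on the SAM text can help. The following is a rough sketch, not part of MetaPhase: the function name is hypothetical, and it relies only on the standard SAM FLAG bits (0x40 and 0x80 for first and second read in a pair, 0x100 for secondary and 0x800 for supplementary alignments).

```python
from collections import Counter

FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def find_repeated_reads(sam_lines):
    """Return the set of (read name, mate number) keys that appear more than
    once, or that carry a secondary/supplementary alignment flag."""
    seen = Counter()
    repeated = set()
    for line in sam_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & FIRST_IN_PAIR:
            mate = 1
        elif flag & SECOND_IN_PAIR:
            mate = 2
        else:
            mate = 0  # unpaired read
        key = (name, mate)
        seen[key] += 1
        if seen[key] > 1 or flag & (SECONDARY | SUPPLEMENTARY):
            repeated.add(key)
    return repeated
```

An empty result means every read end is listed once. For a BAM file, the text records can be produced with samtools view and fed to this function line by line.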

Hi-C reads are unique: they are deliberately chimeric, with a chimeric ligation site whose sequence is known from the restriction enzyme (e.g., HindIII cuts at AAGCTT and produces AAGCTAGCTT upon re-ligation.) Because of this, a straightforward alignment approach will miss many useful Hi-C pairings. You might want to design a custom alignment pipeline to maximize your yield; if so, take a look at the script align.iter.interactive.sh, which I used in my own development and may give you ideas for your custom pipeline.
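One ingredient of such a custom pipeline is recognizing the ligation junction inside a read. The sketch below shows how the junction sequence follows from the recognition site and how a read might be split there. It is an illustration under stated assumptions, not a description of what align.iter.interactive.sh does: the function names are hypothetical, and it assumes a palindromic site cut to leave a 5' overhang that is filled in before blunt-end ligation.

```python
def ligation_junction(site, cut_offset):
    """Junction created when two filled-in ends of the same site are ligated.

    `cut_offset` is the number of bases to the left of the cut on the top
    strand (1 for HindIII, which cuts A^AGCTT)."""
    filled_left_end = site[:len(site) - cut_offset]
    filled_right_end = site[cut_offset:]
    return filled_left_end + filled_right_end


def split_at_junction(read, junction):
    """Split a read at the first junction it contains.

    Returns (left, right); right is None when no junction is found."""
    pos = read.find(junction)
    if pos == -1:
        return read, None
    # The junction is two filled-in ends of equal length; cut between them.
    half = len(junction) // 2
    return read[:pos + half], read[pos + half:]
```

For HindIII, ligation_junction("AAGCTT", 1) gives AAGCTAGCTT, matching the example above. The two pieces returned by split_at_junction could then be aligned separately, since they may come from different contigs.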

C4. Supplying reference genomes

Supplying reference genomes to MetaPhase is entirely optional, but very useful. MetaPhase can align the contigs of the draft assembly in order to get a clue of what species they're likely to be from. There are two ways to do this: aligning to a BLAST database containing all known sequences; and aligning to a local fasta file containing a single reference genome assembly. The former method is useful for exploring the question of what taxa are in your sample; the latter method is useful for zeroing in on individual species that you know to be in your sample (or related to things in your sample) and for creating the heatmap and cluster network images (see section E2, "Images"). You can start with a BLAST alignment only, then use the hits from there to determine what species you're likely to encounter, then download those references and feed them into MetaPhase. The list of reference genome assemblies is supplied to MetaPhase in the file <scenario>.refs.tsv.

MetaPhase will perform all of the alignments to both the BLAST database and to the reference genomes. MetaPhase may call the BLAST commands blastn, tblastx, and makeblastdb, all of which are part of the BLAST command-line code package; make sure these commands are in your $PATH. Note that BLAST can use a lot of runtime, especially if you set --use_tblastx. However, MetaPhase caches the results of the BLAST runs in <out_dir>/cached_data in order to save runtime later.

C5. Creating TSV files

MetaPhase requires two input TSV (tab-separated value) files: one to give it the set of SAM/BAM alignment files, and one to give it the set of reference genome assemblies. These files are small, and you'll want to make them by hand, especially because they may need manual modification later. The easiest thing to do is follow the example of the TSV files in test_case/tsvs.


D. Running MetaPhase

To get a quick summary of all of MetaPhase's command-line arguments, run MetaPhase -help. A more detailed explanation is here.

D1. Required command-line arguments

  • -s <string>: Scenario name. This is used by MetaPhase to name your run. It is used as the beginning of the name of the tsv files (see -i below) and also as the name of the output directory (see -o below). Lastly, you can ignore this, but there are some scenario names that have hard-wired command-line options that I used in development (this is, for example, why -a is not listed as a "required" argument in the MetaPhase -help command.)
  • -a <string>: The location of the draft de novo metagenome assembly fasta file. This must be an absolute path, not a relative path - i.e., it must start with /.
  • -N <integer>: The number of clusters to create. If you set to 1, MetaPhase will cluster everything into a single cluster and will calculate E(N), the intra-cluster link enrichment, along the way, then write a file enrichment_curve.jpg that can give you an estimate of the number of species in your sample. Do not set to 0 or to a number larger than the number of contigs.

D2. Optional command-line arguments

Some of these arguments include $HOME in their default values. This refers to your UNIX home directory (the place where you go when you type cd ~ or cd $HOME.)

  • -i <string>: Input directory. This is the directory containing the tsv files, <scenario>.HiC_libs.tsv and <scenario>.refs.tsv. Default: ./input.
  • -o <string>: Output root directory. Output files from this run will go in <out_dir>/<scenario>. Default: $HOME/MP/out.
  • --blast_dir <string>: Directory containing the BLAST databases (nt.* and nr.*) that MetaPhase uses for alignments. You can download these files from the BLAST website. Default: $HOME/extern/blast.
  • --refs_dir <string>: Directory containing the reference genome assemblies that are listed in the refs.tsv file. Default: $HOME/MP/refs.
  • --use_tblastx: Toggle. If set, MetaPhase will perform its BLAST alignments using tblastx instead of blastn - which, instead of aligning the contigs' nucleotide sequences against a nucleotide database, translates the nucleotides to amino acids and aligns them to a protein database. Because protein sequences are more conserved than nucleotide sequences, tblastx picks up more distant phylogenetic relationships - i.e., at the level of family or genus instead of species - which you may or may not want. tblastx is also much slower than blastn.
  • --force_blast_realign: Toggle. If set, MetaPhase will ignore and overwrite any cached files describing BLAST alignments.
  • -b: Toggle. Apply statistical bootstrapping to the link matrix. In other words, once the matrix of Hi-C links is created (and before it's normalized), resample the matrix with replacement, creating a new matrix with the same total number of links but a random variation in the exact placement of links. If you want to test the robustness of your clustering result, run MetaPhase with -b several times and compare the results, which should be stochastically different.
  • --isolated_component_size <integer>: After creating the contig connectivity graph from the link matrix, discard any component in the graph with fewer than this number of contigs. In most datasets, such components consist of noise that cannot be reliably placed into any species, and because these components can never be combined with other components, they may throw off the apparent cluster number. However, if your Hi-C link data is sparse, you may need to reduce this in order to avoid junking real clusters. Default: 100.
  • --jarvis_patrick_K <integer>: The value of K used in the Jarvis-Patrick pre-clustering step. Higher values increase runtime but may increase accuracy. To understand this number in detail, go to Jarvis and Patrick, "Clustering Using a Similarity Measure Based on Shared Near Neighbors", 1973. Default: 100.
  • --min_cluster_norm <integer>: The minimum norm of an allowable cluster. The "norm" of a contig is the number of restriction enzyme (RE) sites it contains, and the norm of a cluster is the sum of the norms of its contigs. This parameter can have a big effect on the output: increasing the min_cluster_norm will increase the minimum possible size of a cluster, potentially destroying clusters that represent small species; but it also prevents the formation of annoying little mini-clusters containing only a small number of contigs (2-3) that often result from noise in the data and/or from repetitive contigs. If you're getting one huge cluster with most of your contigs and all the other clusters are tiny, you need to increase this. Keep in mind that the norm of a cluster is roughly equal to its length in bp divided by the RE site frequency, so the optimal number may be different for different types of REs. Default: 25.
  • -merge: Toggle. Apply some experimental clustering algorithms to merge multiple independent clusterings arising from different Hi-C libraries. Not recommended.
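To make the -b option more concrete, here is a minimal sketch of bootstrapping a link matrix: every link is redrawn with replacement, so the total number of links stays the same while the per-pair counts vary at random. This is an illustration only and may differ from how MetaPhase implements it internally. The function name and the dictionary representation of the matrix are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_links(links, rng=None):
    """Resample a link matrix with replacement.

    `links` maps a contig pair (contig_i, contig_j) to its number of Hi-C
    links. The result has the same total number of links."""
    if rng is None:
        rng = random.Random()
    pairs = list(links)
    weights = [links[pair] for pair in pairs]
    total = sum(weights)
    if total == 0:
        return {}
    # Draw `total` links; each draw picks a pair in proportion to its count.
    draws = rng.choices(pairs, weights=weights, k=total)
    return dict(Counter(draws))
```

Calling bootstrap_links several times on the same input and clustering each result is the same idea as running MetaPhase with -b several times: clusters that survive most resamplings are the robust ones.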

D3. Output command-line arguments (all optional toggles)

  • --load_cached_clusters: Toggle. If set, instead of performing clustering, MetaPhase will look for a cached file in <out_dir>/<scenario>/cached_data that contains previously calculated clustering results. This file will exist if MetaPhase has previously been run on this scenario with the same value of -N as now and without the flag --dont_output_cache. This is a way to save time if you want to analyze clustering results without re-running clustering.
  • --dont_output_cache: Toggle. Don't create (or overwrite) a cluster cache file that can be loaded later with --load_cached_clusters. This option is overridden by --load_cached_clusters.
  • --report_unclustered: Toggle. If set, MetaPhase produces a final report on the contigs that it did not cluster: what species they align to, how many of them are unclustered because they are entirely unlinked, etc.
  • --output_cluster_fastas: Toggle. If set, write the files cluster.*.fasta and unclustered.fasta in <out_dir>/<scenario>/. These are fasta files that indicate how MetaPhase has clustered the contigs.
  • --output_network_image: Toggle. If set, run the script Fig2a.R and create a cluster network image like the one that appears in Figure 2A of the MetaPhase paper.
  • --output_heatmaps: Toggle. If set, run the script MakeClusteringResultHeatmap.R and create heatmap images like the ones that appear in Figure 2B and Figure S5 of the MetaPhase paper.
  • --reorder_clusters_by_refs: Toggle. If set, reorders the clusters so as to maximize the signal on the diagonals of the heatmaps created with --output_heatmaps. Note that this will depend on which reference genomes are in your refs.tsv file and the order in which they appear in the file. If you want to have consistent cluster numbers, don't set this.

E. MetaPhase output

E1. Charts

The most basic output of MetaPhase is to the screen. MetaPhase will give verbose reports on its progress as it performs pre-processing, BLAST alignments, clustering, and post-clustering analysis. Assuming it doesn't crash, MetaPhase produces a nice handy-dandy chart of its clustering results. The columns in this chart are:

  • Cluster number
  • Number of contigs in this cluster
  • Total length of all contigs in this cluster
  • Abundance: an estimate of the DNA abundance (not the species abundance) of this cluster. Defined as the percentage of shotgun reads that align to the contigs in this cluster. Requires a shotgun abundance SAM/BAM file, which currently requires hacking.
  • Plurality reference: the reference genome assembly (among those listed in the refs.tsv file) to which a plurality of the sequence aligns
  • %eukaryotic, %rDNA, %tRNA, %mtDNA: Predicted annotations of the sequence content in this cluster. Based on BLAST alignments to the BLAST database.
  • Plurality taxonomy: The most common taxonomic placements of the sequence content in this cluster. Based on BLAST alignments to the BLAST database.

If you set --report_unclustered, MetaPhase will also create a much smaller and simpler chart describing the unclustered contigs.

E2. Images

You can create pretty images like those in Figure 2A and 2B of the MetaPhase paper. To create the network image or the heatmaps, set --output_network_image or --output_heatmaps, respectively. The files are created by the scripts Fig2a.R and MakeClusteringResultHeatmap.R, respectively. These are fairly straightforward R scripts using ggplot2; if you want to tweak the appearance of the images, just tweak the scripts. These files, by default, are created in $HOME/public_html; you may need to create this directory in order to get the files to appear.

E3. Files in the output directory

MetaPhase will create the following files in <out_dir>/<scenario>:

  • assembly.blastn_report: A human-readable file conveniently summarizing the BLAST alignments of the draft assembly to the nt database.
  • result.human_readable.txt: A human-readable file listing each contig in the draft metagenome assembly and indicating how it's been clustered.
  • cluster.*.fasta and unclustered.fasta: Fasta files containing the contigs in each cluster. Created only if you run with --output_cluster_fastas.
  • Subdirectory cached_data: Contains cache files describing BLAST alignments to reference genomes (MapToRefs.txt*); BLAST alignments to the nt database (assembly._blast); and clustering results (clusters.). These files may not be particularly human-readable.
  • Subdirectory LACHESIS: Empty unless you run LACHESIS after running MetaPhase (see next section).

E4. Running LACHESIS

As demonstrated in the MetaPhase paper, it is possible to run MetaPhase to create separate clusters for each species, then subsequently run LACHESIS to create chromosome-scale scaffolds of the contigs in that cluster, thus generating a high-quality single-species assembly from nothing but metagenomic data. However, several caveats apply:

  1. This is only likely to work for eukaryotes, because LACHESIS' method doesn't really apply to prokaryotic genomes.
  2. You will have to know your species' chromosome number, because LACHESIS cannot predict chromosome numbers as precisely as MetaPhase can predict species numbers.
  3. You will have to re-align your Hi-C reads to the contigs in the cluster you are studying.
  4. In yeast species, be careful not to cluster all of the centromere-containing contigs into a single chromosome. You will have to use LACHESIS' CLUSTER_CONTIGS_WITH_CENS option.

F. Troubleshooting

F1. MetaPhase won't run!

There should be an executable file called MetaPhase. Type MetaPhase at the command line. If you get an error like "command not found", then you either aren't in the correct MetaPhase directory, or you haven't successfully completed compilation. If MetaPhase is ready to run, then typing MetaPhase will produce a PARSE ERROR and MetaPhase will describe to you the command-line arguments it needs.

If you get the following error:

    MetaPhase: error while loading shared libraries: libboost_filesystem.so.1.47.0: cannot open shared object file: No such file or directory

then you need to add the directory containing libboost_filesystem.so.1.47.0 to your environment variable $LD_LIBRARY_PATH. Type this command:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<MetaPhase>/include/boost_1_47_0/stage/lib

F2. MetaPhase is crashing!

If MetaPhase crashes, the first thing you should do is look carefully at its output. It might give a verbose explanation of what went wrong and give you a good idea for how to fix it. The most common problem is that one of the input files either couldn't be found or contains nonsensical data.

You may also receive an "assertion error", which looks like this: Assertion ... failed. That means that at some stage of the algorithm, MetaPhase encountered something specific that it wasn't expecting. An assertion error will come with a reference to the file (*.cc or *.h) and the line number where the error occurred. Try looking at that line in the file, which should contain the function assert(). There should be some comments around that line that explain what might be causing the assertion error.

In general, we've made a strong effort to make MetaPhase a well-designed and well-commented piece of code. If you're familiar with C++, you should be able to poke around in the source code and get an idea of what's going on. We recommend starting with the top-level module, MetaPhase.cc, and working from there.

F3. MetaPhase is producing a weird result!

After you've gotten MetaPhase to run properly, take a good look at the outputs, especially the report chart. If you're getting a weird result - for example, very little sequence is being assembled, or most of the sequence is clustered into a single cluster (a common problem) - you may need to tune MetaPhase's performance. Take a good look at section D2, "Optional command-line arguments".

COPYRIGHT AND DISCLAIMER

The MetaPhase software package and all software and documentation contained with it are copyright © 2013-2014 by Josh Burton and the University of Washington. All rights are reserved.

This software is supplied 'as is' without any warranty or guarantee of support. The University of Washington is not responsible for its use, misuse, or functionality. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability arising from, out of, or in connection with this software.

ACKNOWLEDGMENTS

Thanks to Ivan Liachko for making the MetaPhase project possible and for generating all of the Hi-C data used by the MetaPhase software.

Thanks to Maitreya Dunham and Jay Shendure for leadership, management, and ideas.

Thanks to Kathryn Bushley, David Fredricks, Steve Salipante, Laura Sycuro, and Andrew Wiser for patiently helping me test and troubleshoot MetaPhase.

Thanks to Aaron McKenna for helping make MetaPhase available over GitHub.

(from Andrew) you are giving users a new, limited BOOST install that they might not want to interact with the rest of their system by adding to their LD_LIBRARY_PATH. My suggestion would be to recommend the user adds the include/boost_1_47_0/stage/lib directory to their LD_LIBRARY_PATH if possible, but to also provide a wrapper script that will set the environmental variable at runtime instead. A Python program that checks if the correct path is in the user LD_LIBRARY_PATH, sets it if not found, and then executes the program would be really easy to cook up quickly.
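A wrapper along the lines Andrew suggests could look roughly like this. It is a sketch, not something shipped with MetaPhase: the function names are hypothetical, and it assumes the wrapper file is saved in the main MetaPhase directory, next to the MetaPhase executable and the bundled include/boost_1_47_0/stage/lib directory.

```python
import os
import sys

# Assumed location of the bundled boost libraries, relative to the main
# MetaPhase directory (taken from the installation instructions).
BOOST_LIB_SUBDIR = os.path.join("include", "boost_1_47_0", "stage", "lib")


def env_with_boost(metaphase_dir, environ):
    """Return a copy of `environ` whose LD_LIBRARY_PATH contains the bundled
    boost library directory, adding it only if it is not already there."""
    boost_lib = os.path.join(metaphase_dir, BOOST_LIB_SUBDIR)
    env = dict(environ)
    paths = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if boost_lib not in paths:
        paths.append(boost_lib)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(paths)
    return env


def main():
    """Replace this process with MetaPhase, passing all arguments through."""
    metaphase_dir = os.path.dirname(os.path.abspath(__file__))
    exe = os.path.join(metaphase_dir, "MetaPhase")
    env = env_with_boost(metaphase_dir, os.environ)
    os.execve(exe, [exe] + sys.argv[1:], env)
```

To use it as a script, add a call to main() at the bottom of the file and run the wrapper with the same arguments you would give MetaPhase. The variable is changed only for that one process, so the rest of the user's environment is left alone.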