This code is designed to enable anyone to reproduce the Hs2-HiC and the AaegL4 genomes reported in: Dudchenko et al., De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 2017.
Unless otherwise noted, all terminology below is consistent with this paper, and all references to figures and tables in this readme refer to this paper. Specifically, some of the terminology used below is outlined in Figure S2
. The assembly procedure is described in detail in the Supporting Online Materials, specifically in the section labelled “Pipeline description”.
In addition, the pipeline uses tools and methods from Juicer (Durand & Shamim et al., Cell Systems, 2016) and Juicebox (Durand & Robinson et al., Cell Systems, 2016), as well as additional dependencies noted below.
Feel free to post your questions and comments at: http://www.aidenlab.org/forum.html
As input, the pipeline requires: [1] a fasta file describing the draft assembly, and [2] a file called “merged_nodups.txt” which is generated by the Juicer pipeline, and contains a duplicate-free list of paired alignments from an in situ Hi-C experiment.
The output of the pipeline includes a fasta file containing the output scaffolds, as well as hic files corresponding to various stages of the assembly.
Software version: 170123
An overview of the detailed workflow of the 3D DNA pipeline is schematically given in Fig. S1
(see Dudchenko et al., 2017
).
We begin with a series of iterative steps whose goal is to eliminate misjoins in the input scaffolds. Each step begins with a scaffold pool (initially, this pool is the set of input scaffolds themselves). The scaffolding algorithm is used to order and orient these scaffolds. Next, the misjoin correction algorithm is applied to detect errors in the scaffold pool, thus creating an edited scaffold pool. Finally, the edited scaffold pool is used as an input for the next iteration of the misjoin correction algorithm. The ultimate effect of these iterations is to reliably detect misjoins in the input scaffolds without removing correctly assembled sequence. After this process is complete, the scaffolding algorithm is applied to the revised input scaffolds, and the output – a single “megascaffold” which concatenates all the chromosomes – is retained for post-processing.
This post-processing includes four steps: (i) a polishing algorithm, which is required for genomes in the Rabl configuration; (ii) a chromosome splitting algorithm, which is used to extract the chromosome-length scaffolds from the megascaffold; (iii) a sealing algorithm, which detects false positives in the misjoin correction process, and restores the erroneously removed sequence from the original scaffold; and (iv) a merge algorithm, which corrects misassembly errors due to undercollapsed heterozygosity in the input scaffolds. Steps (i) and (iv) are omitted for Hs2-HiC, which is not in the Rabl configuration and lacks substantial undercollapsed heterozygosity.
LastZ (version 1.03.73 released 20150708)
– for AaegL4 onlyPython >=2.7
scipy numpy matplotlib
Java version >=1.7
Bash >=4
GNU Awk >=4.0.2
GNU coreutils sort >=8.11
GNU Parallel >=20150322
– highly recommended to increase performance
There is no need for installation.
The pipeline was run on the following operating systems:
Red Hat Enterprise Linux Server release 7.3
CentOS Linux release 7.2.1511
Ubuntu 14.04.5
OS X Yosemite 10.10.5
The pipeline consists of one bash wrapper script (run-pipeline.sh) that calls individual modules to assemble a genome. Run
./run-pipeline.sh –h
to list arguments and parameters. The latter include:
-m haploid/diploid
- specifies whether to run the pipeline in the haploid (Hs2-HiC) or diploid (AaegL4) mode. The default is the haploid mode
-t int
- specifies the minimal draft contig/scaffold size to be considered as input
-s int
- specifies the number of rounds of misjoin correction
-c int
- specifies the expected number of chromosomes
The wrapper script calls individual modules of the pipeline. Code related to any particular module is organized into individual folders. The modules can also be run as separate scripts. The list of individual modules with their main wrapper scripts is given below:
a) scaffold (./run-liger-scaffolder.sh)
– scaffold the draft assembly
b) visualize (./run-asm-visualizer.sh)
– make Juicebox-compatible contact map
c) edit (./run-misassembly-detector.sh)
– annotate regions consistent with a misassembly
d) polish (./run-asm-pplisher.sh)
– polish assembly (AaegL4 only)
e) split (./run-asm-splitter.sh)
– split scaffolder output into raw chromosomal scaffolds
f) seal (./seal-asm.sh)
– reintroduce some fragments back into the assembly
g) merge (./run-asm-merger.sh)
– merge fragments of the assembly that represent alternative haplotypes (AaegL4 only)
h) finalize (./finalize-output.sh)
– generates an output fasta
Additional modules include:
i) utils
– several core scripts that are used across modules
j) lift
– several core scripts to liftover coordinates from the draft to assembly and vice versa
k) supp
– several additional scripts to match available map data and generate initial conditions
l) data
– mapping data tables used for validation
The pipeline generates a number of files. The types of files are listed below. The main outputs are the fasta files annotated as “FINAL” which contains the output scaffolds.
a) .fasta files
- “FINAL” – chromosome-length scaffolds, small and tiny scaffolds
- “rawchrom” – Scaffold populations analyzed in Table S2.
b) .hic files
- “polished” – after polishing stage (for diploid mode only)
- “resolved” – after editing and scaffolding
- [0123…] – correspond to the assembly at individual editing iterations
c) .scaffold_track.txt & .superscaf_track.txt
- Scaffold boundary files (Juicebox 2D annotation format)
d) .bed & .wig files
- Tracks illustrating putative misjoins
e) .cprops and .asm files
- Custom files that tracks modifications to the input contigs at various stages in the assembly.
f) supplementary files:
- edits.for.step.*.txt; mismatches.at.step.*.txt; suspect_2D.at.step.*.txt - list of problematic regions (Juicebox 2D annotation format)
- alignments.txt (for diploid mode only) – pairwise alignment data for alternative haplotype candidates, parsed from LASTZ output
a) Where to obtain the input files:
-
Draft fasta files:
-
For Hs2-HiC:
Download the Hs1.fasta draft assembly from GEO (series accession number GSE95797).
-
For AaegL4:
Run
./supp/get-AaegL2.sh
The script will download the scaffold sequences from the AaegL2 assembly (NCBI accession number GCA_000004015.2) and order the fasta in accordance with the NCBI assembly_report file there.
-
-
Merged_nodups.txt files produced using Juicer and the draft assembly:
- Download the Hs1.mnd.txt and AaegL2.mnd.txt from GEO (series accession number GSE95797).
b) How to run the pipeline:
-
Hs2-HiC:
./run-pipeline.sh –m haploid –t 15000 –s 2 –c 23 Hs1.fasta Hs1.mnd.txt
-
AaegL4:
./run-pipeline.sh –m diploid –t 10000 –s 9 –c 3 AaegL2.fasta AaegL2.mnd.txt
Note that the order and orientation of chromosome-length scaffolds in the final assemblies that we uploaded to GEO was changed to match the conventional order and orientation (i.e. hic_scaffold_1 in the GEO upload is a homolog to chromosome 1, hic_scaffold_2 is a homolog to chromosome 2 etc.).
If you use this code or the resulting assemblies, please cite the following paper:
Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science eaal3327. Published online ahead of print on March 23, 2017.
Please also remember to cite the paper that generated the draft Aedes aegypti assembly:
Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., et al. (2007). Genome sequence of Aedes aegypti, a major arbovirus vector. Science 316, 1718–1723.