Assemblathon Analysis Scripts

Scripts to automate the creation of Tables and Figures for the Assemblathon 1 project analysis.

Authors

Dent Earl, Benedict Paten

Keith Bradnam and Ian Korf wrote and retain rights the files extra/cds_blast.pl and extra/FAlite.pm

Dependencies

python >= 2.5 but less than 3.0.
assemblaScripts: https://github.com/benedictpaten/assemblaScripts
matplotlib: http://matplotlib.sourceforge.net/ for plots
latex: http://www.latex-project.org/ for latex tables
fltpage: http://www.ctan.org/tex-archive/macros/latex/contrib/fltpage for formatting the latex tables
R: http://www.r-project.org/ for the correlation plots
Data: http://compbio.soe.ucsc.edu/assemblathon1/
Annotations: http://compbio.soe.ucsc.edu/assemblathon1/

Installation

Download the package.
cd into the directory.
Type make.

Use

Run your assembler on the assemblathon dataset and produce a fasta file. Pick a letter to be your ID from {R, S, T, U, Y, Z}. I'll assume you use R for the rest of this list. Edit the idMap dict in src/libGeneral.py to add your ID and your name code. This is used throughout the code to place names on figures and tables, and to map filenames to assemblies.
Create a sequence name map using $ fastaContigHeaderMapper.py --prefix R1 --createMap R1.map < R1.rawAssembly.fa .
Use the name map to transform all of the sequence names using $ fastaContigHeaderMapper.py --map R1.map --goForward < R1.rawAssembly.fa > R1.fa
Standardize the number of Ns in your sequences to be no greater than 25. For Assemblathon 1 25 Ns marked the difference between contigs within a scaffold. $ standardizeNumNs.py --expandAt 25 < R1.fa > R1_scaffolds.fa . So if your assembler used 4 Ns as a scaffold gap then you would use --expandAt 4, and then if there were 4 or more Ns in a row they were made to have 25 Ns. If you looked at the distribution of Ns in the resulting fasta you'd see runs of Ns of length 1, 2, 3 and 25.
Run RepeatMasker on the sequence using the simulated mobile element library available at http://compbio.soe.ucsc.edu/assemblathon1/ . We used $ RepeatMasker -lib MELib.fa -parallel 10 -qq -xsmall -dir tempRepMaskDir/ R1.fa . We used RepeatMasker v1.25.
Run trf on the sequence: $ trf R1.fa 2 7 7 80 10 50 2000 -m -d -h. We used trf v4.00.
Soft mask the assembly with the union of the RepeatMasker and trf outputs, here I'll call the resulting file R1_scaffolds.trf.repmask.fa. Though we do not distribute code to do this in this repository, one way to accomplish this step would be to download and compile the UCSC Genome Browser code base and then use the program maskOutFa with a call like $ maskOutFa -softAdd R1.repmask.fa R1.trf.bed R1_scaffolds.trf.repmask.fa
Split the assembly into contigs using $ splitSequenceAtNs.py --splitAt 25 < R1_scaffolds.trf.repmask.fa > R1_contigs.trf.repmask.fa . You now have two fasta files for your assembly, a scaffold and a contig file. You're now ready to run the Cactus aligner.
Run assemblaScripts on your data.
Create a new directory where you'd like to create perform an analysis.
Run importCactusResult.py to bring in the necessary files for both the Scaffold and Contig alignments into your new directory.
cp the analysisMakefile into the new directory.
You need to place an entry for your assembly in the extra/cdsCoverage.tab table. These numbers were originally generated by Keith Bradnam using his script cds_blast.pl which you can find in extra/ .
Edit analysisMakefile to set the variables for binDir, fltpage and simulationStatsTab. The binDir should be the relative path to this repo's bin/. The simulationStatsTab should point to the file in extra/.
Run make -f analysisMakefile. Pro tip: The makefile was written to use -j for a speedup.
Once the make finishes, most of the results are stored in publication/ with some other results being in correlations/ , indelDist/ , nNGDiffs/ and phasingAnalysis/.
There is no step seventeen!

FAQ

Q I'm getting an error in the make when running my own new assembly through the process:

'''shell Traceback (most recent call last):
File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 181, in main()
File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 178, in main fullProcessStream( options ) File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 103, in fullProcessStream
'%s\noffending list:\n %s\n' % ( str(headerList), str(d))) RuntimeError: Error, len(d) != headerList:
['#Assembly', 'Overall', 'N50_CPNG50 (value)', 'N50_SPNG50 (value)', 'structuralContigPathErrors (value)', 'contiguousRanks (value)', 'substitutionErrors (value)', 'copyNumberErrors (value)', 'coverageTotal (value)', 'coverageCDS (value)'] offending list: ... '''

A This is usually the result of different orderings between the assemblies in the ranking files. Make sure the order of assemblies in publication/rankings/N50_SPN5G0.tab and extra/cDnaAlignment.tab is the same.

References

Earl et al. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Res (2011) vol. 21 (12) pp. 2224-41 http://genome.cshlp.org/content/21/12/2224

dentearl/assemblAnalysis

Assemblathon Analysis Scripts

Authors

Dependencies

Installation

Use

FAQ

References