Scripts to automate the creation of Tables and Figures for the Assemblathon 1 project analysis.
Keith Bradnam and Ian Korf wrote and retain rights the files extra/cds_blast.pl and extra/FAlite.pm
- python >= 2.5 but less than 3.0.
- assemblaScripts: https://github.com/benedictpaten/assemblaScripts
- matplotlib: http://matplotlib.sourceforge.net/ for plots
- latex: http://www.latex-project.org/ for latex tables
- fltpage: http://www.ctan.org/tex-archive/macros/latex/contrib/fltpage for formatting the latex tables
- R: http://www.r-project.org/ for the correlation plots
- Data: http://compbio.soe.ucsc.edu/assemblathon1/
- Annotations: http://compbio.soe.ucsc.edu/assemblathon1/
- Download the package.
cd
into the directory.- Type
make
.
- Run your assembler on the assemblathon dataset and produce a fasta file. Pick a letter to be your ID from {R, S, T, U, Y, Z}. I'll assume you use R for the rest of this list. Edit the
idMap
dict insrc/libGeneral.py
to add your ID and your name code. This is used throughout the code to place names on figures and tables, and to map filenames to assemblies. - Create a sequence name map using
$ fastaContigHeaderMapper.py --prefix R1 --createMap R1.map < R1.rawAssembly.fa
. - Use the name map to transform all of the sequence names using
$ fastaContigHeaderMapper.py --map R1.map --goForward < R1.rawAssembly.fa > R1.fa
- Standardize the number of Ns in your sequences to be no greater than 25. For Assemblathon 1 25 Ns marked the difference between contigs within a scaffold.
$ standardizeNumNs.py --expandAt 25 < R1.fa > R1_scaffolds.fa
. So if your assembler used 4 Ns as a scaffold gap then you would use--expandAt 4
, and then if there were 4 or more Ns in a row they were made to have 25 Ns. If you looked at the distribution of Ns in the resulting fasta you'd see runs of Ns of length 1, 2, 3 and 25. - Run RepeatMasker on the sequence using the simulated mobile element library available at http://compbio.soe.ucsc.edu/assemblathon1/ . We used
$ RepeatMasker -lib MELib.fa -parallel 10 -qq -xsmall -dir tempRepMaskDir/ R1.fa
. We used RepeatMasker v1.25. - Run trf on the sequence:
$ trf R1.fa 2 7 7 80 10 50 2000 -m -d -h
. We used trf v4.00. - Soft mask the assembly with the union of the RepeatMasker and trf outputs, here I'll call the resulting file
R1_scaffolds.trf.repmask.fa
. Though we do not distribute code to do this in this repository, one way to accomplish this step would be to download and compile the UCSC Genome Browser code base and then use the programmaskOutFa
with a call like$ maskOutFa -softAdd R1.repmask.fa R1.trf.bed R1_scaffolds.trf.repmask.fa
- Split the assembly into contigs using
$ splitSequenceAtNs.py --splitAt 25 < R1_scaffolds.trf.repmask.fa > R1_contigs.trf.repmask.fa
. You now have two fasta files for your assembly, a scaffold and a contig file. You're now ready to run the Cactus aligner. - Run
assemblaScripts
on your data. - Create a new directory where you'd like to create perform an analysis.
- Run
importCactusResult.py
to bring in the necessary files for both the Scaffold and Contig alignments into your new directory. cp
theanalysisMakefile
into the new directory.- You need to place an entry for your assembly in the extra/cdsCoverage.tab table. These numbers were originally generated by Keith Bradnam using his script cds_blast.pl which you can find in extra/ .
- Edit
analysisMakefile
to set the variables forbinDir
,fltpage
andsimulationStatsTab
. ThebinDir
should be the relative path to this repo'sbin/
. ThesimulationStatsTab
should point to the file inextra/
. - Run
make -f analysisMakefile
. Pro tip: The makefile was written to use-j
for a speedup. - Once the make finishes, most of the results are stored in
publication/
with some other results being incorrelations/
,indelDist/
,nNGDiffs/
andphasingAnalysis/
. - There is no step seventeen!
-
Q I'm getting an error in the make when running my own new assembly through the process:
'''shell Traceback (most recent call last):
File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 181, in main()
File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 178, in main fullProcessStream( options ) File "/cluster/home/dearl/assemblathon/assemblAnalysis/bin/summarizedRankingToTeamTop.py", line 103, in fullProcessStream
'%s\noffending list:\n %s\n' % ( str(headerList), str(d))) RuntimeError: Error, len(d) != headerList:
['#Assembly', 'Overall', 'N50_CPNG50 (value)', 'N50_SPNG50 (value)', 'structuralContigPathErrors (value)', 'contiguousRanks (value)', 'substitutionErrors (value)', 'copyNumberErrors (value)', 'coverageTotal (value)', 'coverageCDS (value)'] offending list: ... '''A This is usually the result of different orderings between the assemblies in the ranking files. Make sure the order of assemblies in publication/rankings/N50_SPN5G0.tab and extra/cDnaAlignment.tab is the same.
- Earl et al. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Res (2011) vol. 21 (12) pp. 2224-41 http://genome.cshlp.org/content/21/12/2224