powerpak/pathospot-compare

parsnp fails on our data -- should I provide a reference genome?

Closed this issue · 2 comments

Hi Ted,

I populated our own metadata SQLite database using your schema, and I created a directory structure for our sequenced genomes as you specified: one subdirectory per genome, named after the FASTA file without the ".fasta" extension, with each directory referenced in the assembly_data_link field of the corresponding row in tAssemblies.

I copied the database and the genome directory to the Vagrant machine and set the relevant environment variables (PATHOGENDB_URI and IGB_DIR; I left IN_QUERY unchanged) to point to our data.

I then ran "rake all", and it seemed to be going fine until the Parsnp step, where I got the error below. Please help, thank you :)

For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
*****************************
SETTINGS:
|-refgenome:    out.4.clust/SAMPLE177-203_89_barcode06_consensus.repeat_mask.fasta
|-aligner:      libMUSCLE
|-seqdir:       out.4.clust
|-outdir:       out.4.parsnp
|-OS:           Linux
|-threads:      32
*****************************

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from out.4.clust..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
  |->[WARNING]: no genbank file provided for reference annotations, skipping..
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
  |->[OK]
-->Running PhiPack on LCBs to detect recombination..
  |->[SKIP]
-->Reconstructing core genome phylogeny..
  |->[OK]
-->Creating Gingr input file..
**ERROR**
The following command failed:
>>/tmp/_MEIKz75db/harvest --midpoint-reroot -u -q -i out.4.parsnp/parsnp.ggr -o out.4.parsnp/parsnp.ggr -n out.4.parsnp/parsnp.tree 
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
**ERROR**

Should I explicitly provide a reference genome? The Parsnp output seems to suggest that might help:
no genbank file provided for reference annotations, skipping..

Hi Ted,

I realised that some of the errors were happening because I had included sequences where most of the genome was Ns. I have now filtered the input to keep only genomes with < 10% Ns. The original error is gone, but there is a new one, so I'll close this and open a new issue.
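
For anyone who hits the same Parsnp failure, a filter of this kind can be sketched roughly as follows. This is only a minimal sketch under assumptions, not the script actually used here: the function names `n_fraction` and `keep_genomes` and the `fasta_dir` argument are hypothetical, and it assumes uncompressed, plain-text files ending in ".fasta".

```python
from pathlib import Path


def n_fraction(fasta_path):
    """Return the fraction of sequence characters that are N (case-insensitive)."""
    total = 0
    n_count = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # header line, not sequence
            seq = line.strip().upper()
            total += len(seq)
            n_count += seq.count("N")
    # Treat an empty file as entirely unknown so that it gets filtered out.
    return n_count / total if total else 1.0


def keep_genomes(fasta_dir, max_n_fraction=0.10):
    """Return the sorted .fasta paths whose N fraction is below the cutoff.

    The 0.10 default mirrors the "< 10% Ns" threshold mentioned above.
    """
    return sorted(
        path
        for path in Path(fasta_dir).glob("*.fasta")
        if n_fraction(path) < max_n_fraction
    )
```

Calling `keep_genomes("path/to/genomes")` returns the assemblies that pass the cutoff; anything not in that list would be left out of the input directory before running "rake all".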