chadlaing/Panseq

# Genomes

cabeaudoin opened this issue · 6 comments

Hello,

I hope you all are doing well.

I am trying to run more than 1,000 genomes through Panseq and have adapted the contig headers to the style listed in the README file, but the run still seems to report more than 40,000 genomes. I have listed an example of the headers found in one of my fasta files below. Any help would be greatly appreciated. Thanks!

$ grep ">" GCA_003398285.1_ASM339828v1_genomic_clean.fna | head

>lcl|GCA_003398285.1_ASM339828v1_genomic|contig1
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig2
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig3
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig4
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig5
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig6
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig7
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig8
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig9
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig10
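
For anyone checking their own input, here is a rough sketch of how one might count the distinct genome names implied by headers like these. It assumes every header starts with `>lcl|` and that the genome name is the second `|`-delimited field, as in the example above. `count_genome_names` is a hypothetical helper, not part of Panseq, and it does not reproduce how Panseq itself derives genome names.

```shell
# Hypothetical helper (not part of Panseq): count the distinct names found in
# the second "|"-delimited field of ">lcl|name|contig" headers.
count_genome_names() {
    # -h suppresses file names so only the header lines are passed on
    grep -h '^>lcl|' "$@" \
        | cut -d'|' -f2 \
        | sort -u \
        | awk 'END { print NR }'
}
```

It can be called on any set of fasta files, for example `count_genome_names *.fna` from inside the query directory. If the count matches the number of files, the headers themselves are probably not what inflates the genome count.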

Best,
Chris

Hi Chris,

That format does indeed look correct. Is it possible there is one malformed file somewhere?

Thanks,
Chad

Hey Chad,

Thank you so much for your quick response. I realized that the problem was simply having a "." in the filenames (other than the one in the .fna extension). I changed those to "_", as suggested in the FAQ on the website, and it seems to be working! Thanks again and sorry for the mistake.
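
In case it helps others, a minimal sketch of that rename. It assumes the inputs end in `.fna` and that only the dots before the final extension need replacing. `sanitize_name` and `rename_fasta_files` are hypothetical helper names, not Panseq commands.

```shell
# Hypothetical helper: replace every "." in the file stem with "_",
# leaving the final extension (e.g. ".fna") untouched.
sanitize_name() {
    local name="$1"
    local stem="${name%.*}"   # everything before the last "."
    local ext="${name##*.}"   # everything after the last "."
    if [ "$stem" = "$name" ]; then
        # No "." in the name at all: nothing to change
        printf '%s\n' "$name"
    else
        printf '%s.%s\n' "$(printf '%s' "$stem" | tr '.' '_')" "$ext"
    fi
}

# Hypothetical helper: apply sanitize_name to every *.fna file in a directory.
# Files whose new name already exists are left alone.
rename_fasta_files() {
    local dir="$1" f base new
    for f in "$dir"/*.fna; do
        [ -e "$f" ] || continue
        base="$(basename "$f")"
        new="$(sanitize_name "$base")"
        if [ "$base" != "$new" ] && [ ! -e "$dir/$new" ]; then
            mv -- "$f" "$dir/$new"
        fi
    done
}
```

For example, `sanitize_name GCA_003398285.1_ASM339828v1_genomic_clean.fna` prints `GCA_003398285_1_ASM339828v1_genomic_clean.fna`. Running `rename_fasta_files` on a copy of the query directory first is the safer option.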

Best,
Chris

Hi Chris,

I'm glad that it is working for you.

Thanks,
Chad

Hey Chad,

Sorry to be back so soon. I was just wondering if you might know what went wrong during my file execution. In my output directory, I seem to have gotten some ".index" files and some other stuff, but nothing listed from the "output files" section of the README could be found. I tried with just 5 genomes this time. Here is what the output directory looks like:

$ ls -1
944327aa4b46f91b61013c355fc4ee11_9e27ac23e31dbc367f71ac28f143a012_dbtemp.index
ab3814808bfe6fdc84cdb16686577d7c_1d982d7ec09527f4f932d28e466b72ac
GCA_003546285_1_ASM354628v1_genomic_dbtemp.index
GCA_003546425_1_ASM354642v1_genomic_dbtemp.index
GCA_003546445_1_ASM354644v1_genomic_dbtemp.index
Master.log
queryfile_dbtemp
queryfile_dbtemp.index
singleQueryFile.fasta
singleReferenceFile.fasta

And here's what my "settings.txt" file looks like, for reference:

queryDirectory /home/chris/Documents/genomesqueries
referenceDirectory /home/chris/Documents/genomes/reference
baseDirectory /home/chris/Documents/genomes/output
numberOfCores 5
mummerDirectory /home/chris/software/MUMmer3.23
blastDirectory /home/chris/software/ncbi-blast-2.7.1+/bin
minimumNovelRegionSize 500
novelRegionFinderMode unique
muscleExecutable /usr/bin/
fragmentationSize 500
percentIdentityCutoff 85
coreGenomeThreshold 5
runMode pan
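
One quick sanity check for a settings file like this is to confirm that each path exists and has the expected type. The sketch below assumes one whitespace-separated key/value pair per line and paths without spaces. Treating the `*Directory` keys as directories and `muscleExecutable` as a regular executable file is an assumption based on the key names alone. `check_settings` is a hypothetical helper, not part of Panseq.

```shell
# Hypothetical helper (not part of Panseq): report settings whose path is
# missing or of an unexpected type. Returns 1 if any problem is found.
check_settings() {
    local settings="$1" key value status=0
    # "|| [ -n "$key" ]" also handles a final line without a trailing newline
    while read -r key value || [ -n "$key" ]; do
        case "$key" in
            queryDirectory|referenceDirectory|mummerDirectory|blastDirectory)
                if [ ! -d "$value" ]; then
                    echo "$key: not a directory: $value"
                    status=1
                fi
                ;;
            muscleExecutable)
                # Expect a regular file with the execute bit set
                if [ ! -f "$value" ] || [ ! -x "$value" ]; then
                    echo "$key: not an executable file: $value"
                    status=1
                fi
                ;;
        esac
    done < "$settings"
    return "$status"
}
```

Called as `check_settings settings.txt`, this would flag the `muscleExecutable /usr/bin/` line above, because `/usr/bin/` is a directory rather than a file. Whether that is what stopped the run is not established in this thread.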

Any help would be greatly appreciated! Thank you for your time.

Best,
Chris

Hi Chris,

The program did not run to completion.
If you run the tests in t/output.t does everything pass?
It could be that one of the external programs isn't recognized.

Thanks,
Chad

Hey Chad,

Thanks very much again for the quick reply. Everything passes when I run t/output.t, and I even tried running just the test genomes using my setup, but I ended up getting the same results.

On the command line, I'm executing: perl panseq.pl settings.txt

I've attached my Master.log for some hopeful clarification. Any thoughts would be greatly appreciated.

Best,
Chris
Master.log