Error in step06: align genes in geneCluster by mafft and build gene trees
johanneswerner opened this issue · 5 comments
I tried to compare Sulfurimonas genomes but the workflow didn't finish successfully.
./panX.py -fn data/BS_Sulfurimonas -sl Sulfurimonas -t 28
(...)
====== starting step06: align genes in geneCluster by mafft and build gene trees
Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
myClusterCollector.estimate_raw_core_diversity()
File "/data/tools/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/data/tools/pan-genome-analysis/data/BS_Sulfurimonas/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'
Do you have any idea where the problem might originate and how I could solve it? If there is more information I can provide, please let me know.
I am including many draft genomes in my analysis. Could this be part of the problem?
Draft genomes per se are not a problem, but incomplete or very diverged genomes are. Try rerunning with -cg 0.7
so that all genes present in >70% of genomes are treated as core genes.
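To illustrate what relaxing the core-genome threshold does, here is a toy sketch. It is not panX's implementation: the function name `core_clusters`, the `presence` mapping, and the choice of `>=` at the boundary are all assumptions made for the example.

```python
def core_clusters(presence, n_strains, threshold):
    """Return IDs of clusters found in at least `threshold` (a fraction) of strains.

    `presence` maps a hypothetical cluster ID to the set of strains carrying it.
    Whether panX compares with > or >= at the boundary is not confirmed here.
    """
    min_strains = threshold * n_strains
    return sorted(cid for cid, strains in presence.items()
                  if len(strains) >= min_strains)


# Toy data: 4 strains, one of which is incomplete and lacks geneB.
presence = {
    "geneA": {"s1", "s2", "s3", "s4"},
    "geneB": {"s1", "s2", "s3"},
    "geneC": {"s1"},
}

strict_core = core_clusters(presence, n_strains=4, threshold=1.0)
relaxed_core = core_clusters(presence, n_strains=4, threshold=0.7)
print(strict_core)   # ['geneA']
print(relaxed_core)  # ['geneA', 'geneB']
```

With incomplete or diverged genomes the strict set can shrink to very few genes (or none), which is why lowering the threshold can help.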
Thank you for the information. I re-ran the workflow with the additional -cg 0.7
parameter, but now it breaks here:
====== starting step08: run fasttree and raxml for tree construction
fasttree time-cost: 4.51 minutes (270.51 seconds)
RAxML tree optimization within the timelimit of 30 minutes
RAxML branch length optimization and rooting
Traceback (most recent call last):
File "./panX.py", line 303, in <module>
myPangenome.build_core_tree()
File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
File "/data/tools/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
shutil.copy('RAxML_result.branches', out_fname)
File "/data/miniconda3/envs/panX/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/data/miniconda3/envs/panX/lib/python2.7/shutil.py", line 82, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'
Do you have any ideas?
Please check raxml.log.
Thank you for the information. The content of raxml.log is
below. Removing the two sequences it reports solved the problem.
Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization
RAxML can't, parse the alignment file as phylip file
it will now try to parse it as FASTA file
ERROR: Sequence GCA_002742735.1_UBA10385_genomic consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence GCA_002742775.1_UBA12504_genomic consists entirely of undetermined values which will be treated as missing data
ERROR: Found 2 sequences that consist entirely of undetermined values, exiting...
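For anyone hitting the same RAxML error, the fix described above (dropping sequences that consist entirely of undetermined characters) can be sketched as a small FASTA filter. This is a minimal sketch, not part of panX: the helper names `read_fasta` and `drop_undetermined` are hypothetical, and the set of characters counted as undetermined is an assumption that may need adjusting for your data.

```python
import io

# Assumption: characters treated as undetermined in a nucleotide alignment.
UNDETERMINED = set("-?NnXx")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_undetermined(records, undetermined=UNDETERMINED):
    """Split records into those with real data and the headers that were dropped."""
    kept, dropped = [], []
    for header, seq in records:
        # Keep a record only if it has at least one determined character.
        if seq and not set(seq) <= undetermined:
            kept.append((header, seq))
        else:
            dropped.append(header)
    return kept, dropped


# Toy alignment: strainB and strainC carry no determined positions.
demo = io.StringIO(u">strainA\nACGT--AC\n>strainB\n--------\n>strainC\nNNNN----\n")
kept, dropped = drop_undetermined(read_fasta(demo))
print(dropped)  # ['strainB', 'strainC']
```

The dropped headers tell you which genomes to inspect or exclude before rerunning the workflow. Which alignment file to check depends on your run and is not specified here.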