SegataLab/panphlan

panphlan_profiling.py: allow user to specify input files

For panphlan_profiling.py, the input (--i_dna) is specified as a directory, and panphlan_profiling.py then automatically finds the panphlan_map.py output csv files in it. However, it appears that panphlan_profiling.py simply looks for files ending in *.bz2, so if any other such files are located in the panphlan_map.py output directory (e.g., the initial PANGENOME.tar.bz2 file that was downloaded via panphlan_download_pangenome.py and not deleted after decompression), panphlan_profiling.py dies with an error like the following:

$ panphlan_profiling.py --i_dna Eubacterium_rectale --pangenome Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv --o_matrix  out_matrix --verbose

STEP 1. Processing genes informations from pangenome file...
     Number of reference genomes: 15
     Average number of gene-families per genome: 3042
     Total number of pangenome gene-families 11069

STEP 2. Create coverage matrix
 [I] Reading mapping result file: Eubacterium_rectale.tar.bz2
Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/15b5bc2e/bin/panphlan_profiling.py", line 763, in <module>
    main()
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/15b5bc2e/bin/panphlan_profiling.py", line 709, in main
    dna_samples_covs = read_map_results(args.i_dna, args.verbose)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/15b5bc2e/bin/panphlan_profiling.py", line 286, in read_map_results
    dna_samples_covs[dna_sample_id] = read_gene_cov_file(os.path.join(i_dna, dna_covs_file))
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/15b5bc2e/bin/panphlan_profiling.py", line 274, in read_gene_cov_file
    gene, coverage = words[0], int(words[1])
IndexError: list index out of range
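
To illustrate the failure mode above, here is a minimal sketch of a stricter directory scan. It is not the actual panphlan_profiling.py code: the function name `find_mapping_files` is hypothetical, and the `.csv.bz2` suffix is an assumption about how the panphlan_map.py output files are named.

```python
import os


def find_mapping_files(directory, suffix=".csv.bz2"):
    """Return sorted paths of regular files in `directory` ending with `suffix`.

    Hypothetical helper; the default suffix is an assumption about the
    naming of panphlan_map.py output files.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and anything without the expected suffix,
        # such as a leftover pangenome .tar.bz2 archive.
        if os.path.isfile(path) and name.endswith(suffix):
            paths.append(path)
    return paths
```

With a filter like this, a leftover archive such as Eubacterium_rectale.tar.bz2 would be skipped rather than passed to the coverage parser.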

It would be helpful to allow the user to provide a list of input files via a text file. The alternative of accepting a comma-separated list of file paths via a CLI parameter can be problematic, since long file paths can produce commands that are too long (e.g., when processing hundreds or thousands of samples).
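
As a rough sketch of the text-file approach, assuming one path per line: the helper name `read_file_list` and the handling of blank lines and `#` comments are my own assumptions, not existing PanPhlAn behavior.

```python
import os


def read_file_list(list_path):
    """Read input file paths from a text file, one path per line.

    Hypothetical helper: blank lines and lines starting with '#' are
    ignored, and any listed path that does not exist raises an error.
    """
    paths = []
    with open(list_path) as handle:
        for line in handle:
            entry = line.strip()
            # Ignore blank lines and comment lines.
            if not entry or entry.startswith("#"):
                continue
            paths.append(entry)
    # Fail early with a clear message instead of a parser traceback later.
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(
            "Listed input files not found: " + ", ".join(missing)
        )
    return paths
```

The returned list could then be used in place of the directory listing, so that only the files the user names are read, regardless of what else is in the directory.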