nf-core/hlatyping

OptiTypePipeline.py error "Could not retrieve index file"

ibwoo opened this issue · 8 comments

ibwoo commented

I'm using Nextflow v21.02.0-edge, nf-core hlatyping pipeline v1.2.0 and Singularity v3.6.4-1.el7.
The test_rna and test profiles work as expected for me. However, running paired-end RNA fastq data through the pipeline leads to the following error:

Error executing process > 'run_optitype (1)'
Caused by:
  Process `run_optitype (1)` terminated with an error exit status (140)
Command executed:
  OptiTypePipeline.py -i mapped_1.bam mapped_2.bam -e 1 -b 0.009 \
      -p "994-BB-5-GAATTCGT-ATAGAGGC_S402" -c config.ini --rna --outdir 994-BB-5-GAATTCGT-ATAGAGGC_S402
Command exit status:
  140
Command output:
  (empty)
Command error:
  [E::idx_find_and_load] Could not retrieve index file for 'mapped_1.bam'

I've tried running these samples with the sge executor on our cluster and also in local mode after logging in to a session on a work node, which made no difference. I've also tried spawning an interactive shell within the Singularity container and executing the same OptiTypePipeline.py command, but get the same error.

@christopher-mohr keenly identified this issue which seems to match the version in the Singularity container, although I'm not sure that I understand why the test_rna profile would work for me.

Any thoughts would be much appreciated!

ibwoo commented

Investigating this further, Nextflow does seem to abort the OptiType process early after that pysam warning appears. If OptiType is allowed to continue (by running it directly, in verbose mode) it completes successfully and the results match the expected HLA type for the sample. I don't know why this happens on my RNA samples but the pipeline works fine on the test_rna dataset.

Perhaps OptiType could internally change pysam verbosity , Pysam could downgraded below 0.16.0.1 for this pipeline or the pipeline could be modified to ignore this specific Pysam warning.

Please see the verbose output below:

Singularity> OptiTypePipeline.py -i mapped_1.bam mapped_2.bam -e 1 -b 0.009 -p "NEOTUM17" -c config.ini --rna --outdir NEOTUM17 -v
 0:00:01.10 Generating binary hit matrix.
[E::idx_find_and_load] Could not retrieve index file for 'mapped_1.bam'
0:00:01.17 Loading mapped_1.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K...6K...7K...8K...9K...10K...11K...12K...13K...14K...15K...16K...17K...18K...19K...20K...21K...22K...23K...24K...25K...26K...27K...28K...29K...
 0:00:04.62 29985 reads loaded. Creating dataframe...
0:02:22.82 Dataframes created. Shape: 29985 x 7339, hits: 4643011 (4643011), sparsity: 1 in 47.40
[E::idx_find_and_load] Could not retrieve index file for 'mapped_2.bam'
0:02:24.55 Loading mapped_2.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K...6K...7K...8K...9K...10K...11K...12K...13K...14K...15K...16K...17K...18K...19K...20K...21K...22K...23K...24K...25K...26K...27K...28K...29K...30K...
 0:02:27.92 30043 reads loaded. Creating dataframe...
0:04:42.75 Dataframes created. Shape: 30043 x 7339, hits: 4601971 (4601971), sparsity: 1 in 47.91
0:04:47.38 Alignment pairing completed. 17345 paired, 25210 unpaired, 64 discordant
 0:04:48.64 temporary pruning of identical rows and columns
 0:04:48.86 Size of mtx with unique rows and columns: (1740, 459)
0:04:48.86 determining minimal set of non-overshadowed alleles
 0:04:49.41 Keeping only the minimal number of required alleles (80,)
 0:04:49.42 Creating compact model...
starting ilp solver with 1 threads...
 0:04:50.03 Initializing OptiType model...
WARNING: Initializing ordered Set R with a fundamentally unordered data source
    (type: set).  This WILL potentially lead to nondeterministic behavior in
    Pyomo
GLPSOL: GLPK LP/MIP Solver, v4.65
Parameter(s) specified in the command line:
 --write /tmp/tmppu_fgno6.glpk.raw --wglp /tmp/tmpvw673cfz.glpk.glp --cpxlp
 /tmp/tmp2sow_3ri.pyomo.lp
Reading problem data from '/tmp/tmp2sow_3ri.pyomo.lp'...
/tmp/tmp2sow_3ri.pyomo.lp:12145: warning: lower bound of variable 'x1' redefined
/tmp/tmp2sow_3ri.pyomo.lp:12145: warning: upper bound of variable 'x1' redefined
1554 rows, 854 columns, 5849 non-zeros
466 integer variables, all of which are binary
12611 lines were read
Writing problem data to '/tmp/tmpvw673cfz.glpk.glp'...
10974 lines were written
GLPK Integer Optimizer, v4.65
1554 rows, 854 columns, 5849 non-zeros
466 integer variables, all of which are binary
Preprocessing...
328 hidden covering inequaliti(es) were detected
1553 rows, 853 columns, 5848 non-zeros
466 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  4.000e+00  ratio =  4.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 1553
Solving LP relaxation...
GLPK Simplex Optimizer, v4.65
1553 rows, 853 columns, 5848 non-zeros
      0: obj =  -0.000000000e+00 inf =   4.000e+00 (4)
      4: obj =  -0.000000000e+00 inf =   0.000e+00 (0)
*   753: obj =   1.473997700e+04 inf =   3.642e-14 (0)
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+   753: mip =     not found yet <=              +inf        (1; 0)
+   753: >>>>>   1.473997700e+04 <=   1.473997700e+04   0.0% (1; 0)
+   753: mip =   1.473997700e+04 <=     tree is empty   0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   0.0 secs
Memory used: 2.4 Mb (2542279 bytes)
Writing MIP solution to '/tmp/tmppu_fgno6.glpk.raw'...
2417 lines were written
 0:04:50.81 Result dataframe has been constructed...
Singularity> ls
NEOTUM17  config.ini  mapped_1.bam  mapped_2.bam
Singularity> ls NEOTUM17/
NEOTUM17_coverage_plot.pdf  NEOTUM17_result.tsv

Thank you for looking into this - I'm going to have a look at this during next weeks hackathon and will work on a revamp of the entire pipeline during that time: https://nf-co.re/events/2021/hackathon-march-2021

If you want to join, please feel free to do so :-)

I ran the pipeline with RNA data and it worked although I have the reported warnings:

less work/46/4c8773ed0b96dc36385a06b1bf926f/.command.err 
[E::idx_find_and_load] Could not retrieve index file for 'mapped_1.bam'
[E::idx_find_and_load] Could not retrieve index file for 'mapped_2.bam'

Potential reasons for the error which causes the pipeline failure I can think of at the moment is a resource issue or the Nextflow version.

Potential reasons for the error which causes the pipeline failure I can think of at the moment is a resource issue or the Nextflow version.

It worked with NF version 19.10.0 build 5170 and NF version 20.10.0 build 5430, which has been used @ibwoo. So this is not the problem here.

Maybe the pysam version being used with optitype is not fully fine then? OptiType might use 0.15 whereas Bioconda ships with 0.16+ now?

Maybe the pysam version being used with optitype is not fully fine then? OptiType might use 0.15 whereas Bioconda ships with 0.16+ now?

But I used the hlatyping pipeline and it worked. So the pysam version shouldn't be the problem.

Its a warning anyways, so I think shouldn't be a big issue then. Anyways, we'll have a look next week to make sure this is no show-stopper...

We will close this for now as we could not reproduce it (anymore). Please reopen if the problem reoccurs.