JaneliaSciComp/multifish

demo_small.sh numpy error


Hello,

Bug report

I want to run my own images through the whole workflow (and also with certain steps skipped), so I first wanted to run the examples. I have run demo_tiny.sh and it completes without any issues.

Description of the problem

When I run demo_small.sh with the same parameters as demo_tiny.sh (just the defaults, i.e. ./examples/demo_small.sh ../../new_demo/), I get an error at the run_airlocalize step, which worked fine in demo_tiny.sh. The error points at the NumPy installation, but since Python runs from inside a Singularity container, I am not sure whether I can fix that locally.

Log file(s)

Output of demo_small.sh:

N E X T F L O W  ~  version 22.08.0-edge
Launching `./main.nf` [thirsty_morse] DSL2 - revision: 5c93cdfd3b

===================================
EASI-FISH ANALYSIS PIPELINE
===================================

Pipeline parameters
-------------------
workDir                : /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work
data_manifest          : demo_small
shared_work_dir        : /local1/scratch/erkin/projects/vatsi_multifish/new_demo
segmentation_model_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs/model/starfinity
data_dir               : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs
output_dir             : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs
publish_dir            : 
acq_names              : [LHA3_R3_small, LHA3_R5_small]
ref_acq                : LHA3_R3_small
steps_to_skip          : []

executor >  local (92)
[83/486fb0] process > download (1)                                                  [100%] 1 of 1 ✔
[36/32e6b9] process > stitching:prepare_stitching_data (1)                          [100%] 2 of 2 ✔
[56/7ee694] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2)     [100%] 2 of 2 ✔
[e9/5f1cb5] process > stitching:stitch:spark_cluster:spark_master (1)               [ 50%] 1 of 2
[f9/0742a6] process > stitching:stitch:spark_cluster:wait_for_master (2)            [100%] 2 of 2 ✔
[e0/6e692a] process > stitching:stitch:spark_cluster:spark_worker (1)               [ 50%] 1 of 2
[d6/04da04] process > stitching:stitch:spark_cluster:wait_for_worker (1)            [100%] 2 of 2 ✔
[d2/38475b] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2)      [100%] 2 of 2 ✔
[7c/b4a4d4] process > stitching:stitch:run_czi2n5:spark_start_app (2)               [100%] 2 of 2 ✔
[59/45baec] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[03/fc9ed7] process > stitching:stitch:run_retile:spark_start_app (2)               [100%] 2 of 2 ✔
[de/ee2737] process > stitching:stitch:run_stitching:spark_start_app (2)            [100%] 2 of 2 ✔
[75/05d6ee] process > stitching:stitch:run_fuse:spark_start_app (1)                 [ 50%] 1 of 2
[01/36daf7] process > stitching:stitch:terminate_stitching (1)                      [100%] 1 of 1
[ab/4e8cda] process > spot_extraction:airlocalize:cut_tiles (1)                     [100%] 1 of 1
[0d/46aec0] process > spot_extraction:airlocalize:run_airlocalize (16)              [  0%] 0 of 64
[-        ] process > spot_extraction:airlocalize:merge_points                      -
[-        ] process > segmentation:predict                                          -
executor >  local (92)
[83/486fb0] process > download (1)                                                  [100%] 1 of 1 ✔
[36/32e6b9] process > stitching:prepare_stitching_data (1)                          [100%] 2 of 2 ✔
[56/7ee694] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2)     [100%] 2 of 2 ✔
[72/9bd52e] process > stitching:stitch:spark_cluster:spark_master (2)               [100%] 1 of 1
[f9/0742a6] process > stitching:stitch:spark_cluster:wait_for_master (2)            [100%] 2 of 2 ✔
[ab/54cebf] process > stitching:stitch:spark_cluster:spark_worker (2)               [100%] 1 of 1
[d6/04da04] process > stitching:stitch:spark_cluster:wait_for_worker (1)            [100%] 2 of 2 ✔
[d2/38475b] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2)      [100%] 2 of 2 ✔
[7c/b4a4d4] process > stitching:stitch:run_czi2n5:spark_start_app (2)               [100%] 2 of 2 ✔
[59/45baec] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[03/fc9ed7] process > stitching:stitch:run_retile:spark_start_app (2)               [100%] 2 of 2 ✔
[de/ee2737] process > stitching:stitch:run_stitching:spark_start_app (2)            [100%] 2 of 2 ✔
[81/680763] process > stitching:stitch:run_fuse:spark_start_app (2)                 [100%] 1 of 1
[01/36daf7] process > stitching:stitch:terminate_stitching (1)                      [100%] 1 of 1
[ab/4e8cda] process > spot_extraction:airlocalize:cut_tiles (1)                     [ 50%] 1 of 2
[0b/26c25f] process > spot_extraction:airlocalize:run_airlocalize (15)              [ 11%] 7 of 60, failed: 7
[-        ] process > spot_extraction:airlocalize:merge_points                      -
[-        ] process > segmentation:predict                                          [  0%] 0 of 1
[-        ] process > registration:cut_tiles                                        [  0%] 0 of 1
[-        ] process > registration:fixed_coarse_spots                               [  0%] 0 of 1
[-        ] process > registration:moving_coarse_spots                              [  0%] 0 of 1
[-        ] process > registration:coarse_ransac                                    -
[-        ] process > registration:apply_transform_at_aff_scale                     -
[-        ] process > registration:apply_transform_at_def_scale                     -
[-        ] process > registration:fixed_spots                                      -
[-        ] process > registration:moving_spots                                     -
[-        ] process > registration:ransac_for_tile                                  -
[-        ] process > registration:interpolate_affines                              -
[-        ] process > registration:deform                                           -
[-        ] process > registration:stitch                                           -
[-        ] process > registration:final_transform                                  -
[20/b5b386] process > collect_merge_points:collect_merged_points_files (1)          [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                    -
[-        ] process > measure_intensities                                           -
[-        ] process > assign_spots                                                  -
Error executing process > 'spot_extraction:airlocalize:run_airlocalize (50)'

Caused by:
  Process `spot_extraction:airlocalize:run_airlocalize (50)` terminated with an error exit status (139)

Command executed:

  export SCRATCH_DIR=$PROCESS_DIR
  echo "SCRATCH_DIR: $SCRATCH_DIR"
  echo "/app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt"
  /app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt

Command exit status:
  139

Command output:
  SCRATCH_DIR: /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/3a/fce83861fc8fad0cc0c1210f89e27d
  /app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt
  Creating MCR_CACHE_ROOT mcr_cache_209456
  Running AirLocalize
  Cleaned up temporary files at mcr_cache_209456

Command error:
    File "/miniconda/lib/python3.6/site-packages/numpy/core/__init__.py", line 22, in <module>
      from . import multiarray
    File "/miniconda/lib/python3.6/site-packages/numpy/core/multiarray.py", line 12, in <module>
      from . import overrides
    File "/miniconda/lib/python3.6/site-packages/numpy/core/overrides.py", line 7, in <module>
      from numpy.core._multiarray_umath import (
  ImportError: PyCapsule_Import could not import module "datetime"
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/app/airlocalize/scripts/air_localize_mcr.py", line 2, in <module>
      import zarr
    File "/miniconda/lib/python3.6/site-packages/zarr/__init__.py", line 2, in <module>
      from zarr.codecs import *
    File "/miniconda/lib/python3.6/site-packages/zarr/codecs.py", line 2, in <module>
      from numcodecs import *
    File "/miniconda/lib/python3.6/site-packages/numcodecs/__init__.py", line 27, in <module>
      from numcodecs.zlib import Zlib
    File "/miniconda/lib/python3.6/site-packages/numcodecs/zlib.py", line 5, in <module>
      from .compat import ndarray_copy, ensure_contiguous_ndarray
    File "/miniconda/lib/python3.6/site-packages/numcodecs/compat.py", line 7, in <module>
      import numpy as np
    File "/miniconda/lib/python3.6/site-packages/numpy/__init__.py", line 140, in <module>
      from . import core
    File "/miniconda/lib/python3.6/site-packages/numpy/core/__init__.py", line 48, in <module>
      raise ImportError(msg)
  ImportError: 
  
  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
  
  Importing the numpy C-extensions failed. This error can happen for
  many reasons, often due to issues with your setup or how NumPy was
  installed.
  
  We have compiled some common reasons and troubleshooting tips at:
  
      https://numpy.org/devdocs/user/troubleshooting-importerror.html
  
  Please note and check the following:
  
    * The Python version is: Python3.6 from "/miniconda/bin/python"
    * The NumPy version is: "1.19.4"
  
  and make sure that they are the versions you expect.
  Please carefully study the documentation linked above for further help.
  
  Original error was: PyCapsule_Import could not import module "datetime"
  
  /app/airlocalize/airlocalize.sh: line 22: 209851 Segmentation fault      python /app/airlocalize/scripts/air_localize_mcr.py $*

Work dir:
  /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/3a/fce83861fc8fad0cc0c1210f89e27d

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

The nextflow.log is attached below.

Environment

  • EASI-FISH Pipeline version: 2.4.0? Freshly cloned from GitHub, and git pull says it's up to date :)
  • Nextflow version: 22.08.0-edge
  • Container runtime: Singularity
  • Platform: Local
  • Operating system: Linux - RHEL 7

Additional context

As additional context, I originally wanted to run some of my own images, but that failed, which is why I wanted to make sure I can run all the examples first. My own data fails at the stitching step, where the Spark workers give an out-of-memory error. I was hoping to reproduce this by using the bigger examples, but to my surprise I got stuck at a different step :)

nextflow.log

Thanks for reporting this. That is a very odd error, since the Singularity container is supposed to isolate the code and its dependencies to prevent exactly this type of issue. I've never seen anything like this before, despite having run the demos on many different systems. It also doesn't make sense that it would only happen on the small data and not the tiny data, since the code is the same in either case.

Until we can reproduce this, there's not much we can do, but we'll keep an eye on it.

In the meantime, can you try using RS-FISH for spot detection instead?
https://janeliascicomp.github.io/multifish/modules/RS-FISH.html
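
If you want to poke at it directly, one thing worth trying (the image path below is just a placeholder; substitute whatever airlocalize image Nextflow pulled into your Singularity cache) is to see whether the import still fails when the host environment and home directory are kept out of the container:

# With the host environment and home directory isolated
singularity exec --cleanenv --no-home /path/to/cache/multifish-airlocalize.sif \
    python -c "import numpy, zarr; print(numpy.__version__)"

# Same thing without isolation, the way the pipeline normally runs it
singularity exec /path/to/cache/multifish-airlocalize.sif \
    python -c "import numpy, zarr; print(numpy.__version__)"

If the first succeeds and the second segfaults, then something on the host (e.g. a PYTHONPATH or a ~/.local site-packages) is leaking into the container.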

Thank you for your reply @krokicki. It is puzzling me too; I thought maybe the container had been updated for tiny but not for small, but I guess that's not the case. I'm not sure whether something in my home folder or my environment variables might be causing this. If it's possible to pass options through, I can try running the containers with --no-home and --cleanenv.
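
To be concrete, this is roughly what I had in mind (assuming extra Nextflow options get forwarded by the demo script the same way the --rsfish_* flags are, and noting that runOptions would replace any options the pipeline already sets; the config file name is made up):

# Hypothetical override config adding isolation flags to every "singularity exec"
cat > singularity_isolation.config <<'EOF'
singularity {
    // extra flags for the container runtime; may need the pipeline's
    // existing runOptions appended if it already sets any
    runOptions = '--cleanenv --no-home'
}
EOF

./examples/demo_small.sh ../../new_demo/ -c $PWD/singularity_isolation.config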

I have also tried RS-FISH, which gave me a different kind of error: the Spark worker ran out of memory. I see the Java Xmx is set to 1 GB, and playing with the worker memory and driver memory parameters didn't help. I attach the log files and paste the stdout below. Thank you once again for your assistance :)


examples/demo_small.sh ../../new_demo/ --use_rsfish --rsfish_min 0 --rsfish_max 4096 --rsfish_anisotropy 0.7 --rsfish_sigma 1.5
N E X T F L O W  ~  version 22.08.0-edge
Launching `./main.nf` [tender_ekeblad] DSL2 - revision: 5c93cdfd3b

===================================
EASI-FISH ANALYSIS PIPELINE
===================================

Pipeline parameters
-------------------
workDir                : /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work
data_manifest          : demo_small
shared_work_dir        : /local1/scratch/erkin/projects/vatsi_multifish/new_demo
segmentation_model_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs/model/starfinity
data_dir               : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs
output_dir             : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs
publish_dir            : 
acq_names              : [LHA3_R3_small, LHA3_R5_small]
ref_acq                : LHA3_R3_small
steps_to_skip          : []

executor >  local (79)
[82/caa9c7] process > download (1)                                                  [100%] 1 of 1 ✔
[73/834616] process > stitching:prepare_stitching_data (2)                          [100%] 2 of 2 ✔
[80/814fb3] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1)     [100%] 2 of 2 ✔
[41/5d599b] process > stitching:stitch:spark_cluster:spark_master (2)               [100%] 2 of 2 ✔
[13/6045e7] process > stitching:stitch:spark_cluster:wait_for_master (2)            [100%] 2 of 2 ✔
[34/4c9778] process > stitching:stitch:spark_cluster:spark_worker (2)               [100%] 2 of 2 ✔
[d3/b431eb] process > stitching:stitch:spark_cluster:wait_for_worker (2)            [100%] 2 of 2 ✔
[fe/c77394] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2)      [100%] 2 of 2 ✔
[9a/b8ef1f] process > stitching:stitch:run_czi2n5:spark_start_app (1)               [100%] 2 of 2 ✔
[0e/c22d54] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[7d/3551c8] process > stitching:stitch:run_retile:spark_start_app (2)               [100%] 2 of 2 ✔
[c0/cb34b3] process > stitching:stitch:run_stitching:spark_start_app (2)            [100%] 2 of 2 ✔
[c5/e585a9] process > stitching:stitch:run_fuse:spark_start_app (2)                 [100%] 2 of 2 ✔
[ec/9b4340] process > stitching:stitch:terminate_stitching (2)                      [100%] 2 of 2 ✔
[5e/00a3bd] process > spot_extraction:rsfish:prepare_spots_dirs (2)                 [100%] 2 of 2 ✔
[9a/36a9c7] process > spot_extraction:rsfish:spark_cluster:prepare_spark_work_dir   [100%] 1 of 1 ✔
[9e/d1f251] process > spot_extraction:rsfish:spark_cluster:spark_master             [  0%] 0 of 1
[5e/aedb70] process > spot_extraction:rsfish:spark_cluster:wait_for_master          [100%] 1 of 1 ✔
[e2/55fb98] process > spot_extraction:rsfish:spark_cluster:spark_worker (3)         [  0%] 0 of 6
[e2/85b243] process > spot_extraction:rsfish:spark_cluster:wait_for_worker (2)      [100%] 6 of 6 ✔
[d3/1ff87e] process > spot_extraction:rsfish:run_rsfish:spark_start_app (1)         [  0%] 0 of 2
[-        ] process > spot_extraction:rsfish:terminate_rsfish                       -
[-        ] process > spot_extraction:rsfish:postprocess_spots                      -
[30/8a2d61] process > segmentation:predict (1)                                      [  0%] 0 of 1
[7b/cfeb5c] process > registration:cut_tiles (1)                                    [100%] 1 of 1 ✔
[26/eec29b] process > registration:fixed_coarse_spots (1)                           [100%] 1 of 1 ✔
[8c/8dff84] process > registration:moving_coarse_spots (1)                          [100%] 1 of 1 ✔
[d3/cc2e72] process > registration:coarse_ransac (1)                                [100%] 1 of 1 ✔
[8f/f5fec1] process > registration:apply_transform_at_aff_scale (1)                 [100%] 1 of 1 ✔
[7d/3da1c8] process > registration:apply_transform_at_def_scale (1)                 [100%] 1 of 1 ✔
[7f/6e98e5] process > registration:fixed_spots (5)                                  [100%] 6 of 6 ✔
[d6/37c18e] process > registration:moving_spots (5)                                 [100%] 6 of 6 ✔
[3f/b7faf0] process > registration:ransac_for_tile (6)                              [100%] 6 of 6 ✔
[48/f77ee5] process > registration:interpolate_affines (1)                          [100%] 1 of 1 ✔
[0e/f2d476] process > registration:deform (6)                                       [  0%] 0 of 6
[-        ] process > registration:stitch                                           -
[-        ] process > registration:final_transform                                  -
[85/f9db2b] process > collect_merge_points:collect_merged_points_files (1)          [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                    -
[-        ] process > measure_intensities                                           -
[-        ] process > assign_spots                                                  -
Error executing process > 'spot_extraction:rsfish:spark_cluster:spark_worker (4)'

Caused by:
executor >  local (79)
[82/caa9c7] process > download (1)                                                  [100%] 1 of 1 ✔
[73/834616] process > stitching:prepare_stitching_data (2)                          [100%] 2 of 2 ✔
[80/814fb3] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1)     [100%] 2 of 2 ✔
[41/5d599b] process > stitching:stitch:spark_cluster:spark_master (2)               [100%] 2 of 2 ✔
[13/6045e7] process > stitching:stitch:spark_cluster:wait_for_master (2)            [100%] 2 of 2 ✔
[34/4c9778] process > stitching:stitch:spark_cluster:spark_worker (2)               [100%] 2 of 2 ✔
[d3/b431eb] process > stitching:stitch:spark_cluster:wait_for_worker (2)            [100%] 2 of 2 ✔
[fe/c77394] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2)      [100%] 2 of 2 ✔
[9a/b8ef1f] process > stitching:stitch:run_czi2n5:spark_start_app (1)               [100%] 2 of 2 ✔
[0e/c22d54] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[7d/3551c8] process > stitching:stitch:run_retile:spark_start_app (2)               [100%] 2 of 2 ✔
[c0/cb34b3] process > stitching:stitch:run_stitching:spark_start_app (2)            [100%] 2 of 2 ✔
[c5/e585a9] process > stitching:stitch:run_fuse:spark_start_app (2)                 [100%] 2 of 2 ✔
[ec/9b4340] process > stitching:stitch:terminate_stitching (2)                      [100%] 2 of 2 ✔
[5e/00a3bd] process > spot_extraction:rsfish:prepare_spots_dirs (2)                 [100%] 2 of 2 ✔
[9a/36a9c7] process > spot_extraction:rsfish:spark_cluster:prepare_spark_work_dir   [100%] 1 of 1 ✔
[-        ] process > spot_extraction:rsfish:spark_cluster:spark_master             -
[5e/aedb70] process > spot_extraction:rsfish:spark_cluster:wait_for_master          [100%] 1 of 1 ✔
[50/d57fd4] process > spot_extraction:rsfish:spark_cluster:spark_worker (1)         [100%] 3 of 3, failed: 3
[e2/85b243] process > spot_extraction:rsfish:spark_cluster:wait_for_worker (2)      [100%] 6 of 6 ✔
[-        ] process > spot_extraction:rsfish:run_rsfish:spark_start_app (1)         -
[-        ] process > spot_extraction:rsfish:terminate_rsfish                       -
[-        ] process > spot_extraction:rsfish:postprocess_spots                      -
[-        ] process > segmentation:predict (1)                                      -
[7b/cfeb5c] process > registration:cut_tiles (1)                                    [100%] 1 of 1 ✔
[26/eec29b] process > registration:fixed_coarse_spots (1)                           [100%] 1 of 1 ✔
[8c/8dff84] process > registration:moving_coarse_spots (1)                          [100%] 1 of 1 ✔
[d3/cc2e72] process > registration:coarse_ransac (1)                                [100%] 1 of 1 ✔
[8f/f5fec1] process > registration:apply_transform_at_aff_scale (1)                 [100%] 1 of 1 ✔
[7d/3da1c8] process > registration:apply_transform_at_def_scale (1)                 [100%] 1 of 1 ✔
[7f/6e98e5] process > registration:fixed_spots (5)                                  [100%] 6 of 6 ✔
[d6/37c18e] process > registration:moving_spots (5)                                 [100%] 6 of 6 ✔
[3f/b7faf0] process > registration:ransac_for_tile (6)                              [100%] 6 of 6 ✔
[48/f77ee5] process > registration:interpolate_affines (1)                          [100%] 1 of 1 ✔
[-        ] process > registration:deform (6)                                       -
[-        ] process > registration:stitch                                           -
[-        ] process > registration:final_transform                                  -
[85/f9db2b] process > collect_merge_points:collect_merged_points_files (1)          [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                    -
[-        ] process > measure_intensities                                           -
[-        ] process > assign_spots                                                  -
Error executing process > 'spot_extraction:rsfish:spark_cluster:spark_worker (4)'

Caused by:
  Process `spot_extraction:rsfish:spark_cluster:spark_worker (4)` terminated with an error exit status (1)

Command executed:

  echo "Starting spark worker 4 - logging to /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log"
  
  SESSION_FILE="/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/.sessionId"   
  echo "Checking for $SESSION_FILE"
  SLEEP_SECS=10
  MAX_WAIT_SECS=7200
  SECONDS=0
  
  while ! test -e "$SESSION_FILE"; do
      sleep ${SLEEP_SECS}
      if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
          echo "Waiting for $SESSION_FILE"
          SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
      else
          echo "-------------------------------------------------------------------------------"
          echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
          echo "-------------------------------------------------------------------------------"
          exit 1
      fi
  done
  
  if ! grep -F -x -q "cccc4a79-0287-40a0-ac0e-0a23b5686344" $SESSION_FILE
  then
      echo "------------------------------------------------------------------------------"
      echo "ERROR: session id in $SESSION_FILE does not match current session            "
      echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
      echo "and that you are not running multiple pipelines with the same --spark_work_dir"
      echo "------------------------------------------------------------------------------"
      exit 1
  fi
  
  
  rm -f /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log || true
  
  export SPARK_ENV_LOADED=
  export SPARK_HOME=/spark
  export PYSPARK_PYTHONPATH_SET=
  export PYTHONPATH="/spark/python"
  export SPARK_LOG_DIR="/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752"
  
      export SPARK_WORKER_OPTS=" -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30 -Dspark.worker.cleanup.appDataTtl=1 -Dspark.port.maxRetries=64"
  
  . "/spark/sbin/spark-config.sh"
  . "/spark/bin/load-spark-env.sh"
  
  
  SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
  echo "Use Spark IP: $SPARK_LOCAL_IP"
  
  
  echo "    /spark/bin/spark-class org.apache.spark.deploy.worker.Worker     spark://172.17.0.1:7077     -c 8     -m 120G     -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752     -h $SPARK_LOCAL_IP     --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf     "
  
  /spark/bin/spark-class org.apache.spark.deploy.worker.Worker     spark://172.17.0.1:7077     -c 8     -m 120G     -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752     -h $SPARK_LOCAL_IP     --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf     &> /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log &
  spid=$!
  
  trap "kill -9 $spid" EXIT
  
  while true; do
  
      if ! kill -0 $spid >/dev/null 2>&1; then
          echo "Process $spid died"
          exit 1
      fi
  
      if [[ -e "/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/terminate-rsfish" ]] ; then
          break
      fi
  
      sleep 1
  done

Command exit status:
  1

Command output:
  Starting spark worker 4 - logging to /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log
  Checking for /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/.sessionId
  Use Spark IP: 172.17.0.1
      /spark/bin/spark-class org.apache.spark.deploy.worker.Worker     spark://172.17.0.1:7077     -c 8     -m 120G     -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752     -h 172.17.0.1     --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf     
  Process 63713 died

Command error:
  INFO:    Could not find any nv files on this host!
  .command.sh: line 1: kill: (63713) - No such process

Work dir:
  /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/01/ceb8676a2d508b5c1dd860c7b29e54

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`


sparkworker-4.log
nextflow2.log

Your RS-FISH spark cluster size is shown in the nextflow.log:

Aug-11 13:22:43.061 [Actor Thread 126] DEBUG nextflow.Nextflow - Spark cluster started:
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow -   Spark work directory: /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow -   Number of workers: 6
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow -   Cores per worker: 8
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow -   GB per worker core: 15

This is a large cluster with 6 workers * 120 GB = 720 GB total memory. According to the log, you were able to run stitching on this data with 1 worker * 4 cores * 12 GB per core = 48 GB total memory, so your cluster parameters for RS-FISH should be more than large enough. RS-FISH usually requires less memory than stitching, not more.

Going back to the error, java.lang.OutOfMemoryError: unable to create new native thread sounds like a memory problem, but the message is misleading: it is actually a system resource issue, namely the JVM being unable to create new native (OS-level) threads.

It looks like you are running the pipeline on a single system, not a cluster, correct? Are you running anything else that might be using a lot of native threads on that system? You should also check the system limits to make sure there is a decent allowance on user processes, e.g. http://www.mastertheboss.com/jbossas/monitoring/how-to-solve-javalangoutofmemoryerror-unable-to-create-new-native-thread/
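
For reference, here are a few quick host-side checks (plain Linux commands, nothing pipeline-specific) that will show whether you are close to a process/thread ceiling:

# Per-user limit on processes/threads for your account
ulimit -u

# System-wide ceiling on threads
cat /proc/sys/kernel/threads-max

# Threads currently running on the whole machine
ps -eLf | wc -l

# Threads belonging to your own processes
ps -eLf | awk -v u="$USER" '$1 == u' | wc -l

If the per-user limit (or the system-wide count) is close to the number of threads already in use, the worker JVMs cannot start their thread pools and you get exactly this OutOfMemoryError.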

In terms of parameters, I would suggest reducing the number of workers and cores per worker, and increasing only the GB per worker core, so that you still have at least 48 GB total memory in the cluster. I usually start with the same parameters as the stitching cluster and reduce the size from there.
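
For example (I'm writing the RS-FISH Spark parameter names from memory here, so please double-check them against nextflow.config before running), something along these lines would give a 1 worker * 4 cores * 12 GB = 48 GB cluster:

./examples/demo_small.sh ../../new_demo/ --use_rsfish \
    --rsfish_min 0 --rsfish_max 4096 --rsfish_anisotropy 0.7 --rsfish_sigma 1.5 \
    --rsfish_workers 1 --rsfish_worker_cores 4 --rsfish_gb_per_core 12

That keeps the total memory at the level that worked for stitching while starting far fewer worker JVMs, and therefore far fewer native threads.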

Thank you once again for the reply, and also for the link! Indeed, I am running this on a single system, a RHEL 7 server. I hadn't thought about thread limits at all; while we have enough cores and memory, there are many other people using the server, so it is possible we have hit a thread limit. I will experiment with this and give an update.