demo_small.sh numpy error
Closed this issue · 4 comments
Hello,
Bug report
I want to run my own images through the whole workflow (and with certain steps skipped), so I first wanted to work through the examples. I have run demo_tiny.sh and it completes without any issues.
Description of the problem
When I run demo_small.sh with the same parameters as demo_tiny.sh (just the defaults: `./examples/demo_small.sh ../../new_demo/`), I get an error at the run_airlocalize step, which worked fine in demo_tiny.sh. The error claims a NumPy version problem, but since the Python comes from a Singularity container, I am not sure I can fix it locally.
Log file(s)
Output of demo_small.sh:
N E X T F L O W ~ version 22.08.0-edge
Launching `./main.nf` [thirsty_morse] DSL2 - revision: 5c93cdfd3b
===================================
EASI-FISH ANALYSIS PIPELINE
===================================
Pipeline parameters
-------------------
workDir : /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work
data_manifest : demo_small
shared_work_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo
segmentation_model_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs/model/starfinity
data_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs
output_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs
publish_dir :
acq_names : [LHA3_R3_small, LHA3_R5_small]
ref_acq : LHA3_R3_small
steps_to_skip : []
executor > local (92)
[83/486fb0] process > download (1) [100%] 1 of 1 ✔
[36/32e6b9] process > stitching:prepare_stitching_data (1) [100%] 2 of 2 ✔
[56/7ee694] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[e9/5f1cb5] process > stitching:stitch:spark_cluster:spark_master (1) [ 50%] 1 of 2
[f9/0742a6] process > stitching:stitch:spark_cluster:wait_for_master (2) [100%] 2 of 2 ✔
[e0/6e692a] process > stitching:stitch:spark_cluster:spark_worker (1) [ 50%] 1 of 2
[d6/04da04] process > stitching:stitch:spark_cluster:wait_for_worker (1) [100%] 2 of 2 ✔
[d2/38475b] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2) [100%] 2 of 2 ✔
[7c/b4a4d4] process > stitching:stitch:run_czi2n5:spark_start_app (2) [100%] 2 of 2 ✔
[59/45baec] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[03/fc9ed7] process > stitching:stitch:run_retile:spark_start_app (2) [100%] 2 of 2 ✔
[de/ee2737] process > stitching:stitch:run_stitching:spark_start_app (2) [100%] 2 of 2 ✔
[75/05d6ee] process > stitching:stitch:run_fuse:spark_start_app (1) [ 50%] 1 of 2
[01/36daf7] process > stitching:stitch:terminate_stitching (1) [100%] 1 of 1
[ab/4e8cda] process > spot_extraction:airlocalize:cut_tiles (1) [100%] 1 of 1
[0d/46aec0] process > spot_extraction:airlocalize:run_airlocalize (16) [ 0%] 0 of 64
[- ] process > spot_extraction:airlocalize:merge_points -
[- ] process > segmentation:predict -
executor > local (92)
[83/486fb0] process > download (1) [100%] 1 of 1 ✔
[36/32e6b9] process > stitching:prepare_stitching_data (1) [100%] 2 of 2 ✔
[56/7ee694] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[72/9bd52e] process > stitching:stitch:spark_cluster:spark_master (2) [100%] 1 of 1
[f9/0742a6] process > stitching:stitch:spark_cluster:wait_for_master (2) [100%] 2 of 2 ✔
[ab/54cebf] process > stitching:stitch:spark_cluster:spark_worker (2) [100%] 1 of 1
[d6/04da04] process > stitching:stitch:spark_cluster:wait_for_worker (1) [100%] 2 of 2 ✔
[d2/38475b] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2) [100%] 2 of 2 ✔
[7c/b4a4d4] process > stitching:stitch:run_czi2n5:spark_start_app (2) [100%] 2 of 2 ✔
[59/45baec] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[03/fc9ed7] process > stitching:stitch:run_retile:spark_start_app (2) [100%] 2 of 2 ✔
[de/ee2737] process > stitching:stitch:run_stitching:spark_start_app (2) [100%] 2 of 2 ✔
[81/680763] process > stitching:stitch:run_fuse:spark_start_app (2) [100%] 1 of 1
[01/36daf7] process > stitching:stitch:terminate_stitching (1) [100%] 1 of 1
[ab/4e8cda] process > spot_extraction:airlocalize:cut_tiles (1) [ 50%] 1 of 2
[0b/26c25f] process > spot_extraction:airlocalize:run_airlocalize (15) [ 11%] 7 of 60, failed: 7
[- ] process > spot_extraction:airlocalize:merge_points -
[- ] process > segmentation:predict [ 0%] 0 of 1
[- ] process > registration:cut_tiles [ 0%] 0 of 1
[- ] process > registration:fixed_coarse_spots [ 0%] 0 of 1
[- ] process > registration:moving_coarse_spots [ 0%] 0 of 1
[- ] process > registration:coarse_ransac -
[- ] process > registration:apply_transform_at_aff_scale -
[- ] process > registration:apply_transform_at_def_scale -
[- ] process > registration:fixed_spots -
[- ] process > registration:moving_spots -
[- ] process > registration:ransac_for_tile -
[- ] process > registration:interpolate_affines -
[- ] process > registration:deform -
[- ] process > registration:stitch -
[- ] process > registration:final_transform -
[20/b5b386] process > collect_merge_points:collect_merged_points_files (1) [100%] 1 of 1 ✔
[- ] process > warp_spots:apply_transform -
[- ] process > measure_intensities -
[- ] process > assign_spots -
Error executing process > 'spot_extraction:airlocalize:run_airlocalize (50)'
Caused by:
Process `spot_extraction:airlocalize:run_airlocalize (50)` terminated with an error exit status (139)
Command executed:
export SCRATCH_DIR=$PROCESS_DIR
echo "SCRATCH_DIR: $SCRATCH_DIR"
echo "/app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt"
/app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt
Command exit status:
139
Command output:
SCRATCH_DIR: /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/3a/fce83861fc8fad0cc0c1210f89e27d
/app/airlocalize/airlocalize.sh /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53/coords.txt /app/airlocalize/params/air_localize_default_params.txt /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs/LHA3_R5_small/spots/tiles/53 _c0.txt
Creating MCR_CACHE_ROOT mcr_cache_209456
Running AirLocalize
Cleaned up temporary files at mcr_cache_209456
Command error:
File "/miniconda/lib/python3.6/site-packages/numpy/core/__init__.py", line 22, in <module>
from . import multiarray
File "/miniconda/lib/python3.6/site-packages/numpy/core/multiarray.py", line 12, in <module>
from . import overrides
File "/miniconda/lib/python3.6/site-packages/numpy/core/overrides.py", line 7, in <module>
from numpy.core._multiarray_umath import (
ImportError: PyCapsule_Import could not import module "datetime"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/airlocalize/scripts/air_localize_mcr.py", line 2, in <module>
import zarr
File "/miniconda/lib/python3.6/site-packages/zarr/__init__.py", line 2, in <module>
from zarr.codecs import *
File "/miniconda/lib/python3.6/site-packages/zarr/codecs.py", line 2, in <module>
from numcodecs import *
File "/miniconda/lib/python3.6/site-packages/numcodecs/__init__.py", line 27, in <module>
from numcodecs.zlib import Zlib
File "/miniconda/lib/python3.6/site-packages/numcodecs/zlib.py", line 5, in <module>
from .compat import ndarray_copy, ensure_contiguous_ndarray
File "/miniconda/lib/python3.6/site-packages/numcodecs/compat.py", line 7, in <module>
import numpy as np
File "/miniconda/lib/python3.6/site-packages/numpy/__init__.py", line 140, in <module>
from . import core
File "/miniconda/lib/python3.6/site-packages/numpy/core/__init__.py", line 48, in <module>
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.6 from "/miniconda/bin/python"
* The NumPy version is: "1.19.4"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: PyCapsule_Import could not import module "datetime"
/app/airlocalize/airlocalize.sh: line 22: 209851 Segmentation fault python /app/airlocalize/scripts/air_localize_mcr.py $*
Work dir:
/local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/3a/fce83861fc8fad0cc0c1210f89e27d
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
The nextflow.log is attached below.
Environment
- EASI-FISH Pipeline version: 2.4.0? (freshly cloned from GitHub, and `git pull` says it's up-to-date :)
- Nextflow version: 22.08.0-edge
- Container runtime: Singularity
- Platform: Local
- Operating system: Linux - RHEL 7
Additional context
As additional context: I originally wanted to run some of my own images, but that failed, which is why I wanted to make sure I could run all the examples first. My own data fails at the stitching step, where the Spark workers give an out-of-memory error. I was hoping to reproduce this with the bigger examples, but to my surprise I got stuck at a different step :)
Thanks for reporting this. That is a very odd error, since the Singularity container is supposed to isolate the code and dependencies to prevent exactly this type of issue. I've never seen anything like this before, despite having run the demos on many different systems. It also doesn't make sense that it would only happen on the small data and not the tiny data, since the code is the same in either case.
Until we can reproduce this, there's not much we can do, but we'll keep an eye on it.
In the meantime, can you try using RS-FISH for spot detection instead?
https://janeliascicomp.github.io/multifish/modules/RS-FISH.html
Thank you for your reply @krokicki. Indeed, it puzzles me too; I thought maybe the container was updated for tiny but not for small, but I guess that's not the case. I wonder whether a home folder or environment variable might be causing this. If there is a way to pass parameters through, I can try running the containers with `--no-home` and `--cleanenv`.
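For example, something like this is what I have in mind. `--cleanenv` and `--no-home` are standard Singularity flags; the image path is just a placeholder, since I don't know exactly which image the pipeline pulls for AirLocalize:

```shell
#!/usr/bin/env bash
# Sketch: wrap the container call so the host environment and home
# directory cannot leak in. IMAGE is a placeholder path.
IMAGE="${IMAGE:-/path/to/airlocalize.sif}"

build_cmd() {
    # --cleanenv drops host environment variables;
    # --no-home skips the default $HOME bind mount
    echo "singularity exec --cleanenv --no-home $IMAGE $*"
}

build_cmd python -c 'import numpy'
```

If a stray `PYTHONPATH` or a `~/.local` NumPy is shadowing the container's packages, running this way should make the import succeed.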
I have also tried RS-FISH, which gave me a different kind of error: the Spark worker ran out of memory. I see the Java `-Xmx` appears to be set to 1 GB, and playing with the worker memory and driver memory parameters didn't help. I have attached the log files and pasted the stdout below. Thank you once again for your assistance :)
examples/demo_small.sh ../../new_demo/ --use_rsfish --rsfish_min 0 --rsfish_max 4096 --rsfish_anisotropy 0.7 --rsfish_sigma 1.5
N E X T F L O W ~ version 22.08.0-edge
Launching `./main.nf` [tender_ekeblad] DSL2 - revision: 5c93cdfd3b
===================================
EASI-FISH ANALYSIS PIPELINE
===================================
Pipeline parameters
-------------------
workDir : /local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work
data_manifest : demo_small
shared_work_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo
segmentation_model_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs/model/starfinity
data_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/inputs
output_dir : /local1/scratch/erkin/projects/vatsi_multifish/new_demo/outputs
publish_dir :
acq_names : [LHA3_R3_small, LHA3_R5_small]
ref_acq : LHA3_R3_small
steps_to_skip : []
executor > local (79)
[82/caa9c7] process > download (1) [100%] 1 of 1 ✔
[73/834616] process > stitching:prepare_stitching_data (2) [100%] 2 of 2 ✔
[80/814fb3] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1) [100%] 2 of 2 ✔
[41/5d599b] process > stitching:stitch:spark_cluster:spark_master (2) [100%] 2 of 2 ✔
[13/6045e7] process > stitching:stitch:spark_cluster:wait_for_master (2) [100%] 2 of 2 ✔
[34/4c9778] process > stitching:stitch:spark_cluster:spark_worker (2) [100%] 2 of 2 ✔
[d3/b431eb] process > stitching:stitch:spark_cluster:wait_for_worker (2) [100%] 2 of 2 ✔
[fe/c77394] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2) [100%] 2 of 2 ✔
[9a/b8ef1f] process > stitching:stitch:run_czi2n5:spark_start_app (1) [100%] 2 of 2 ✔
[0e/c22d54] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[7d/3551c8] process > stitching:stitch:run_retile:spark_start_app (2) [100%] 2 of 2 ✔
[c0/cb34b3] process > stitching:stitch:run_stitching:spark_start_app (2) [100%] 2 of 2 ✔
[c5/e585a9] process > stitching:stitch:run_fuse:spark_start_app (2) [100%] 2 of 2 ✔
[ec/9b4340] process > stitching:stitch:terminate_stitching (2) [100%] 2 of 2 ✔
[5e/00a3bd] process > spot_extraction:rsfish:prepare_spots_dirs (2) [100%] 2 of 2 ✔
[9a/36a9c7] process > spot_extraction:rsfish:spark_cluster:prepare_spark_work_dir [100%] 1 of 1 ✔
[9e/d1f251] process > spot_extraction:rsfish:spark_cluster:spark_master [ 0%] 0 of 1
[5e/aedb70] process > spot_extraction:rsfish:spark_cluster:wait_for_master [100%] 1 of 1 ✔
[e2/55fb98] process > spot_extraction:rsfish:spark_cluster:spark_worker (3) [ 0%] 0 of 6
[e2/85b243] process > spot_extraction:rsfish:spark_cluster:wait_for_worker (2) [100%] 6 of 6 ✔
[d3/1ff87e] process > spot_extraction:rsfish:run_rsfish:spark_start_app (1) [ 0%] 0 of 2
[- ] process > spot_extraction:rsfish:terminate_rsfish -
[- ] process > spot_extraction:rsfish:postprocess_spots -
[30/8a2d61] process > segmentation:predict (1) [ 0%] 0 of 1
[7b/cfeb5c] process > registration:cut_tiles (1) [100%] 1 of 1 ✔
[26/eec29b] process > registration:fixed_coarse_spots (1) [100%] 1 of 1 ✔
[8c/8dff84] process > registration:moving_coarse_spots (1) [100%] 1 of 1 ✔
[d3/cc2e72] process > registration:coarse_ransac (1) [100%] 1 of 1 ✔
[8f/f5fec1] process > registration:apply_transform_at_aff_scale (1) [100%] 1 of 1 ✔
[7d/3da1c8] process > registration:apply_transform_at_def_scale (1) [100%] 1 of 1 ✔
[7f/6e98e5] process > registration:fixed_spots (5) [100%] 6 of 6 ✔
[d6/37c18e] process > registration:moving_spots (5) [100%] 6 of 6 ✔
[3f/b7faf0] process > registration:ransac_for_tile (6) [100%] 6 of 6 ✔
[48/f77ee5] process > registration:interpolate_affines (1) [100%] 1 of 1 ✔
[0e/f2d476] process > registration:deform (6) [ 0%] 0 of 6
[- ] process > registration:stitch -
[- ] process > registration:final_transform -
[85/f9db2b] process > collect_merge_points:collect_merged_points_files (1) [100%] 1 of 1 ✔
[- ] process > warp_spots:apply_transform -
[- ] process > measure_intensities -
[- ] process > assign_spots -
Error executing process > 'spot_extraction:rsfish:spark_cluster:spark_worker (4)'
Caused by:
executor > local (79)
[82/caa9c7] process > download (1) [100%] 1 of 1 ✔
[73/834616] process > stitching:prepare_stitching_data (2) [100%] 2 of 2 ✔
[80/814fb3] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1) [100%] 2 of 2 ✔
[41/5d599b] process > stitching:stitch:spark_cluster:spark_master (2) [100%] 2 of 2 ✔
[13/6045e7] process > stitching:stitch:spark_cluster:wait_for_master (2) [100%] 2 of 2 ✔
[34/4c9778] process > stitching:stitch:spark_cluster:spark_worker (2) [100%] 2 of 2 ✔
[d3/b431eb] process > stitching:stitch:spark_cluster:wait_for_worker (2) [100%] 2 of 2 ✔
[fe/c77394] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (2) [100%] 2 of 2 ✔
[9a/b8ef1f] process > stitching:stitch:run_czi2n5:spark_start_app (1) [100%] 2 of 2 ✔
[0e/c22d54] process > stitching:stitch:run_flatfield_correction:spark_start_app (1) [100%] 2 of 2 ✔
[7d/3551c8] process > stitching:stitch:run_retile:spark_start_app (2) [100%] 2 of 2 ✔
[c0/cb34b3] process > stitching:stitch:run_stitching:spark_start_app (2) [100%] 2 of 2 ✔
[c5/e585a9] process > stitching:stitch:run_fuse:spark_start_app (2) [100%] 2 of 2 ✔
[ec/9b4340] process > stitching:stitch:terminate_stitching (2) [100%] 2 of 2 ✔
[5e/00a3bd] process > spot_extraction:rsfish:prepare_spots_dirs (2) [100%] 2 of 2 ✔
[9a/36a9c7] process > spot_extraction:rsfish:spark_cluster:prepare_spark_work_dir [100%] 1 of 1 ✔
[- ] process > spot_extraction:rsfish:spark_cluster:spark_master -
[5e/aedb70] process > spot_extraction:rsfish:spark_cluster:wait_for_master [100%] 1 of 1 ✔
[50/d57fd4] process > spot_extraction:rsfish:spark_cluster:spark_worker (1) [100%] 3 of 3, failed: 3
[e2/85b243] process > spot_extraction:rsfish:spark_cluster:wait_for_worker (2) [100%] 6 of 6 ✔
[- ] process > spot_extraction:rsfish:run_rsfish:spark_start_app (1) -
[- ] process > spot_extraction:rsfish:terminate_rsfish -
[- ] process > spot_extraction:rsfish:postprocess_spots -
[- ] process > segmentation:predict (1) -
[7b/cfeb5c] process > registration:cut_tiles (1) [100%] 1 of 1 ✔
[26/eec29b] process > registration:fixed_coarse_spots (1) [100%] 1 of 1 ✔
[8c/8dff84] process > registration:moving_coarse_spots (1) [100%] 1 of 1 ✔
[d3/cc2e72] process > registration:coarse_ransac (1) [100%] 1 of 1 ✔
[8f/f5fec1] process > registration:apply_transform_at_aff_scale (1) [100%] 1 of 1 ✔
[7d/3da1c8] process > registration:apply_transform_at_def_scale (1) [100%] 1 of 1 ✔
[7f/6e98e5] process > registration:fixed_spots (5) [100%] 6 of 6 ✔
[d6/37c18e] process > registration:moving_spots (5) [100%] 6 of 6 ✔
[3f/b7faf0] process > registration:ransac_for_tile (6) [100%] 6 of 6 ✔
[48/f77ee5] process > registration:interpolate_affines (1) [100%] 1 of 1 ✔
[- ] process > registration:deform (6) -
[- ] process > registration:stitch -
[- ] process > registration:final_transform -
[85/f9db2b] process > collect_merge_points:collect_merged_points_files (1) [100%] 1 of 1 ✔
[- ] process > warp_spots:apply_transform -
[- ] process > measure_intensities -
[- ] process > assign_spots -
Error executing process > 'spot_extraction:rsfish:spark_cluster:spark_worker (4)'
Caused by:
Process `spot_extraction:rsfish:spark_cluster:spark_worker (4)` terminated with an error exit status (1)
Command executed:
echo "Starting spark worker 4 - logging to /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log"
SESSION_FILE="/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/.sessionId"
echo "Checking for $SESSION_FILE"
SLEEP_SECS=10
MAX_WAIT_SECS=7200
SECONDS=0
while ! test -e "$SESSION_FILE"; do
sleep ${SLEEP_SECS}
if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
echo "Waiting for $SESSION_FILE"
SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
else
echo "-------------------------------------------------------------------------------"
echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
echo "-------------------------------------------------------------------------------"
exit 1
fi
done
if ! grep -F -x -q "cccc4a79-0287-40a0-ac0e-0a23b5686344" $SESSION_FILE
then
echo "------------------------------------------------------------------------------"
echo "ERROR: session id in $SESSION_FILE does not match current session "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
echo "and that you are not running multiple pipelines with the same --spark_work_dir"
echo "------------------------------------------------------------------------------"
exit 1
fi
rm -f /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log || true
export SPARK_ENV_LOADED=
export SPARK_HOME=/spark
export PYSPARK_PYTHONPATH_SET=
export PYTHONPATH="/spark/python"
export SPARK_LOG_DIR="/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752"
export SPARK_WORKER_OPTS=" -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30 -Dspark.worker.cleanup.appDataTtl=1 -Dspark.port.maxRetries=64"
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
echo "Use Spark IP: $SPARK_LOCAL_IP"
echo " /spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.1:7077 -c 8 -m 120G -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752 -h $SPARK_LOCAL_IP --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf "
/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.1:7077 -c 8 -m 120G -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752 -h $SPARK_LOCAL_IP --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf &> /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log &
spid=$!
trap "kill -9 $spid" EXIT
while true; do
if ! kill -0 $spid >/dev/null 2>&1; then
echo "Process $spid died"
exit 1
fi
if [[ -e "/local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/terminate-rsfish" ]] ; then
break
fi
sleep 1
done
Command exit status:
1
Command output:
Starting spark worker 4 - logging to /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/sparkworker-4.log
Checking for /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/.sessionId
Use Spark IP: 172.17.0.1
/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.1:7077 -c 8 -m 120G -d /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752 -h 172.17.0.1 --properties-file /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752/spark-defaults.conf
Process 63713 died
Command error:
INFO: Could not find any nv files on this host!
.command.sh: line 1: kill: (63713) - No such process
Work dir:
/local1/scratch/erkin/projects/vatsi_multifish/bin/multifish/work/01/ceb8676a2d508b5c1dd860c7b29e54
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Your RS-FISH Spark cluster size is shown in the nextflow.log:
Aug-11 13:22:43.061 [Actor Thread 126] DEBUG nextflow.Nextflow - Spark cluster started:
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow - Spark work directory: /local1/scratch/erkin/projects/vatsi_multifish/new_demo/spark/13ee615f-a5eb-4048-a3bc-7bafc7643752
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow - Number of workers: 6
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow - Cores per worker: 8
Aug-11 13:22:43.062 [Actor Thread 126] DEBUG nextflow.Nextflow - GB per worker core: 15
This is a large cluster: 6 workers * 120 GB = 720 GB total memory. According to the log, you were able to run stitching on this data with 1 * 4 * 12 = 48 GB total memory, so your cluster parameters for RS-FISH should be more than large enough. Usually RS-FISH requires less memory than stitching, not more.
Going back to the error: `java.lang.OutOfMemoryError: unable to create new native thread` looks like it should be memory-related, but the message is misleading. It's actually a system resource issue: a shortage of native threads.
It looks like you are running the pipeline on a single system, not a cluster, correct? Are you running anything else that might be using a lot of native threads on that system? You should also check the system limits to make sure there is a decent allowance on user processes, e.g. http://www.mastertheboss.com/jbossas/monitoring/how-to-solve-javalangoutofmemoryerror-unable-to-create-new-native-thread/
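For example, on the worker host you could check something like this (assuming a standard Linux with procps installed):

```shell
# Quick checks for the native-thread ceiling on a shared Linux host.
# On Linux, threads count against "max user processes", so a busy
# multi-user machine can hit this limit long before RAM runs out.
ulimit -u                          # soft per-user process/thread limit
ulimit -Hu                         # hard limit
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling
# Threads currently in use by your user (needs procps ps):
ps -L -u "$(id -un)" --no-headers 2>/dev/null | wc -l
```

If the thread count of all users combined is anywhere near `ulimit -u` or `threads-max`, that would explain the error.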
In terms of parameters, I would suggest reducing the number of workers and cores per worker, and increasing only the GB per worker core, so that you have at least 48 GB total memory in the cluster. I usually start with the same parameters as the stitching cluster and reduce the size from there.
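For reference, the sizing arithmetic is just a product of the three cluster parameters (a trivial helper, not part of the pipeline):

```shell
# total_gb = workers * cores_per_worker * gb_per_core
cluster_gb() { echo $(( $1 * $2 * $3 )); }

cluster_gb 6 8 15   # the RS-FISH cluster in your log: 720 GB
cluster_gb 1 4 12   # the stitching cluster that worked: 48 GB
```

So, e.g., 1 worker with 4 cores at 12 GB/core reproduces the 48 GB configuration that got you through stitching.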
Thank you once again for the reply and the link! Indeed, I am running this on a single system, a RHEL 7 server. I hadn't thought of a thread limit at all; while we have enough cores and memory, many other people use the server, so it is possible we have hit a thread limit. I will experiment with this and give an update.