JaneliaSciComp/multifish

Spark master process dies

Closed this issue · 9 comments

Description of the problem

I'm running into the following error when trying to run the pipeline on the tiny demo dataset, and unfortunately I cannot figure out why the process always stops. The command I execute to start it is:

./examples/demo_tiny.sh /home/Testuser/Desktop/tm1101101/easifish-example-data

Environment

  • EASI-FISH Pipeline version: latest
  • Nextflow version: 21.04.3
  • Container runtime: Singularity
  • Platform: Local
  • Operating system: Ubuntu 20.04.3 LTS

Log file


~/Desktop/tm1101101/multifish$ ./examples/demo_tiny.sh /home/Testuser/Desktop/tm1101101/easifish-example-data


N E X T F L O W  ~  version 21.04.3
Launching `./main.nf` [silly_bartik] - revision: 4a0611e4e5

===================================
EASI-FISH ANALYSIS PIPELINE
===================================

Pipeline parameters
-------------------
workDir                : /home/Testuser/Desktop/tm1101101/multifish/work
data_manifest          : demo_tiny
shared_work_dir        : /home/Testuser/Desktop/tm1101101/easifish-example-data
segmentation_model_dir : /home/Testuser/Desktop/tm1101101/easifish-example-data/inputs/model/starfinity
data_dir               : /home/Testuser/Desktop/tm1101101/easifish-example-data/inputs
output_dir             : /home/Testuser/Desktop/tm1101101/easifish-example-data/outputs
publish_dir            : 
acq_names              : [LHA3_R3_tiny, LHA3_R5_tiny]
ref_acq                : LHA3_R3_tiny
steps_to_skip          : []

executor >  local (12)
[90/6c5835] process > download (1)                                              [100%] 1 of 1 ✔
[a4/a995c2] process > stitching:prepare_stitching_data (2)                      [100%] 2 of 2 ✔
[17/035fd4] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[1f/f22b16] process > stitching:stitch:spark_cluster:spark_master (2)           [  0%] 0 of 2
[a7/d05965] process > stitching:stitch:spark_cluster:wait_for_master (2)        [ 50%] 1 of 2
[d2/ddbbe6] process > stitching:stitch:spark_cluster:spark_worker (1)           [  0%] 0 of 1
[e4/81db76] process > stitching:stitch:spark_cluster:wait_for_worker (1)        [  0%] 0 of 1
[-        ] process > stitching:stitch:run_parse_czi_tiles:spark_start_app      -
[-        ] process > stitching:stitch:run_czi2n5:spark_start_app               -
[-        ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[-        ] process > stitching:stitch:run_retile:spark_start_app               -
[-        ] process > stitching:stitch:run_rename_cmds                          -
[-        ] process > stitching:stitch:run_stitching:spark_start_app            -
[-        ] process > stitching:stitch:run_fuse:spark_start_app                 -
[-        ] process > stitching:stitch:terminate_stitching                      -
[-        ] process > spot_extraction:cut_tiles                                 -
[-        ] process > spot_extraction:airlocalize                               -
[-        ] process > spot_extraction:merge_points                              -
[-        ] process > segmentation:predict                                      -
[-        ] process > registration:cut_tiles                                    -
[-        ] process > registration:fixed_coarse_spots                           -
[-        ] process > registration:moving_coarse_spots                          -
[-        ] process > registration:coarse_ransac                                -
[-        ] process > registration:apply_transform_at_aff_scale                 -
[-        ] process > registration:apply_transform_at_def_scale                 -
[-        ] process > registration:fixed_spots                                  -
[-        ] process > registration:moving_spots                                 -
[-        ] process > registration:ransac_for_tile                              -
[-        ] process > registration:interpolate_affines                          -
[-        ] process > registration:deform                                       -
[-        ] process > registration:stitch                                       -
[-        ] process > registration:final_transform                              -
[37/c1c2d0] process > collect_merge_points:collect_merged_points_files (1)      [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                -
[-        ] process > measure_intensities                                       -
[-        ] process > assign_spots                                              -
Error executing process > 'stitching:stitch:spark_cluster:spark_master (2)'

Caused by:
  Process `stitching:stitch:spark_cluster:spark_master (2)` terminated with an error exit status (1)

Command executed:

  echo "Starting spark master - logging to /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log"
      
      SESSION_FILE="/home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/.sessionId"   
      echo "Checking for $SESSION_FILE"
      SLEEP_SECS=2
      MAX_WAIT_SECS=3600
      SECONDS=0
  
      while ! test -e "$SESSION_FILE"; do
          sleep ${SLEEP_SECS}
          if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
              echo "Waiting for $SESSION_FILE"
              SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
          else
              echo "-------------------------------------------------------------------------------"
              echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
              echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
              echo "-------------------------------------------------------------------------------"
              exit 1
          fi
      done
  
      if ! grep -F -x -q "bf0d9829-0413-4012-805b-d255a7b0290c" $SESSION_FILE
      then
          echo "------------------------------------------------------------------------------"
          echo "ERROR: session id in $SESSION_FILE does not match current session            "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
          echo "and that you are not running multiple pipelines with the same --spark_work_dir"
          echo "------------------------------------------------------------------------------"
          exit 1
      fi
      
  
executor >  local (12)
[90/6c5835] process > download (1)                                              [100%] 1 of 1 ✔
[a4/a995c2] process > stitching:prepare_stitching_data (2)                      [100%] 2 of 2 ✔
[17/035fd4] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[d8/1f674b] process > stitching:stitch:spark_cluster:spark_master (1)           [100%] 1 of 1, failed: 1
[3f/0a9865] process > stitching:stitch:spark_cluster:wait_for_master (1)        [100%] 1 of 1
[-        ] process > stitching:stitch:spark_cluster:spark_worker (1)           -
[-        ] process > stitching:stitch:spark_cluster:wait_for_worker (1)        -
[-        ] process > stitching:stitch:run_parse_czi_tiles:spark_start_app      -
[-        ] process > stitching:stitch:run_czi2n5:spark_start_app               -
[-        ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[-        ] process > stitching:stitch:run_retile:spark_start_app               -
[-        ] process > stitching:stitch:run_rename_cmds                          -
[-        ] process > stitching:stitch:run_stitching:spark_start_app            -
[-        ] process > stitching:stitch:run_fuse:spark_start_app                 -
[-        ] process > stitching:stitch:terminate_stitching                      -
[-        ] process > spot_extraction:cut_tiles                                 -
[-        ] process > spot_extraction:airlocalize                               -
[-        ] process > spot_extraction:merge_points                              -
[-        ] process > segmentation:predict                                      -
[-        ] process > registration:cut_tiles                                    -
[-        ] process > registration:fixed_coarse_spots                           -
[-        ] process > registration:moving_coarse_spots                          -
[-        ] process > registration:coarse_ransac                                -
[-        ] process > registration:apply_transform_at_aff_scale                 -
[-        ] process > registration:apply_transform_at_def_scale                 -
[-        ] process > registration:fixed_spots                                  -
[-        ] process > registration:moving_spots                                 -
[-        ] process > registration:ransac_for_tile                              -
[-        ] process > registration:interpolate_affines                          -
[-        ] process > registration:deform                                       -
[-        ] process > registration:stitch                                       -
[-        ] process > registration:final_transform                              -
[37/c1c2d0] process > collect_merge_points:collect_merged_points_files (1)      [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                -
[-        ] process > measure_intensities                                       -
[-        ] process > assign_spots                                              -
Error executing process > 'stitching:stitch:spark_cluster:spark_master (2)'

Caused by:
  Process `stitching:stitch:spark_cluster:spark_master (2)` terminated with an error exit status (1)

Command executed:

  echo "Starting spark master - logging to /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log"
      
      SESSION_FILE="/home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/.sessionId"   
      echo "Checking for $SESSION_FILE"
      SLEEP_SECS=2
      MAX_WAIT_SECS=3600
      SECONDS=0
  
      while ! test -e "$SESSION_FILE"; do
          sleep ${SLEEP_SECS}
          if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
              echo "Waiting for $SESSION_FILE"
              SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
          else
              echo "-------------------------------------------------------------------------------"
              echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
              echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
              echo "-------------------------------------------------------------------------------"
              exit 1
          fi
      done
  
      if ! grep -F -x -q "bf0d9829-0413-4012-805b-d255a7b0290c" $SESSION_FILE
      then
          echo "------------------------------------------------------------------------------"
          echo "ERROR: session id in $SESSION_FILE does not match current session            "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
          echo "and that you are not running multiple pipelines with the same --spark_work_dir"
          echo "------------------------------------------------------------------------------"
          exit 1
      fi
      
  
      rm -f /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log || true
  
      
  mkdir -p /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny
  cat <<EOF > /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/spark-defaults.conf
  spark.rpc.askTimeout=300s
  spark.storage.blockManagerHeartBeatMs=30000
  spark.rpc.retry.wait=30s
  spark.kryoserializer.buffer.max=1024m
  spark.core.connection.ack.wait.timeout=600s
  spark.driver.maxResultSize=0
  spark.worker.cleanup.enabled=true
  spark.local.dir=/tmp
  EOF
  
      
      export SPARK_ENV_LOADED=
      export SPARK_HOME=/spark
      export PYSPARK_PYTHONPATH_SET=
      export PYTHONPATH="/spark/python"
      export SPARK_LOG_DIR="/home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny"
      
      . "/spark/sbin/spark-config.sh"
      . "/spark/bin/load-spark-env.sh"
      
      
      SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
      echo "Use Spark IP: $SPARK_LOCAL_IP"
      
  
      echo "    /spark/bin/spark-class org.apache.spark.deploy.master.Master     -h $SPARK_LOCAL_IP     --properties-file /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/spark-defaults.conf     "
  
      /spark/bin/spark-class org.apache.spark.deploy.master.Master     -h $SPARK_LOCAL_IP     --properties-file /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/spark-defaults.conf     &> /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log &
      spid=$!
  
      
      trap "kill -9 $spid" EXIT
  
      while true; do
  
          if ! kill -0 $spid >/dev/null 2>&1; then
              echo "Process $spid died"
              exit 1
          fi
  
          if [[ -e "/home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/terminate-stitching" ]] ; then
              break
          fi
  
          sleep 1
      done

Command exit status:
  1

Command output:
  Starting spark master - logging to /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log
  Checking for /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/.sessionId
  Use Spark IP: 127.0.1.1
      /spark/bin/spark-class org.apache.spark.deploy.master.Master     -h 127.0.1.1     --properties-file /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/spark-defaults.conf     
  Process 1643231 died

Command error:
  .command.sh: line 1: kill: (1643231) - No such process

Work dir:
  /home/Testuser/Desktop/tm1101101/multifish/work/1f/f22b16b5e0fa545976e6b9cdbd32ae

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

This usually happens when the system cannot tell whether the Spark master service was started. Please check or post this file: '/home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log'
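For reference, a quick way to gather those diagnostics yourself (using the paths reported in the error above) might look like this:

  # Stdout/stderr captured by Nextflow for the failed task
  cd /home/Testuser/Desktop/tm1101101/multifish/work/1f/f22b16b5e0fa545976e6b9cdbd32ae
  cat .command.out .command.err

  # Spark master log written by the process
  cat /home/Testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R5_tiny/sparkmaster.log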

Attached is sparkmaster.log. If I read it correctly, the MasterUI cannot find a port it can use. Do you think this is the problem? I'm also wondering why it only tries 16 times, while in my multifish/external-modules/spark/lib/param_utils.nf the parameter max_connect_retries is set to 64. Could there be an issue here, or do I need to solve the problem some other way?

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/11/14 11:51:02 INFO Master: Started daemon with process name: 3194709@workstationtwo
21/11/14 11:51:02 INFO SignalUtils: Registered signal handler for TERM
21/11/14 11:51:02 INFO SignalUtils: Registered signal handler for HUP
21/11/14 11:51:02 INFO SignalUtils: Registered signal handler for INT
21/11/14 11:51:02 WARN Utils: Your hostname, workstationtwo resolves to a loopback address: 127.0.1.1; using 172.18.12.33 instead (on interface enp0s31f6)
21/11/14 11:51:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/11/14 11:51:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/14 11:51:03 INFO SecurityManager: Changing view acls to: testuser
21/11/14 11:51:03 INFO SecurityManager: Changing modify acls to: testuser
21/11/14 11:51:03 INFO SecurityManager: Changing view acls groups to: 
21/11/14 11:51:03 INFO SecurityManager: Changing modify acls groups to: 
21/11/14 11:51:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(testuser); groups with view permissions: Set(); users  with modify permissions: Set(testuser); groups with modify permissions: Set()
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7078. Attempting port 7079.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7079. Attempting port 7080.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7080. Attempting port 7081.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7081. Attempting port 7082.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7082. Attempting port 7083.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7083. Attempting port 7084.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7084. Attempting port 7085.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7085. Attempting port 7086.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7086. Attempting port 7087.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7087. Attempting port 7088.
21/11/14 11:51:03 WARN Utils: Service 'sparkMaster' could not bind on port 7088. Attempting port 7089.
21/11/14 11:51:03 INFO Utils: Successfully started service 'sparkMaster' on port 7089.
21/11/14 11:51:04 INFO Master: Starting Spark master at spark://127.0.1.1:7089
21/11/14 11:51:04 INFO Master: Running Spark version 3.0.1
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8081. Attempting port 8082.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8082. Attempting port 8083.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8083. Attempting port 8084.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8084. Attempting port 8085.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8085. Attempting port 8086.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8086. Attempting port 8087.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8087. Attempting port 8088.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8088. Attempting port 8089.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8089. Attempting port 8090.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8090. Attempting port 8091.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8091. Attempting port 8092.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8092. Attempting port 8093.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8093. Attempting port 8094.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8094. Attempting port 8095.
21/11/14 11:51:04 WARN Utils: Service 'MasterUI' could not bind on port 8095. Attempting port 8096.
21/11/14 11:51:04 ERROR MasterWebUI: Failed to bind MasterWebUI
java.net.BindException: Failed to bind to /0.0.0.0:8096: Service 'MasterUI' failed after 16 retries (starting from 8080)! Consider explicitly setting the appropriate port for the service 'MasterUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
	at org.sparkproject.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:346)
	at org.sparkproject.jetty.server.ServerConnector.open(ServerConnector.java:308)
	at org.sparkproject.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
	at org.sparkproject.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
	at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.apache.spark.ui.JettyUtils$.newConnector$1(JettyUtils.scala:301)
	at org.apache.spark.ui.JettyUtils$.httpConnect$1(JettyUtils.scala:332)
	at org.apache.spark.ui.JettyUtils$.$anonfun$startJettyServer$5(JettyUtils.scala:335)
	at org.apache.spark.ui.JettyUtils$.$anonfun$startJettyServer$5$adapted(JettyUtils.scala:335)
	at org.apache.spark.util.Utils$.$anonfun$startServiceOnPort$2(Utils.scala:2256)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2248)
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:336)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:146)
	at org.apache.spark.deploy.master.Master.onStart(Master.scala:145)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

It looks like spark.port.maxRetries is only used for the workers, not for the master, and in your case it's the master that fails. However, 16 retries (the default) should typically be enough. Are all of those ports, from 8080 to 8096, really in use on your machine? If not, maybe some firewall settings prevent you from opening those ports.
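One way to check (commands may vary by system; these assume iproute2 and procps are installed) is to list whatever is already listening on that port range and to look for leftover Spark processes:

  # Show listeners on ports 8080-8096
  ss -tlnp | grep -E ':80(8[0-9]|9[0-6])\b'

  # Look for Spark master/worker daemons left over from previous runs
  ps aux | grep org.apache.spark.deploy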

The problem was indeed that the ports were still blocked by previous runs! Killing the leftover Java processes solved this, thank you! However, I'm now running into the next problem, which I cannot make sense of: sparkworker-1.log does not exist and the process times out. This only happens for LHA3_R3_tiny; for LHA3_R5_tiny a sparkworker-1.log does exist. That file and both sparkmaster.log files do not show any errors. Any idea what is going on? Should I just manually create an empty sparkworker-1.log file?

N E X T F L O W  ~  version 21.04.3
Launching `./main.nf` [loving_bassi] - revision: 4a0611e4e5

===================================
EASI-FISH ANALYSIS PIPELINE
===================================

Pipeline parameters
-------------------
workDir                : /home/testuser/Desktop/tm1101101/multifish/work
data_manifest          : demo_tiny
shared_work_dir        : /home/testuser/Desktop/tm1101101/easifish-example-data
segmentation_model_dir : /home/testuser/Desktop/tm1101101/easifish-example-data/inputs/model/starfinity
data_dir               : /home/testuser/Desktop/tm1101101/easifish-example-data/inputs
output_dir             : /home/testuser/Desktop/tm1101101/easifish-example-data/outputs
publish_dir            : 
acq_names              : [LHA3_R3_tiny, LHA3_R5_tiny]
ref_acq                : LHA3_R3_tiny
steps_to_skip          : []

executor >  local (13)
[f1/67a4d3] process > download (1)                                              [100%] 1 of 1 ✔
[47/68893d] process > stitching:prepare_stitching_data (1)                      [100%] 2 of 2 ✔
[f9/5ed050] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[c1/2c20ae] process > stitching:stitch:spark_cluster:spark_master (2)           [  0%] 0 of 2
[5b/b8610e] process > stitching:stitch:spark_cluster:wait_for_master (2)        [100%] 2 of 2 ✔
[40/d3dba8] process > stitching:stitch:spark_cluster:spark_worker (1)           [  0%] 0 of 2
[86/cfc71e] process > stitching:stitch:spark_cluster:wait_for_worker (1)        [ 50%] 1 of 2
[-        ] process > stitching:stitch:run_parse_czi_tiles:spark_start_app      -
[-        ] process > stitching:stitch:run_czi2n5:spark_start_app               -
[-        ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[-        ] process > stitching:stitch:run_retile:spark_start_app               -
[-        ] process > stitching:stitch:run_rename_cmds                          -
[-        ] process > stitching:stitch:run_stitching:spark_start_app            -
[-        ] process > stitching:stitch:run_fuse:spark_start_app                 -
[-        ] process > stitching:stitch:terminate_stitching                      -
[-        ] process > spot_extraction:cut_tiles                                 -
[-        ] process > spot_extraction:airlocalize                               -
[-        ] process > spot_extraction:merge_points                              -
[-        ] process > segmentation:predict                                      -
[-        ] process > registration:cut_tiles                                    -
[-        ] process > registration:fixed_coarse_spots                           -
[-        ] process > registration:moving_coarse_spots                          -
[-        ] process > registration:coarse_ransac                                -
[-        ] process > registration:apply_transform_at_aff_scale                 -
[-        ] process > registration:apply_transform_at_def_scale                 -
[-        ] process > registration:fixed_spots                                  -
[-        ] process > registration:moving_spots                                 -
[-        ] process > registration:ransac_for_tile                              -
[-        ] process > registration:interpolate_affines                          -
[-        ] process > registration:deform                                       -
[-        ] process > registration:stitch                                       -
[-        ] process > registration:final_transform                              -
[1c/b8b163] process > collect_merge_points:collect_merged_points_files (1)      [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                -
[-        ] process > measure_intensities                                       -
[-        ] process > assign_spots                                              -
Error executing process > 'stitching:stitch:spark_cluster:wait_for_worker (2)'

Caused by:
  Process `stitching:stitch:spark_cluster:wait_for_worker (2)` terminated with an error exit status (1)

Command executed:

  SESSION_FILE="/home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/.sessionId"   
  echo "Checking for $SESSION_FILE"
executor >  local (13)
[f1/67a4d3] process > download (1)                                              [100%] 1 of 1 ✔
[47/68893d] process > stitching:prepare_stitching_data (1)                      [100%] 2 of 2 ✔
[f9/5ed050] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (2) [100%] 2 of 2 ✔
[-        ] process > stitching:stitch:spark_cluster:spark_master (2)           -
[5b/b8610e] process > stitching:stitch:spark_cluster:wait_for_master (2)        [100%] 2 of 2 ✔
[40/d3dba8] process > stitching:stitch:spark_cluster:spark_worker (1)           [  0%] 0 of 1
[ce/4f1a75] process > stitching:stitch:spark_cluster:wait_for_worker (2)        [100%] 2 of 2, failed: 1 ✘
[-        ] process > stitching:stitch:run_parse_czi_tiles:spark_start_app      [  0%] 0 of 1
[-        ] process > stitching:stitch:run_czi2n5:spark_start_app               -
[-        ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[-        ] process > stitching:stitch:run_retile:spark_start_app               -
[-        ] process > stitching:stitch:run_rename_cmds                          -
[-        ] process > stitching:stitch:run_stitching:spark_start_app            -
[-        ] process > stitching:stitch:run_fuse:spark_start_app                 -
[-        ] process > stitching:stitch:terminate_stitching                      -
[-        ] process > spot_extraction:cut_tiles                                 -
[-        ] process > spot_extraction:airlocalize                               -
[-        ] process > spot_extraction:merge_points                              -
[-        ] process > segmentation:predict                                      -
[-        ] process > registration:cut_tiles                                    -
[-        ] process > registration:fixed_coarse_spots                           -
[-        ] process > registration:moving_coarse_spots                          -
[-        ] process > registration:coarse_ransac                                -
[-        ] process > registration:apply_transform_at_aff_scale                 -
[-        ] process > registration:apply_transform_at_def_scale                 -
[-        ] process > registration:fixed_spots                                  -
[-        ] process > registration:moving_spots                                 -
[-        ] process > registration:ransac_for_tile                              -
[-        ] process > registration:interpolate_affines                          -
[-        ] process > registration:deform                                       -
[-        ] process > registration:stitch                                       -
[-        ] process > registration:final_transform                              -
[1c/b8b163] process > collect_merge_points:collect_merged_points_files (1)      [100%] 1 of 1 ✔
[-        ] process > warp_spots:apply_transform                                -
[-        ] process > measure_intensities                                       -
[-        ] process > assign_spots                                              -
Error executing process > 'stitching:stitch:spark_cluster:wait_for_worker (2)'

Caused by:
  Process `stitching:stitch:spark_cluster:wait_for_worker (2)` terminated with an error exit status (1)

Command executed:

  SESSION_FILE="/home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/.sessionId"   
  echo "Checking for $SESSION_FILE"
  SLEEP_SECS=2
  MAX_WAIT_SECS=3600
  SECONDS=0
  
  while ! test -e "$SESSION_FILE"; do
      sleep ${SLEEP_SECS}
      if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
          echo "Waiting for $SESSION_FILE"
          SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
      else
          echo "-------------------------------------------------------------------------------"
          echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
          echo "-------------------------------------------------------------------------------"
          exit 1
      fi
  done
  
  if ! grep -F -x -q "f1dac2ec-dac5-4f38-a4da-921781b9e9ea" $SESSION_FILE
  then
      echo "------------------------------------------------------------------------------"
      echo "ERROR: session id in $SESSION_FILE does not match current session            "
      echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
      echo "and that you are not running multiple pipelines with the same --spark_work_dir"
      echo "------------------------------------------------------------------------------"
      exit 1
  fi
  
  
  while true; do
  
      if [[ -e "/home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log" ]]; then
          found=`grep -o "\(Worker: Successfully registered with master spark://127.0.1.1:7078\)" /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log || true`
  
          if [[ ! -z ${found} ]]; then
              echo "${found}"
              break
          fi
      fi
  
      if [[ -e "/home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/terminate-stitching" ]]; then
          echo "Terminate file /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/terminate-stitching found"
          exit 1
      fi
  
      if (( ${SECONDS} > ${MAX_WAIT_SECS} )); then
          echo "Timed out after ${SECONDS} seconds while waiting for spark worker 1 for spark://127.0.1.1:7078 <- /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log"
          tail -25 /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log
          exit 2
      fi
  
      sleep ${SLEEP_SECS}
      SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
  
  done

Command exit status:
  1

Command output:
  Checking for /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/.sessionId
  Timed out after 3604 seconds while waiting for spark worker 1 for spark://127.0.1.1:7078 <- /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log

Command error:
  tail: cannot open '/home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny/sparkworker-1.log' for reading: No such file or directory

Work dir:
  /home/testuser/Desktop/tm1101101/multifish/work/ce/4f1a750c4f7c6f572d79d5948ef3d9

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

I don't know if it helps, but these are screenshots of the Master and Worker UI. It seems that the worker is not doing anything... I attached the worker's log file, but it does not seem to show an error. Do you have an idea where I could look to debug this? Thanks!

[screenshot: Spark Master UI]

[screenshot: Spark Worker UI]

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/11/22 14:21:27 INFO Worker: Started daemon with process name: 1819029@workstationtwo
21/11/22 14:21:27 INFO SignalUtils: Registered signal handler for TERM
21/11/22 14:21:27 INFO SignalUtils: Registered signal handler for HUP
21/11/22 14:21:27 INFO SignalUtils: Registered signal handler for INT
21/11/22 14:21:28 WARN Utils: Your hostname, workstationtwo resolves to a loopback address: 127.0.1.1; using 172.18.12.33 instead (on interface enp0s31f6)
21/11/22 14:21:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/11/22 14:21:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/22 14:21:28 INFO SecurityManager: Changing view acls to: testuser
21/11/22 14:21:28 INFO SecurityManager: Changing modify acls to: testuser
21/11/22 14:21:28 INFO SecurityManager: Changing view acls groups to: 
21/11/22 14:21:28 INFO SecurityManager: Changing modify acls groups to: 
21/11/22 14:21:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(testuser); groups with view permissions: Set(); users  with modify permissions: Set(testuser); groups with modify permissions: Set()
21/11/22 14:21:29 INFO Utils: Successfully started service 'sparkWorker' on port 36629.
21/11/22 14:21:29 INFO Worker: Starting Spark worker 127.0.1.1:36629 with 4 cores, 48.0 GiB RAM
21/11/22 14:21:29 INFO Worker: Running Spark version 3.0.1
21/11/22 14:21:29 INFO Worker: Spark home: /spark
21/11/22 14:21:29 INFO ResourceUtils: ==============================================================
21/11/22 14:21:29 INFO ResourceUtils: Resources for spark.worker:

21/11/22 14:21:29 INFO ResourceUtils: ==============================================================
21/11/22 14:21:29 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
21/11/22 14:21:29 INFO Utils: Successfully started service 'WorkerUI' on port 8082.
21/11/22 14:21:29 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://workstationtwo.abcd.ds:8082
21/11/22 14:21:29 INFO Worker: Connecting to master 127.0.1.1:7077...
21/11/22 14:21:29 INFO TransportClientFactory: Successfully created connection to /127.0.1.1:7077 after 51 ms (0 ms spent in bootstraps)
21/11/22 14:21:30 INFO Worker: Successfully registered with master spark://127.0.1.1:7077
21/11/22 14:21:30 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /home/testuser/Desktop/tm1101101/easifish-example-data/spark/LHA3_R3_tiny

Check sparkworker-1.log and sparkworker-2.log as well as sparkmaster.log. You should see the successful connection in all of the logs. In the log above you can see `Successfully registered with master spark://127.0.1.1:7077`; there should be a corresponding message in the master log and a similar message in the worker-2 log. Make sure that nothing else is using the ports used by Spark. From what you posted, it looks like only one worker failed. It is possible that the default settings do not match the memory and cores available on your machine, so the second worker cannot start because it does not have enough resources. I cannot see your exact command line, but it seems you just ran demo_tiny, so I can work it out from there (if that is not the case, please post your command line). Also, how many cores and how much memory do you have on your computer?
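For example, something along these lines (using the log paths from your run) should show the matching entries on both sides:

  SPARK_DIR=/home/testuser/Desktop/tm1101101/easifish-example-data/spark

  # Each worker should report a successful registration
  grep "Successfully registered with master" $SPARK_DIR/LHA3_R3_tiny/sparkworker-*.log $SPARK_DIR/LHA3_R5_tiny/sparkworker-*.log

  # The master should report each registered worker
  grep "Registering worker" $SPARK_DIR/LHA3_R3_tiny/sparkmaster.log $SPARK_DIR/LHA3_R5_tiny/sparkmaster.log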

Yes, I'm using demo_tiny; the command line output is the one from here:
#4 (comment)

In LHA3_R3_tiny the file sparkmaster.log says

21/11/22 14:21:29 INFO Master: Registering worker 127.0.1.1:36629 with 4 cores, 48.0 GiB RAM

In LHA3_R3_tiny the file sparkworker-1.log says

21/11/22 14:21:30 INFO Worker: Successfully registered with master spark://127.0.1.1:7077

There is no sparkworker-2.log in LHA3_R3_tiny.

For LHA3_R5_tiny only a sparkmaster.log exists, with no sparkworker-1.log or sparkworker-2.log. That sparkmaster.log did not register any worker.

The computer I'm working on has 40 cores and 64 GB RAM. To me it seems that the first worker starts but does not execute anything, and after one hour the process times out...

I think that explains it. You have enough cores to use the defaults, but you don't have enough memory. The demo_tiny defaults are set to use 12 GB per core, so with 4 cores each worker requests 48 GB, and the two workers together would require at least 96 GB. Try changing the "gb_per_core" setting in demo_tiny.json (currently set to 12). Use only 3 GB per core; with 4 GB you pretty much reach the limits of your machine. I think 3 GB might be enough for demo_tiny, but I don't know for sure since I have never tested it on a machine with only 64 GB RAM. Also pay attention: when you run locally, a failure leaves dangling Java processes behind which unfortunately you have to kill manually. I could not come up with a solution that leaves no dangling processes when the stitching process fails while using the local executor. If you run this on a cluster like LSF this is not an issue (and it is probably not an issue on SGE or Slurm either, but I have not tested it on those clusters).
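As a rough sketch (assuming demo_tiny.json is a flat JSON parameter file with a top-level gb_per_core key, and that jq is available; editing the file by hand works just as well):

  # Lower the per-core memory request for the Spark workers
  jq '.gb_per_core = 3' demo_tiny.json > demo_tiny.json.tmp && mv demo_tiny.json.tmp demo_tiny.json

  # After a failed local run, kill any dangling Spark daemons manually
  pkill -f org.apache.spark.deploy || true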

Setting gb_per_core to 3 solved the problem, thank you!