RSE-Sheffield/GPUComputing

Error running a job on ShARC


Issue from @groadabike

Gerardo is getting an error when submitting a job on ShARC with qsub:

queue.pl: Error submitting jobs to queue (return status was 32512)
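A side note on the number itself: assuming 32512 is a raw wait status that queue.pl passes on unchanged (an assumption about how it reports errors), the actual exit code is the high byte, 32512 / 256 = 127, which conventionally means "command not found", for example qsub not being on the PATH of the process doing the submitting:

# Decode a raw wait status: the exit code is in the high byte
echo $(( 32512 / 256 ))   # prints 127, conventionally "command not found"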

The job script is as follows:

#!/bin/bash
#$ -l gpu=1
#$ -P rse
#$ -q rse.q
#$ -j y
#$ -m bea
#$ -M groadabike1@sheffield.ac.uk

fullhost=`hostname -f`


if [[ ${fullhost} == *"sharc"* ]] ; then
  module load apps/python/conda
  source activate pys27   #myexperiment
  module load libs/CUDA/7.5.18/binary 
  module load dev/gcc/5.4
fi

. ./path.sh || exit 1
. ./cmd.sh || exit 1

nj=4           	# number of parallel jobs
lm_order=2     	# language model order (n-gram quantity)
feature="mfcc" 	# Change by plp or mfcc for selected feature
mail=1          # Send mail with Cleanup results and current state of the log
                # 0 = No logs by mail
                # 1 = Logs by mail
cleanup=1       # Perform Cleanup

email="groadabike1@shef.ac.uk"   # Email account used when mail=1

filename=${PWD##*/}

local=data/local
echo
echo "===== STARTING PROCESS $filename =====" | tr [a-z] [A-Z]
echo

[ ! -L "wav" ] && ln -s $DATA_ROOT

echo "Using steps and utils from wsj recipe"

[ ! -L "steps" ] && ln -s $KALDI_ROOT/egs/wsj/s5/steps
[ ! -L "utils" ] && ln -s $KALDI_ROOT/egs/wsj/s5/utils

utils/parse_options.sh || exit 1
[[ $# -ge 1 ]] && { echo "Wrong arguments!"; exit 1; } 



# Prepare Acoustic Files, Features and Language Model

local/prepare_am_feature_lm.sh $nj $lm_order $feature

# Run different models from MONO to DNN

local/run_mono.sh $mail $nj $cleanup

local/run_tri1.sh $mail $nj $cleanup

local/run_tri2a.sh $mail $nj $cleanup

local/run_tri2b.sh $mail $nj $cleanup

local/run_tri3b.sh $mail $nj $cleanup

#echo "===== Create FMLLR features ====="
#local/run_raw_fmllr.sh $nj


echo "===== RUN DNN ====="
#local/nnet/run_dnn.sh $nj --stage=3
local/nnet/run_dnn_fbank.sh $nj


echo
echo "===== run.sh script is finished ====="
echo

@groadabike Are you able to run this interactively?

@willfurnass You're more familiar with bash scripting than I am; is there anything that jumps out at you?

@groadabike: a few suggestions:

  • Querying the environment variable SGE_CLUSTER_NAME is the easiest way to check whether your script is running on sharc or iceberg or something else (a short sketch combining this with set -u follows the trap example below).
  • Naming a variable local is not a good idea as it is a shell built-in function (check using type -a local)
  • Where is DATA_ROOT defined? Might be an idea to check for undefined variables by adding set -u near the top of your script (see https://www.davidpashley.com/articles/writing-robust-shell-scripts/).
  • On a related note, try adding the following near the top of your script to give you info about how your script fails if a command returns a non-zero exit status:
handle_error () {
    errcode=$?
    echo "Error code: $errcode" 
    echo "Errored command: " echo "$BASH_COMMAND" 
    echo "Error on line: ${BASH_LINENO[0]}"
    exit $errcode
}
trap handle_error ERR  
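
Putting the SGE_CLUSTER_NAME and set -u suggestions together, the top of the job script could look roughly like this. It is only a sketch: the module/conda lines are copied from the script above, and it assumes Grid Engine exports SGE_CLUSTER_NAME as sharc or iceberg on the two clusters and leaves it unset on a local machine:

#!/bin/bash
set -u    # treat any use of an undefined variable (e.g. a missing DATA_ROOT) as an error

# SGE_CLUSTER_NAME is set by Grid Engine; ${...:-} keeps set -u happy when it is unset locally
case "${SGE_CLUSTER_NAME:-}" in
    sharc)
        module load apps/python/conda
        source activate pys27
        module load libs/CUDA/7.5.18/binary
        module load dev/gcc/5.4
        ;;
    iceberg)
        : # Iceberg-specific module loads would go here
        ;;
    *)
        echo "SGE_CLUSTER_NAME not set; assuming a local machine"
        ;;
esac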

Out of curiosity, what do local/run_*.sh do?

@twinkarma, I can't test it interactively because Kaldi is constructed in a way that you don't run one big job; instead you run several one-hour jobs that must run in sequence, as the result of one job is a requirement for the next. This means that I am constantly adding jobs to the queue.
When I tried to run it interactively I wasn't able to finish even the first sub-job because of the long time spent in the queue, and I didn't want to leave an interactive session open all night.

@willfurnass

  • Querying the environment variable SGE_CLUSTER_NAME is the easiest way to check whether your script is running on sharc or iceberg or something else.

Thank you, I was using $fullhost because I needed to know whether I was running on my local computer or on Iceberg/ShARC, but I can change that check to use the environment variable SGE_CLUSTER_NAME, which is better.

  • Naming a variable local is not a good idea as it is a shell built-in function (check using type -a local)
    Thank you so much

  • Where is DATA_ROOT defined? Might be an idea to check for undefined variables by adding set -u near the top of your script (see https://www.davidpashley.com/articles/writing-robust-shell-scripts/).
    DATA_ROOT is defined in path.sh

  • On a related note, try adding the following near the top of your script to give you info about how your script fails if a command returns a non-zero exit status:

Thanks, I will try that.

Out of curiosity, what do local/run_*.sh do?
Every Kaldi recipe has a local directory. This is where all the scripts for the sub-steps of the speech recognition project are kept.
In my case, the local/run_*.sh scripts are the different GMM scripts plus the DNN script.
For example, local/run_mono.sh is where I call the GMM monophone training and alignment.

Thank you so much

Another thought: if you have a 'control' job that executes 'compute' jobs then you might want to look at using a workflow manager to submit and supervise Grid Engine jobs; this may help keep your workflow management fairly clean (a rough qsub-only sketch follows the list below).

  • I know people at TUOS use Ruffus to define workflows and submit tasks to Grid Engine.
  • I keep hearing good things about NextFlow, which can also use Grid Engine to run tasks.
  • Thirdly, a simple option would be to use a basic Makefile to run a sequence of tasks in a way that tasks are only re-run if dependencies change or outputs are yet to be generated.
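
Even without one of those tools, a strictly linear chain of Grid Engine jobs can be submitted in one go using qsub's -hold_jid option, so each stage only starts once the previous one has finished and no interactive session needs to stay open. A rough sketch; the per-stage .sge scripts here are hypothetical wrappers around the existing local/run_*.sh steps:

#!/bin/bash
# -terse makes qsub print only the job ID, which the next stage then waits on via -hold_jid
prep_id=$(qsub -terse prepare_am_feature_lm.sge)
mono_id=$(qsub -terse -hold_jid "$prep_id" run_mono.sge)
tri1_id=$(qsub -terse -hold_jid "$mono_id" run_tri1.sge)
dnn_id=$(qsub -terse -hold_jid "$tri1_id" run_dnn_fbank.sge)
echo "Submitted chain: $prep_id -> $mono_id -> $tri1_id -> $dnn_id"

Unlike Ruffus, Nextflow or make, though, this will not skip stages whose outputs already exist, so it is more of a stopgap than a workflow manager.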