Error running a job on ShARC
Closed this issue · 3 comments
Issue from @groadabike
Gerardo is getting an error when submitting a job on ShARC with `qsub`:

`queue.pl: Error submitting jobs to queue (return status was 32512)`

The job script is as follows:
```bash
#!/bin/bash
#$ -l gpu=1
#$ -P rse
#$ -q rse.q
#$ -j y
#$ -m bea
#$ -M groadabike1@sheffield.ac.uk

fullhost=$(hostname -f)
if [[ ${fullhost} == *"sharc"* ]] ; then
    module load apps/python/conda
    source activate pys27 #myexperiment
    module load libs/CUDA/7.5.18/binary
    module load dev/gcc/5.4
fi

. ./path.sh || exit 1
. ./cmd.sh || exit 1

nj=4            # number of parallel jobs
lm_order=2      # language model order (n-gram quantity)
feature="mfcc"  # change to plp or mfcc to select the feature type
cleanup=1       # perform cleanup
mail=1          # send mail with cleanup results and current state of the log
                # 0 = no logs by mail
                # 1 = logs by mail
email="groadabike1@shef.ac.uk"  # email account used when mail=1

filename=${PWD##*/}
local=data/local

echo
echo "===== STARTING PROCESS $filename =====" | tr '[a-z]' '[A-Z]'
echo

[ ! -L "wav" ] && ln -s $DATA_ROOT wav

echo "Using steps and utils from wsj recipe"
[ ! -L "steps" ] && ln -s $KALDI_ROOT/egs/wsj/s5/steps
[ ! -L "utils" ] && ln -s $KALDI_ROOT/egs/wsj/s5/utils

utils/parse_options.sh || exit 1
[[ $# -ge 1 ]] && { echo "Wrong arguments!"; exit 1; }

# Prepare acoustic files, features and language model
local/prepare_am_feature_lm.sh $nj $lm_order $feature

# Run different models from MONO to DNN
local/run_mono.sh $mail $nj $cleanup
local/run_tri1.sh $mail $nj $cleanup
local/run_tri2a.sh $mail $nj $cleanup
local/run_tri2b.sh $mail $nj $cleanup
local/run_tri3b.sh $mail $nj $cleanup

#echo "===== Create FMLLR features ====="
#local/run_raw_fmllr.sh $nj
echo "===== RUN DNN ====="
#local/nnet/run_dnn.sh $nj --stage=3
local/nnet/run_dnn_fbank.sh $nj

echo
echo "===== run.sh script is finished ====="
echo
```
@groadabike Are you able to run this interactively?
@willfurnass You're more familiar with bash scripting than I am; is there anything that jumps out at you?
@groadabike: a few suggestions:

- Querying the environment variable `SGE_CLUSTER_NAME` is the easiest way to check whether your script is running on `sharc`, `iceberg` or something else.
- Naming a variable `local` is not a good idea, as it is a shell builtin function (check using `type -a local`).
- Where is `DATA_ROOT` defined? It might be an idea to check for undefined variables by adding `set -u` near the top of your script (see https://www.davidpashley.com/articles/writing-robust-shell-scripts/).
- On a related note, try adding the following near the top of your script to give you info about how your script fails if a command returns a non-zero exit status:
```bash
handle_error () {
    errcode=$?
    echo "Error code: $errcode"
    echo "Errored command: $BASH_COMMAND"
    echo "Error on line: ${BASH_LINENO[0]}"
    exit $errcode
}
trap handle_error ERR
```
Out of curiosity, what do the `local/run_*.sh` scripts do?
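To see what the suggested `set -u` catches, here is a minimal, self-contained sketch; the variable name `UNDEFINED_DATA_ROOT` is a deliberately unset stand-in for `DATA_ROOT`:

```shell
#!/bin/bash
# With `set -u` enabled, expanding an unset variable aborts the shell
# with an "unbound variable" error instead of silently producing an
# empty string (which is how `ln -s $DATA_ROOT` could misbehave).
if ( set -u; : "$UNDEFINED_DATA_ROOT" ) 2>/dev/null; then
    echo "variable was set"
else
    echo "caught unset variable"
fi
```

The failing expansion is confined to a subshell here so the demo itself exits cleanly; in a real script the error would stop the run at the faulty line.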
@twinkarma, I can't test it interactively because Kaldi is constructed in such a way that you don't run one big job but several one-hour jobs that must run in sequence, as the result of one job is a requirement of the next one. This means that I am constantly adding jobs to the queue.
When I tried to run it interactively I wasn't able to finish the first sub-job because of the long time spent in the queue, and I didn't want to leave an interactive session open all night.
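One lightweight way to run such a sequence without keeping a session open is Grid Engine's job-dependency flag: `qsub -hold_jid <name>` holds a job until the named earlier job (submitted with `-N <name>`) has finished. A sketch using stage scripts from the run script above; the `QSUB=echo` dry-run knob is my addition so the sketch can be exercised without Grid Engine:

```shell
#!/bin/bash
# Submit dependent Kaldi stages as separate Grid Engine jobs, each
# held until the previous stage completes. -hold_jid takes the job
# name given with -N. Set QSUB=echo to dry-run without Grid Engine.
QSUB="${QSUB:-qsub}"

$QSUB -N prep local/prepare_am_feature_lm.sh
$QSUB -N mono -hold_jid prep local/run_mono.sh
$QSUB -N tri1 -hold_jid mono local/run_tri1.sh
```

Each stage then waits in the queue on its own, so no interactive session needs to stay open overnight.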
> Querying the environment variable `SGE_CLUSTER_NAME` is the easiest way to check whether your script is running on `sharc`, `iceberg` or something else.

Thank you. I was using `$fullhost` because I needed to know whether I was running it on my local computer or on Iceberg/ShARC, but I can modify that validation to use the environment variable `SGE_CLUSTER_NAME`, which is better.
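That modified validation could look something like the sketch below; the exact values `sharc` and `iceberg` for `SGE_CLUSTER_NAME` are assumptions to verify on each cluster (e.g. with `echo $SGE_CLUSTER_NAME` in a test job):

```shell
#!/bin/bash
# Branch on the cluster rather than on `hostname -f`. Grid Engine sets
# SGE_CLUSTER_NAME on cluster nodes; on a local machine it is normally
# unset, so default it to empty with ${VAR:-} (also safe under set -u).
cluster="${SGE_CLUSTER_NAME:-}"

case "$cluster" in
    sharc)
        echo "on ShARC: would load ShARC modules"
        ;;
    iceberg)
        echo "on Iceberg: would load Iceberg modules"
        ;;
    *)
        echo "local machine: no modules to load"
        ;;
esac
```

The `module load` calls from the original script would replace the `echo` placeholders in the matching branch.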
> Naming a variable `local` is not a good idea, as it is a shell builtin function (check using `type -a local`).

Thank you so much.
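For reference, `type -a` shows why the name clashes, and the fix is just a rename; the name `local_dir` is hypothetical:

```shell
#!/bin/bash
# `type -a` lists every meaning bash knows for a name; for `local` it
# reports the shell builtin, which is why shadowing it with a variable
# invites confusion.
type -a local

# A harmless rename for the directory variable:
local_dir=data/local
echo "$local_dir"
```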
> Where is `DATA_ROOT` defined? It might be an idea to check for undefined variables by adding `set -u` near the top of your script (see https://www.davidpashley.com/articles/writing-robust-shell-scripts/).

`DATA_ROOT` is defined in `path.sh`.
> On a related note, try adding the following near the top of your script to give you info about how your script fails if a command returns a non-zero exit status:

Thanks, I will try that.
> Out of curiosity, what do the `local/run_*.sh` scripts do?

Every Kaldi recipe has a `local` directory. This is where all the scripts for the sub-steps of the speech recognition project are saved. In my case, the `local/run_*.sh` scripts are the different GMM scripts and the DNN script. For example, `local/run_mono.sh` is where I call the GMM monophone training and alignment.
Thank you so much
Another thought: if you have a 'control' job that executes 'compute' jobs then you might want to look at using a workflow manager to submit and supervise Grid Engine jobs; this may help keep your workflow management fairly clean.
- I know people at TUOS use Ruffus to define workflows and submit tasks to Grid Engine.
- I keep hearing good things about NextFlow, which can also use Grid Engine to run tasks.
- Thirdly, a simple option would be to use a basic Makefile to run a sequence of tasks in a way that tasks are only re-run if dependencies change or outputs are yet to be generated.