example parallel command usage for speed-up

Question

Opened this issue 2 years ago · 2 comments

I used the following scheme to process thousands of input proteins in a more reasonable time.
Maybe this can help others!

Before using many cores, check that you have enough RAM: each parallel job starts its own Java process.
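
One rough way to choose the job count is to cap it by both core count and available memory. The sketch below is only an illustration: the helper name max_jobs is hypothetical, and the 4096 MiB per job in the usage comment is an assumed placeholder, not a measured ECPred requirement.

```shell
# Hypothetical helper: pick a job count limited by both CPU cores and memory.
# Usage: max_jobs <cores> <available_MiB> <MiB_per_job>
max_jobs() {
  local cores by_mem
  cores=$1
  by_mem=$(( $2 / $3 ))        # how many jobs fit into the available memory
  if [ "$by_mem" -lt "$cores" ]; then
    cores=$by_mem              # memory is the tighter limit
  fi
  if [ "$cores" -lt 1 ]; then
    cores=1                    # always allow at least one job
  fi
  echo "$cores"
}

# Example on Linux (the per-job figure is an assumption; measure your own):
# pthr=$(max_jobs "$(nproc)" "$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)" 4096)
```

Watching one single-job run with top or a similar tool gives a more realistic per-job figure than any fixed guess.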

# ECPred is installed for me at /opt/biotools/ECPred, edit for your own path
ECPRED_PATH=/opt/biotools/ECPred

# split the multi-FASTA into single FASTA files, one per protein (faSplit is from the UCSC tools)
mkdir splitseqs
faSplit byname multi-proteins.fa splitseqs/

# run the prediction in parallel with N parallel jobs
pthr=48
mkdir results

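# double quotes let the current shell expand ${ECPRED_PATH} and $PWD;
# {} is the input path and {/} its basename (both filled in by parallel)
find splitseqs -type f -name '*.fa' | \
  parallel -j ${pthr} -k "java -jar ${ECPRED_PATH}/ECPred.jar \
    weighted {} \
    ${ECPRED_PATH} \
    $PWD \
    results/{/}_out"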

# collect and merge results
echo -e "Protein ID\tEC Number\tConfidence Score(max 1.0)" > ECPred_results.tsv
cat results/*_out | grep -v '^Protein' | sort -k 1V,1 >> ECPred_results.tsv
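
With thousands of jobs, a few may fail silently, for example when memory runs out. A minimal sketch for spotting inputs that have no non-empty result file follows. The function name missing_results is hypothetical, and it assumes the splitseqs/*.fa and results/<name>_out naming used in the commands above.

```shell
# Hypothetical helper: list input files that have no non-empty result file.
# Usage: missing_results <input_dir> <results_dir>
missing_results() {
  local in_dir out_dir f
  in_dir=$1
  out_dir=$2
  for f in "$in_dir"/*.fa; do
    [ -e "$f" ] || continue    # the glob matched nothing
    # -s is true only if the result file exists and is not empty
    [ -s "$out_dir/$(basename "$f")_out" ] || echo "$f"
  done
}

# Example: collect the inputs that still need a (re)run
# missing_results splitseqs results > failed.txt
```

The resulting list can be piped back into the same parallel command in place of the find output to rerun only those inputs.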

Answer 1 · 2023-02-24T15:05:16.000Z

Thank you so much for this comment!

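You can also avoid the third-party tool faSplit (from the UCSC tools) by splitting with awk. The command below names each file after the first word of its header line and closes each file before opening the next; create the splitseqs directory first with mkdir:
awk '/^>/ {if (OUT) close(OUT); OUT="splitseqs/" substr($1,2) ".fa"} OUT {print > OUT}' multi-proteins.fa
If parallel is not installed, xargs -P ${pthr} can be used instead.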
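
Sequence IDs sometimes contain characters that are awkward in file names, such as | or /. The following is a sketch of the same awk split wrapped in a function that replaces such characters with underscores. The function name split_fasta and the choice of allowed characters are assumptions, not part of ECPred or faSplit.

```shell
# Hypothetical helper: split a multi-FASTA into one file per record, naming each
# file after the first word of its header with unsafe characters replaced by "_".
# Usage: split_fasta <multi.fa> <output_dir>
split_fasta() {
  local in_fa out_dir
  in_fa=$1
  out_dir=$2
  mkdir -p "$out_dir"
  awk -v dir="$out_dir" '
    /^>/ {
      if (out) close(out)               # avoid running out of open file handles
      id = substr($1, 2)                # header up to the first whitespace, without ">"
      gsub(/[^A-Za-z0-9._-]/, "_", id)  # keep only filename-safe characters
      out = dir "/" id ".fa"
    }
    out { print > out }                 # header and sequence lines go to the current file
  ' "$in_fa"
}

# Example:
# split_fasta multi-proteins.fa splitseqs
```

The header line inside each file is left unchanged, so the original ID is still what the prediction tool sees. Two IDs that differ only in replaced characters would map to the same file, so check the file count against the number of records.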

Answer 2 · 2023-11-28T18:52:49.000Z

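Thanks for your help. Here is the complete xargs command for reference:
# the sh -c wrapper makes basename run once per file; without it, $(basename {})
# is expanded by the calling shell before xargs substitutes {}
# (the name pattern must match your split files: .fa from the steps above)
find splitseqs -type f -name '*.fa' | \
xargs -P ${pthr} -I {} sh -c 'java -jar "$1/ECPred.jar" \
weighted "$2" \
"$1" \
"$PWD" \
"results/$(basename "$2")_out"' _ "${ECPRED_PATH}" {}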

# collect and merge results
echo -e "Protein ID\tEC Number\tConfidence Score(max 1.0)" > ECPred_results.tsv
cat results/*_out 2>/dev/null | grep -v '^Protein' | sort -k 1V,1 >> ECPred_results.tsv