Pipeline for parrallel submission of BLASTX diamond jobs (SLURM scheduler)
Installation of Diamond v2.0.14 is required. On CSC Puhti, the tool is pre-installed in the module biokit).
If the system doesn't have Diamond install, a conda environment can be created as followed:
conda create -n diamond_env
conda activate diamond_env
conda install -c bioconda diamond
conda deactivate
Next, clone this repository on your working folder:
git clone https://github.com/aponsero/HPC_DiamondX.git
Diamond provides an easy and fast way to create a custom protein database of reference.
Beware, make sure to remove any extra spaces or trailing spaces from the file. Extra spaces or incorrectly formated fasta file will stop the database building job
sinteractive
module load biokit
diamond makedb --in [my_references.faa] -d [NAME_DATABASE] -p 4
If the command worked, you should now have a NAME_DATABASE.dmnd file created.
CSC has some database pre-indexed, see https://docs.csc.fi/apps/diamond/ for more informations
Inputs and output files:
-
[IN_DIR]= Input directory with the fasta samples to process
-
[OUT_DIR]= Output directory where to save the diamond output
Diamond configurations:
-
[DB]= custom or non-custom diamond database to use
-
[MAX_TAR]= max targets match per query
This imput file needs to contain the list of files you want to process. Each file on one line.
Open the 1_Diamond.sh file and add the correct project ID for billing. Submit the jobs as:
./Submit_Diamond.sh
filtering step to calculate relative abundance of protein hits