Step 1: Download the Non-redundant (NR) Proteins Database:
user@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz'
user@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz.md5'
Note: The database has 39 segments. The download is ~100 GB compressed and grows to ~450 GB after extraction; these sizes change frequently.
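The `.md5` files fetched by the second command can be used to check each segment before extraction. The following is a minimal sketch in Python, assuming each `.md5` file stores the hex digest as its first whitespace-separated token (the usual `md5sum` layout). `md5_of_file` and `verify_segment` are hypothetical helper names, not part of any NCBI tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_segment(archive_path, md5_path=None):
    """Compare an archive's MD5 with the digest stored in its .md5 file.

    Assumption: the first whitespace-separated token in the .md5 file
    is the expected hex digest.
    """
    if md5_path is None:
        md5_path = archive_path + '.md5'
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

Calling `verify_segment('nr.00.tar.gz')` would return `True` only when the downloaded segment matches its checksum; a segment that fails is best downloaded again before Step 2.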
Step 2: Extract the Non-redundant (NR) Proteins Database:
user@machine:~$ gunzip nr.*.tar.gz
### With n segments, the loop runs from 0 to n-1 (here: 39 segments, so 0 to 38).
for i in $(seq 0 1 38); do
    if [ "$i" -lt 10 ]; then
        tar -xvf nr.0$i.tar
    else
        tar -xvf nr.$i.tar
    fi
done
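The two branches of the loop exist only to zero-pad single-digit segment numbers. As an illustration, the same extraction can be sketched in Python with a format specifier and the standard `tarfile` module. This assumes the segments are named `nr.00.tar` to `nr.<n-1>.tar` as in the loop; `segment_names` and `extract_segments` are hypothetical helper names.

```python
import os
import tarfile


def segment_names(count, suffix='.tar'):
    """Return ['nr.00.tar', ..., 'nr.<count-1>.tar'] with two-digit padding."""
    return ['nr.{:02d}{}'.format(i, suffix) for i in range(count)]


def extract_segments(count, source='.', destination='.'):
    """Extract every segment archive found in source into destination."""
    for name in segment_names(count):
        path = os.path.join(source, name)
        # Mode 'r:*' lets tarfile detect the compression on its own,
        # so the same call works for plain .tar and for .tar.gz files.
        with tarfile.open(path, 'r:*') as archive:
            archive.extractall(destination)
        print('extracted', name)
```

With 39 segments, `extract_segments(39)` would process `nr.00.tar` through `nr.38.tar` from the current directory. Only trusted archives should be extracted this way.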
Step 3: Generate PSSM:
###
database = '/home/learning/mrzResearchArena/NR/nr'  # Please set the path to the extracted "nr" database (directory plus the "nr" prefix).
PSSM = '/home/learning/mrzResearchArena/PSSM'       # Please set the path to the PSSM directory; the *.fasta files are read from it and the results are written to it.
core = 8                                            # Number of worker processes, e.g. multiprocessing.cpu_count()
###

###
import multiprocessing
import time
import glob
import os

os.chdir(PSSM)
###

###
def runPSIBLAST(file):
    # Run three PSI-BLAST iterations for one FASTA file; writes <file>.out and <file>.pssm.
    command = ('psiblast -query {} -db {} -out {}.out -num_iterations 3 '
               '-out_ascii_pssm {}.pssm -inclusion_ethresh 0.001 '
               '-comp_based_stats 0 -num_threads 1').format(file, database, file, file)
    # os.system() does not raise when the command fails, so check its exit status instead.
    if os.system(command) != 0:
        print('PSI-BLAST failed for the sequence {}!'.format(file))
        return '{}, is error.'.format(file)
    return '{}, is done.'.format(file)
#end-def
###

###
if __name__ == '__main__':
    begin = time.time()
    pool = multiprocessing.Pool(processes=core)
    # Submit one job per FASTA file, then wait for all of them to finish.
    results = [pool.apply_async(runPSIBLAST, args=(file,)) for file in glob.glob('*.fasta')]
    outputs = [result.get() for result in results]
    pool.close()
    pool.join()
    end = time.time()

    print(sorted(outputs))
    print()
    print('Time elapsed: {} seconds.'.format(end - begin))
###
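Because `runPSIBLAST` builds one command string and hands it to a shell, a FASTA file name containing spaces or shell metacharacters would break the call. The sketch below shows one possible alternative that passes the same flags as an argument list through `subprocess.run`. It assumes `psiblast` is on the `PATH`; `build_psiblast_command` and `run_psiblast` are hypothetical names introduced for illustration.

```python
import subprocess


def build_psiblast_command(fasta, database, iterations=3, threads=1):
    """Build the psiblast argument list with the same flags as the script above."""
    return [
        'psiblast',
        '-query', fasta,
        '-db', database,
        '-out', fasta + '.out',
        '-num_iterations', str(iterations),
        '-out_ascii_pssm', fasta + '.pssm',
        '-inclusion_ethresh', '0.001',
        '-comp_based_stats', '0',
        '-num_threads', str(threads),
    ]


def run_psiblast(fasta, database):
    """Run psiblast for one FASTA file and return (fasta, succeeded)."""
    try:
        completed = subprocess.run(build_psiblast_command(fasta, database),
                                   stderr=subprocess.PIPE, text=True)
    except FileNotFoundError:
        # Raised when the psiblast executable cannot be found on PATH.
        print('psiblast was not found; is BLAST+ installed?')
        return fasta, False
    if completed.returncode != 0:
        print('PSI-BLAST failed for {}: {}'.format(fasta, completed.stderr.strip()))
    return fasta, completed.returncode == 0
```

`run_psiblast` takes the database path explicitly, so it could be submitted to the pool with `pool.apply_async(run_psiblast, args=(file, database))` in place of `runPSIBLAST`.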