Issue with output - domains are all 0, but xgboost are not. Suspect issue with use of hmmpress/hmmscan
btemperton opened this issue · 2 comments
Hi,
I'm trying to install and run the standalone pipeline with the test dataset found in sequences.fasta.
My current output gives all 0s in domains_test_predictions.csv
preds
0
0
0
But gives different values in xgboost_test_predictions.csv
preds,score
1,0.9996086
1,0.99928814
1,0.9944469
The environment was created as follows:
conda create -n RBP -c conda-forge -c bioconda -y python=3.9 hmmer
conda activate RBP
pip install bio_embeddings[all]
pip install xgboost
pip install tensorflow-gpu
git clone https://github.com/dimiboeckaerts/PhageRBPdetection.git
cd PhageRBPdetection
python RBPdetect_standalone.py --dir data --hmmer_path ~/miniconda3/envs/RBP/bin
Output of pip freeze is attached.
Something weird is going on with your use of hmmpress and hmmscan. First, I'm not sure why you are switching to the directory containing hmmer to run those commands. If you're in a conda environment, they'll be on the path. If they're not, then you should be calling them from their directory, e.g. something like:
command = f'{hmm_path}/hmmpress ' + pfam_file
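A minimal sketch of that idea, assuming the user may or may not supply the HMMER directory. The helper names `hmmer_command` and `run_hmmpress` are hypothetical and not part of the repository. Passing the command as a list to `subprocess.run` avoids both `os.chdir` and shell quoting problems.

```python
import os
import shutil
import subprocess


def hmmer_command(tool, hmmer_path=None):
    """Return the executable to call for an HMMER tool (hypothetical helper).

    If hmmer_path is given, the tool is taken from that directory;
    otherwise it is looked up on the PATH (e.g. inside a conda environment).
    """
    if hmmer_path:
        return os.path.join(os.path.expanduser(hmmer_path), tool)
    found = shutil.which(tool)
    if found is None:
        raise FileNotFoundError(f"{tool} not found on PATH; pass hmmer_path")
    return found


def run_hmmpress(pfam_file, hmmer_path=None):
    """Run hmmpress on pfam_file without changing the working directory."""
    cmd = [hmmer_command("hmmpress", hmmer_path), pfam_file]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

With this, `run_hmmpress("data/RBPdetect_phageRBPs.hmm")` would work inside an activated conda environment, and `run_hmmpress(..., hmmer_path="~/miniconda3/envs/RBP/bin")` would work outside one.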
Second, hmmpress should create four files:
Models pressed into binary file: data/RBPdetect_phageRBPs.hmm.h3m
SSI index for binary model file: data/RBPdetect_phageRBPs.hmm.h3i
Profiles (MSV part) pressed into: data/RBPdetect_phageRBPs.hmm.h3f
Profiles (remainder) pressed into: data/RBPdetect_phageRBPs.hmm.h3p
Indeed, it does if you manually run
hmmpress data/RBPdetect_phageRBPs.hmm
But once the script has finished running, only one of the hmmpress output files remains in the data folder: data/RBPdetect_phageRBPs.hmm.h3i
I can't figure out why three of the files are being deleted, but the one remaining file also stops the model from being re-pressed on the next run.
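One way to guard against this is to check for all four index files rather than just one. This is a rough sketch, not the code in the repository; the names `PRESS_SUFFIXES`, `missing_press_files` and `needs_press` are made up for illustration.

```python
import os

# The four index files hmmpress writes next to the .hmm file.
PRESS_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def missing_press_files(hmm_file):
    """Return the pressed index files that do not exist for hmm_file."""
    return [hmm_file + s for s in PRESS_SUFFIXES
            if not os.path.exists(hmm_file + s)]


def needs_press(hmm_file):
    """True if any of the four index files is missing."""
    return bool(missing_press_files(hmm_file))
```

If `needs_press` returns True while some index files are still present, those leftovers would probably need to be removed before hmmpress is run again, since a partial set is exactly the state described above.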
I managed to fix the code to make it run, and to run hmmpress only if it was needed. I also:
- Removed the need to switch to the HMM directory.
- Allowed the user to specify an input file, rather than renaming files to sequences.fasta in the data folder.
- Created a combined output file of both the domain search and the xgboost search. A protein record is included in this file if it is detected by either or both methods.
- Allowed the user to prefix the output to enable running on multiple files in a bash loop. By default, it adds today's date to the output.
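For illustration only, the "either or both" rule and the dated default prefix could look roughly like this. It is not the attached code: the function names, the tuple layout and the ISO date format are all assumptions.

```python
from datetime import date


def default_prefix(today=None):
    """Default output prefix: today's date in ISO format (assumed format)."""
    return (today or date.today()).isoformat()


def combine_predictions(protein_ids, domain_preds, xgb_preds):
    """Keep proteins flagged (prediction == 1) by either method.

    Returns a list of (protein_id, domain_pred, xgb_pred) tuples.
    """
    combined = []
    for pid, dom, xgb in zip(protein_ids, domain_preds, xgb_preds):
        if dom == 1 or xgb == 1:
            combined.append((pid, dom, xgb))
    return combined
```

A protein with domain prediction 0 and xgboost prediction 1 is kept, while one with 0 for both is dropped.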
RBPdetect_standalone.py.zip
I can't quite figure out how to make a pull request so I've just attached my code here.
A sample of the combined output is also attached: CPL00161-RBP-out.tbl.zip
Hi Ben, thank you so much for reaching out about the problem you had and making fixes yourself to get it working! Indeed, the changes you propose make a lot of sense, I will integrate them into the repository and credit your contributions. Thanks a lot!