dimiboeckaerts/PhageRBPdetection

Issue with output - domains are all 0, but xgboost are not. Suspect issue with use of hmmpress/hmmscan

btemperton opened this issue · 2 comments

Hi,
I'm trying to install and run the standalone pipeline on the test dataset in sequences.fasta.
The current output in domains_test_predictions.csv is all 0s:

preds
0
0
0

But xgboost_test_predictions.csv contains positive predictions:

preds,score
1,0.9996086
1,0.99928814
1,0.9944469

The environment was created as follows:

conda create -n RBP -c conda-forge -c bioconda -y python=3.9 hmmer
conda activate RBP
pip install bio_embeddings[all]
pip install xgboost
pip install tensorflow-gpu

git clone https://github.com/dimiboeckaerts/PhageRBPdetection.git
cd PhageRBPdetection
python RBPdetect_standalone.py --dir data --hmmer_path ~/miniconda3/envs/RBP/bin

Output of pip freeze is attached.

Something weird is going on with your use of hmmpress and hmmscan. First, I'm not sure why you are switching to the directory containing HMMER to run those commands. If you're in a conda environment, they'll be on the PATH. If they're not, then you should be calling them by their full path, e.g. something like

command = f'{hmm_path}/hmmpress ' + pfam_file
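To make the suggestion above concrete, here is a rough sketch of resolving the HMMER executables without changing directory. The function names and the `hmmer_path` parameter are hypothetical (they only mirror the `--hmmer_path` flag) and are not taken from the script itself.

```python
import os
import shutil


def resolve_hmmer_tool(tool, hmmer_path=None):
    """Return a path to a HMMER executable without changing directory.

    `hmmer_path` is a hypothetical parameter mirroring --hmmer_path.
    """
    if hmmer_path:
        # An explicit install directory was given: build the full path.
        return os.path.join(os.path.expanduser(hmmer_path), tool)
    # Otherwise rely on PATH, as in an activated conda environment.
    found = shutil.which(tool)
    if found is None:
        raise FileNotFoundError(f"{tool} not found on PATH")
    return found


def hmmpress_command(hmm_file, hmmer_path=None):
    """Build the hmmpress command as an argument list (no shell needed)."""
    return [resolve_hmmer_tool("hmmpress", hmmer_path), hmm_file]
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)`, which avoids both the directory switch and shell quoting problems with paths.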

Second, hmmpress should create four files:

Models pressed into binary file:   data/RBPdetect_phageRBPs.hmm.h3m
SSI index for binary model file:   data/RBPdetect_phageRBPs.hmm.h3i
Profiles (MSV part) pressed into:  data/RBPdetect_phageRBPs.hmm.h3f
Profiles (remainder) pressed into: data/RBPdetect_phageRBPs.hmm.h3p

Indeed, it does if you manually run

hmmpress data/RBPdetect_phageRBPs.hmm

But once the script has finished running, only one of the hmmpress files remains in the data folder: data/RBPdetect_phageRBPs.hmm.h3i

I can't figure out why the other three files are being deleted, but the remaining file also stops the database from being re-pressed on the next run.
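A minimal sketch of a check that would avoid this, assuming the script decides whether to press based on files on disk: treat the database as pressed only when all four index files listed above are present. The helper names here are hypothetical.

```python
import os

# The four index files hmmpress writes next to the .hmm file.
HMMPRESS_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def missing_index_files(hmm_file):
    """Return the hmmpress index files that are absent for `hmm_file`."""
    return [
        hmm_file + suffix
        for suffix in HMMPRESS_SUFFIXES
        if not os.path.exists(hmm_file + suffix)
    ]


def needs_hmmpress(hmm_file):
    """True if at least one index file is missing, so pressing is required."""
    return bool(missing_index_files(hmm_file))
```

When only some of the files are left over, hmmpress has a `-f` option that overwrites existing index files, so a partial set can be rebuilt rather than blocking the run.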

RBP-pip.txt

I managed to fix the code so that it runs, and so that hmmpress is only run when it is needed. I also:

  1. Removed the need to switch to the HMM directory.
  2. Allowed the user to specify an input file, rather than having to rename it to sequences.fasta in the data folder.
  3. Created a combined output file of both the domain search and the xgboost search. A protein record is included in this file if it is detected by either or both methods.
  4. Allowed the user to prefix the output, to enable running on multiple files in a bash loop. By default, it adds today's date to the output.
    RBPdetect_standalone.py.zip
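For readers without the attachment, points 3 and 4 could look roughly like the sketch below. This is not the attached code: the function names, the column names and the `RBP-out.tbl` suffix (guessed from the sample attachment name) are all assumptions.

```python
from datetime import date


def combine_predictions(names, domain_preds, xgb_preds, xgb_scores):
    """Keep a protein if the domain search or xgboost (or both) flags it."""
    rows = []
    for name, dom, xgb, score in zip(names, domain_preds, xgb_preds, xgb_scores):
        if dom == 1 or xgb == 1:
            rows.append(
                {"name": name, "domain_pred": dom, "xgb_pred": xgb, "xgb_score": score}
            )
    return rows


def output_name(prefix=None, suffix="RBP-out.tbl"):
    """Build the output file name; the prefix defaults to today's date."""
    prefix = prefix or date.today().isoformat()
    return f"{prefix}-{suffix}"
```

With a per-file prefix, each input in a loop gets its own output file instead of overwriting the previous one.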

I can't quite figure out how to make a pull request so I've just attached my code here.

Sample of combined output also attached:
CPL00161-RBP-out.tbl.zip

Hi Ben, thank you so much for reaching out about the problem you had and for making the fixes yourself to get it working! The changes you propose make a lot of sense. I will integrate them into the repository and credit your contributions. Thanks a lot!