instadeepai/InstaNovo

checkpoint model

hxxhust163 opened this issue · 6 comments

hi

Thanks for your work in developing InstaNovo. I have installed it on my Linux system following the installation instructions, but I ran into problems using it. Please help, thanks!

You provided the command line options like this:
"usage: python -m instanovo.transformer.predict data_path model_path [-h] [--denovo] [--config CONFIG] [--subset SUBSET] [--knapsack_path KNAPSACK_PATH] [--n_workers N_WORKERS]"

How do I get the model checkpoint? I downloaded the 0.1.4 release, unzipped the file, and ran the following command:
" python -m instanovo.transformer.predict /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/SYSMHC00023_20141212_QEp7_MiBa_SA_HLA-I-p_MM16_1_B_convertion /data/xhuang/instanovo_yeast/instanovo_yeast_cfg/ --denovo --output_path /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/instanovo"

and got an error:
IsADirectoryError: [Errno 21] Is a directory: '/data/xhuang/instanovo_yeast/instanovo_yeast_cfg/'

So, how do I set the "model_path" correctly? And is there example data and a command that show how to use the software? I think a demo would help users a lot.
Thanks again!

best regards
xiaoxiang

Hi @hxxhust163, thanks for reaching out

You need to download instanovo.pt or instanovo_yeast.pt from the 0.1.4 release page.
Then to run the prediction script, make sure you point to the instanovo.pt or instanovo_yeast.pt file for the model_path argument.

So, your command should look something like this:

python -m instanovo.transformer.predict /path/to/data.ipc /path/to/instanovo_yeast.ipc --denovo --output_path /path/to/output.csv

Note: the input should be a .ipc file, and the output should be a .csv file

Apologies, the README is a bit vague regarding the model checkpoints. This will be updated soon. For now, the only demo we have is the Colab notebook.

If you want to run InstaNovo on your own data, please use the data conversion scripts from the latest release. They allow you to convert from .mgf or .mzml to the .ipc format that InstaNovo requires in the predict function. In future versions we will do the conversion automatically, but for now please see https://github.com/instadeepai/InstaNovo#using-your-own-datasets

Thanks for your reply!

I converted the data, downloaded instanovo_yeast.pt, and ran the following command:

python -m instanovo.transformer.predict /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/SYSMHC00023_20141212_QEp7_MiBa_SA_HLA-I-p_MM16_1_B_convertion /data/xhuang/instanovo_yeast.pt --denovo --output_path /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/instanovo.csv

But I get the following log and it hangs at '0%' for hours. Is that right? Note that I used instanovo_yeast.pt instead of instanovo_yeast.ipc as the model. Does that matter?

Thanks very much!

INFO:root:Initializing inference.
INFO:root:Loading data from /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/SYSMHC00023_20141212_QEp7_MiBa_SA_HLA-I-p_MM16_1_B_convertion
INFO:root:Data loaded, evaluating 100.0%, 59722 samples in total.
INFO:root:Knapsack path missing or not specified, generating...
INFO:root:Scaling masses.
INFO:root:Initializing chart.
INFO:root:Performing search.
0%| | 0/934 [00:00<?, ?it/s]

If it does, how can I convert instanovo_yeast.pt into instanovo_yeast.ipc? Thanks!

Hi, instanovo_yeast.pt is the correct model. Only the data should be in the .ipc format, and from the logs it seems to have loaded fine.
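
For anyone hitting the same IsADirectoryError, a small pre-flight check along these lines could catch path mix-ups before a long run. This is only a sketch based on the conventions stated in this thread (checkpoint is a .pt file, output is a .csv file); `check_predict_paths` is a hypothetical helper, not part of InstaNovo.

```python
from pathlib import Path


def check_predict_paths(data_path, model_path, output_path):
    """Return a list of problems found; an empty list means the paths look fine.

    Hypothetical helper: it only encodes the file conventions mentioned in
    this thread and does not call InstaNovo itself.
    """
    problems = []

    model = Path(model_path)
    if model.is_dir():
        # Passing a directory is what triggers IsADirectoryError in the thread.
        problems.append(f"model_path is a directory, expected a checkpoint file: {model}")
    elif not model.is_file():
        problems.append(f"model_path does not exist: {model}")
    elif model.suffix != ".pt":
        problems.append(f"model_path should be a .pt checkpoint, got: {model.name}")

    # The data path in this thread has no .ipc suffix yet loaded fine,
    # so only existence is checked here.
    if not Path(data_path).exists():
        problems.append(f"data_path does not exist: {data_path}")

    if Path(output_path).suffix != ".csv":
        problems.append(f"output_path should end in .csv, got: {output_path}")

    return problems
```

Calling it with the same three paths you intend to pass to the predict script and printing the returned list is enough; an empty list means none of these particular mistakes were found.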

The hanging could be due to not having a GPU, in which case it would be extremely slow to run. Otherwise, it may be an issue with the dataset. Could you set --n_workers 0 as an additional flag when running the model? This should allow the dataloader to print out any errors.

Example:

python -m instanovo.transformer.predict /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/SYSMHC00023_20141212_QEp7_MiBa_SA_HLA-I-p_MM16_1_B_convertion /data/xhuang/instanovo_yeast.pt --denovo --output_path /data/SYSTEMHC/SYSMHC00023/HLA-I-p_MM16/instanovo.csv --n_workers 0
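
To rule out the missing-GPU explanation, a quick check like the one below may help. It is a sketch that assumes PyTorch is installed in the same environment as InstaNovo; `describe_device` is a hypothetical helper name, and the check only reports what PyTorch can see, not how InstaNovo chooses its device.

```python
def describe_device() -> str:
    """Report whether PyTorch can see a CUDA GPU in this environment."""
    try:
        # Imported lazily so the function still answers when torch is absent.
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        return f"CUDA GPU visible: {torch.cuda.get_device_name(0)}"
    return "No CUDA GPU visible to PyTorch"


print(describe_device())
```

If no GPU is reported on a machine that has one, the usual suspects are a CPU-only PyTorch build or a driver/CUDA mismatch.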

Hi

I added --n_workers 0 to the command line and it works!

INFO:root:Data loaded, evaluating 100.0%, 59722 samples in total.
INFO:root:Knapsack path missing or not specified, generating...
INFO:root:Scaling masses.
INFO:root:Initializing chart.
INFO:root:Performing search.
25%|███████████▉ | 237/934 [54:56<3:31:51, 18.24s/it]

But it is a bit slow: as you can see in the logs, only 25% was done in 55 minutes, even though I have an NVIDIA RTX 3080 GPU in my Linux system. So, how can I speed up the process? And what is the meaning of --n_workers?

By the way, how do I use InstaNovo+? As I understand it, InstaNovo+ goes one step further than InstaNovo. Will you release them as two separate tools or integrate them? I think an easy-to-use pipeline would benefit proteomics research. This is just a suggestion.

Thanks again for your work and your kind reply! I wish InstaNovo great success!

Hi,

Setting --n_workers 0 changes the num_workers of the PyTorch dataloader. When set to 0, the dataloader does not use multiprocessing, which allows it to output error messages. It seems it was freezing when running with multiple workers. This should be fixed in version 0.1.6 of InstaNovo.
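
To illustrate the num_workers point in isolation, here is a minimal sketch using a toy dataset. `FlakyDataset` and `first_loader_error` are hypothetical names made up for this example, and nothing here reflects InstaNovo's actual data pipeline. With num_workers=0 the items are loaded in the main process, so an exception raised by the dataset propagates directly to the caller.

```python
class FlakyDataset:
    """Hypothetical map-style dataset in which one record is malformed."""

    def __init__(self, n_items=5, bad_index=2):
        self.n_items = n_items
        self.bad_index = bad_index

    def __len__(self):
        return self.n_items

    def __getitem__(self, idx):
        if idx == self.bad_index:
            raise ValueError(f"malformed record at index {idx}")
        return float(idx)


def first_loader_error(dataset, num_workers=0):
    """Iterate a DataLoader and return the first dataset error message, if any."""
    # Imported lazily so the dataset class above is usable without torch.
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=1, num_workers=num_workers)
    try:
        for _ in loader:
            pass
    except ValueError as exc:
        return str(exc)
    return None
```

With PyTorch installed, `first_loader_error(FlakyDataset())` should return the "malformed record" message straight from the main process, which is the behaviour the --n_workers 0 flag relies on for debugging.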

The runtime is mostly determined by the knapsack beam search and is often limited by CPU performance. We use knapsack to ensure that the mass of the predicted sequence matches that of the experimental precursor. You could turn this off in the code (there is currently no flag for this in the prediction script) and do plain beam search or greedy search, but the peptide accuracy would drop. If you want faster inference, you can optionally use only the InstaNovo+ model without conditioning on InstaNovo outputs. This will also reduce prediction performance slightly, but we still saw good performance on the yeast dataset.
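
For readers unfamiliar with the mass constraint, the sketch below shows the kind of check it enforces: a candidate peptide is only acceptable if its theoretical mass agrees with the precursor mass within a tolerance. This is not InstaNovo's knapsack implementation. The function names and the 20 ppm default tolerance are assumptions for illustration, and the residue table is a small subset of standard monoisotopic masses.

```python
# Monoisotopic residue masses in Da (standard values, subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056   # mass of H2O added to the residue sum for a full peptide
PROTON = 1.00728   # mass of a proton, used to convert m/z to neutral mass


def peptide_mass(sequence):
    """Theoretical neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def precursor_mass(mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge."""
    return (mz - PROTON) * charge


def matches_precursor(sequence, mz, charge, tol_ppm=20.0):
    """True if the peptide mass is within tol_ppm of the precursor mass.

    tol_ppm=20.0 is an assumed default for illustration only.
    """
    observed = precursor_mass(mz, charge)
    error_ppm = abs(peptide_mass(sequence) - observed) / observed * 1e6
    return error_ppm <= tol_ppm
```

Plain beam search or greedy search skips this kind of constraint during decoding, which is why it runs faster but can produce sequences whose mass does not match the precursor.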

For InstaNovo+, you can find the usage arguments by running this command (this will be added to the README soon):

python -m instanovo.diffusion.predict --help

Thanks for your interest in our work! I agree a pipeline would really help the end user. We intend to make an all-in-one pipeline that does the data conversion, InstaNovo, and InstaNovo+ all at once.