Oktoberfest crashes with Error message "could not find any target"
mhamaneh opened this issue · 14 comments
Hi
I am trying to use Oktoberfest for rescoring several datasets. In most cases Oktoberfest runs with no issues, but in some cases it crashes with the error message that it could not find any targets. Apparently, this happens when Oktoberfest tries to run LDA (see Rescoring_2024_08_07-16_08_12.log). Not sure if this is a bug or I am doing something wrong.
Could you please take a look and advise on how to avoid this?
Best regards,
Mehdi
Sorry, I just realized that this is a CID dataset, but I am using the HCD model; maybe that is why it crashes. But I have another example where Oktoberfest crashes with an unknown error (see Rescoring_2024_08_07-16_09_21.log).
Hi Mehdi,
I checked the log of your first run, and a few things are going on:
While there is an issue during the LDA fit for the retention time alignment, this is not the root cause. You are running oktoberfest 0.6.0, which still had a bug that only affects you if the LDA fit fails. However, it is actually the predictions in both of your runs that are causing trouble.
I.e. the oktoberfest run stops completely because erroneous messages were received from the prediction server. Technically, we have a retry mechanism in place in case the server fails to answer with the predictions, but of course that does not help if there is a general issue with the server. I have forwarded this issue to Ludwig, who maintains Koina, and asked him to check what is wrong.
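To illustrate why a retry mechanism helps with a temporary hiccup but not with a general server problem, here is a minimal sketch of retrying with exponential backoff. It is not Oktoberfest's actual implementation; `call_with_retries`, `max_attempts`, and `base_delay` are hypothetical names chosen for this example.

```python
import time


def call_with_retries(request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception as error:  # a real client would catch a narrower type
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed, e.g. because the server is down in general.
    raise RuntimeError(f"request failed after {max_attempts} attempts") from last_error
```

A request that fails twice and then succeeds returns normally; a request that fails on every attempt still ends in an exception, which is the situation described above.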
I also recommend updating to oktoberfest v0.6.2 and rerunning. If it was just a temporary server issue, it should resolve itself, and Oktoberfest should skip all successfully completed steps and continue with the ones that failed or are still missing. However, it seems that it was converting raw to mzML when it failed, which may have left a corrupted mzML file; in that case you need to either rerun completely or find the corrupted mzML in the spectra subfolder and delete it. You can also control which steps are repeated: the proc folder contains *.done files that record which steps are already finished, and deleting them forces those steps to run again. I only recommend this if your run includes a lot of raw files, though; otherwise it is probably easier to simply rerun from scratch.
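Deleting the *.done files can be scripted. The following is a minimal sketch, assuming the markers sit directly inside the proc folder; the helper name `clear_done_markers` and the `dry_run` switch are made up for this example and are not part of Oktoberfest.

```python
from pathlib import Path


def clear_done_markers(proc_dir, pattern="*.done", dry_run=True):
    """Return the marker files matching pattern; delete them unless dry_run."""
    markers = sorted(Path(proc_dir).glob(pattern))
    if not dry_run:
        for marker in markers:
            marker.unlink()
    return markers
```

Calling it with the default `dry_run=True` first only lists what would be removed. Narrowing `pattern` limits the deletion to the markers of the steps you want repeated.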
Please let me know if it runs through now. If not, and it fails in the same prediction step again, I would like to have a look at your data to reproduce the bug, if possible. In that case, please send me a mail for private communication.
Hi
Thanks for the suggestions. I installed the newest version (0.7), deleted the previous output folders to have clean runs, and re-ran oktoberfest for all datasets. This time I had problems with more datasets than before. The previous version crashed in a few cases, but most runs finished with no problems. With the new version installed, however, most runs crashed with the same error message seen in the second log file I posted here (see above). Since the error message seemed related to multiprocessing, I changed the "numThreads" parameter to 1, which worked in almost all cases but made oktoberfest very slow. Do you have any suggestions? Also, is there a way to bypass CE calibration? It seems to be very time consuming.
Best,
Mehdi
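For anyone who wants to apply the same single-thread workaround to many config files, here is a minimal sketch. It assumes the config is a JSON file with "numThreads" as a top-level key; the helper name `set_num_threads` is hypothetical.

```python
import json
from pathlib import Path


def set_num_threads(config_path, num_threads=1):
    """Rewrite the JSON config with numThreads set to num_threads."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["numThreads"] = num_threads  # assumed to be a top-level key
    path.write_text(json.dumps(config, indent=4))
    return config
```

All other keys in the file are written back unchanged, so the same call can restore a higher thread count later.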
Can you share your logfiles with me again? You can also send it via email if you feel more comfortable sharing it that way. This must be an issue on the Koina side. @LLautenbacher can you check the load and see if there are any issues at the moment? Also, it seems the error messages are tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unknown error occured.
Therefore, I cannot trace this to anything on the client side.
Concerning CE calibration, there is currently no way to bypass it. I would also not recommend doing that, because the CE that is reported in the raw files is not necessarily the one you would get the best predictions for. It also should not take very long, since only the top 1000 target PSMs per raw file are used for the calibration, not the entire dataset. If it takes very long, that is also likely related to issues on the server side.
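To show why the calibration input stays small regardless of dataset size, here is a minimal sketch of picking the top target PSMs per raw file. It assumes "top" means the highest search-engine score; the field names `raw_file`, `score`, and `is_decoy` are hypothetical, and this is not Oktoberfest's own code.

```python
from collections import defaultdict


def top_targets_per_file(psms, n=1000):
    """Keep the n best-scoring target PSMs for each raw file."""
    by_file = defaultdict(list)
    for psm in psms:
        if not psm["is_decoy"]:  # decoys are never used here
            by_file[psm["raw_file"]].append(psm)
    return {
        raw_file: sorted(group, key=lambda p: p["score"], reverse=True)[:n]
        for raw_file, group in by_file.items()
    }
```

Under these assumptions, at most `n` PSMs per raw file are sent for prediction during calibration, no matter how many PSMs the search produced.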
Unfortunately, I did not save the log files. I deleted the output folders to start new runs to make sure everything was OK. Like I said I managed to get results with 1 processor. Thank you very much for your responses.
Btw, CE calibration is only performed for HCD data. For CID data, we didn't see any difference when providing a different CE to the model, so as long as you have CID data and you predict with Prosit_2020_intensity_CID, it will skip the entire CE calibration step.
Hm, I would say unfortunately, you have to go with one thread then, one run after another until we find the issue on our side :( Please send us logfiles for future failed runs. That way we can compare against the log files of the server for the given timestamp of the error.
Sure, I will. Thanks again.
Hello again,
Today, when I tried to run oktoberfest I got the following error message:
tritonclient.utils.InferenceServerException: The public koina network seems to be inaccessible at the moment. Please notify ludwig.lautenbacher@tum.de.
Could you please take a look into this?
Thanks
Hi, thanks for reaching out. It seems that Oktoberfest by default is still using an old server URL that is connected to proteomicsdb.org. We have some hardware stability issues with that. You probably tried to use Koina in an in-between state. While we work on updating Oktoberfest's Koina integration, you can specify the current stable server URL (koina.wilhelmlab.org).
@LLautenbacher Judging by the log files, "koina.wilhelmlab.org:443" is explicitly set in the config file, and the default is set to koina.wilhelmlab.org. So this cannot be the issue.
However, I have already created a branch for using the new koinapy client, that should make predictions more stable.
@mhamaneh Can you share the logfile you used and what timezone your PC is using so I can track when exactly you noticed this error? According to our server-side logs, the last "downtime" was on the 9th of August.
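To line up a client-side failure with server logs, the local timestamp has to be converted to UTC. Here is a minimal sketch that takes the timestamp from a log file name of the form seen in this thread (Rescoring_YYYY_MM_DD-HH_MM_SS.log). It assumes the name encodes the local time of the PC that wrote it; the helper name and the UTC offset in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def log_name_to_utc(log_name, utc_offset_hours):
    """Parse 'Rescoring_YYYY_MM_DD-HH_MM_SS.log' as local time and convert to UTC."""
    stamp = log_name.removeprefix("Rescoring_").removesuffix(".log")
    local = datetime.strptime(stamp, "%Y_%m_%d-%H_%M_%S")
    # Attach the PC's UTC offset, then convert to UTC for comparison.
    local = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)
```

For example, `log_name_to_utc("Rescoring_2024_08_07-16_08_12.log", -4)` gives 2024-08-07 20:08:12 UTC, where the offset of -4 hours is made up for illustration. The string methods `removeprefix` and `removesuffix` require Python 3.9 or newer.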
Yesterday, I tried oktoberfest again and it worked. I deleted the whole old output folder to start a fresh run, and so I do not have the log file to share. However, I still have difficulty running oktoberfest with multiple processes. I have already shared a log file with the error message. Please see the second log file shared above.