Fallback mechanism needed if retention time alignment fails
Closed this issue · 3 comments
Describe the bug
If the input raw file does not contain enough correct PSMs, the retention time alignment fails with the following error:
...
2023-11-27 15:02:47,611 - INFO - spectrum_fundamentals.metrics.percolator::apply_lda_and_get_indices_below_fdr Found 0 targets and 6379 decoys as input for the LDA model
2023-11-27 15:02:47,671 - ERROR - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Could not find any targets below 0.07 out of 6183 targets in total
2023-11-27 15:02:47,671 - ERROR - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Could not find any targets below 0.07 out of 6183 targets in total
2023-11-27 15:02:47,684 - INFO - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Found 12 (out of 6183) targets below 0.08 FDR using spectral_angle as feature
2023-11-27 15:02:47,685 - INFO - spectrum_fundamentals.metrics.percolator::apply_lda_and_get_indices_below_fdr Found 12 targets and 6379 decoys as input for the LDA model
2023-11-27 15:02:47,761 - INFO - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Found 30 (out of 6183) targets below 0.08 FDR using lda_scores as feature
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/lib/python3.8/runpy.py:194 in _run_module_as_main │
│ │
│ 191 │ main_globals = sys.modules["__main__"].__dict__ │
│ 192 │ if alter_argv: │
│ 193 │ │ sys.argv[0] = mod_spec.origin │
│ ❱ 194 │ return _run_code(code, main_globals, None, │
│ 195 │ │ │ │ │ "__main__", mod_spec) │
│ 196 │
│ 197 def run_module(mod_name, init_globals=None, │
│ │
│ /usr/local/lib/python3.8/runpy.py:87 in _run_code │
│ │
│ 84 │ │ │ │ │ __loader__ = loader, │
│ 85 │ │ │ │ │ __package__ = pkg_name, │
│ 86 │ │ │ │ │ __spec__ = mod_spec) │
│ ❱ 87 │ exec(code, run_globals) │
│ 88 │ return run_globals │
│ 89 │
│ 90 def _run_module_code(code, init_globals=None, │
│ │
│ /root/oktoberfest/__main__.py:37 in <module> │
│ │
│ 34 │
│ 35 if __name__ == "__main__": │
│ 36 │ traceback.install() │
│ ❱ 37 │ main() # pragma: no cover │
│ 38 │
│ │
│ /root/oktoberfest/__main__.py:32 in main │
│ │
│ 29 def main(): │
│ 30 │ """Execution of oktoberfest from terminal.""" │
│ 31 │ args = _parse_args() │
│ ❱ 32 │ runner.run_job(args.config_path) │
│ 33 │
│ 34 │
│ 35 if __name__ == "__main__": │
│ │
│ /root/oktoberfest/runner.py:474 in run_job │
│ │
│ 471 │ │ elif job_type == "CollisionEnergyCalibration": │
│ 472 │ │ │ run_ce_calibration(config_path) │
│ 473 │ │ elif job_type == "Rescoring": │
│ ❱ 474 │ │ │ run_rescoring(config_path) │
│ 475 │ │ else: │
│ 476 │ │ │ raise ValueError(f"Unknown job_type in config: {job_type}") │
│ 477 │ finally: │
│ │
│ /root/oktoberfest/runner.py:407 in run_rescoring │
│ │
│ 404 │ │ processing_pool.check_pool() │
│ 405 │ else: │
│ 406 │ │ for spectra_file in spectra_files: │
│ ❱ 407 │ │ │ _calculate_features(spectra_file, config) │
│ 408 │ │
│ 409 │ # prepare rescoring │
│ 410 │
│ │
│ /root/oktoberfest/runner.py:339 in _calculate_features │
│ │
│ 336 │ │ all_features=config.all_features, │
│ 337 │ │ regression_method=config.curve_fitting_method, │
│ 338 │ ) │
│ ❱ 339 │ re.generate_features( │
│ 340 │ │ library=library, │
│ 341 │ │ search_type="rescore", │
│ 342 │ │ output_file=fdr_dir / spectra_file.with_suffix(".rescore.tab").name, │
│ │
│ /root/oktoberfest/rescore/rescore.py:44 in generate_features │
│ │
│ 41 │ │ all_features_flag=all_features, │
│ 42 │ │ regression_method=regression_method, │
│ 43 │ ) │
│ ❱ 44 │ perc_features.calc() │
│ 45 │ perc_features.write_to_file(str(output_file)) │
│ 46 │
│ 47 │
│ │
│ /usr/local/lib/python3.8/site-packages/spectrum_fundamentals/metrics/percolator.py:417 in calc │
│ │
│ 414 │ │ │ │ ) │
│ 415 │ │ │ │
│ 416 │ │ │ file_sample = self.metadata.iloc[sampled_idxs].sort_values("PREDICTED_IRT") │
│ ❱ 417 │ │ │ aligned_predicted_rts = Percolator.get_aligned_predicted_retention_times( │
│ 418 │ │ │ │ file_sample["RETENTION_TIME"], │
│ 419 │ │ │ │ file_sample["PREDICTED_IRT"], │
│ 420 │ │ │ │ self.metadata["PREDICTED_IRT"], │
│ │
│ /usr/local/lib/python3.8/site-packages/spectrum_fundamentals/metrics/percolator.py:127 in │
│ get_aligned_predicted_retention_times │
│ │
│ 124 │ │ median_abs_error = 1.0 │
│ 125 │ │ │
│ 126 │ │ while discard_percentage < 50.0 and median_abs_error > 0.02: │
│ ❱ 127 │ │ │ params = fit_func(predicted_rts, observed_rts) │
│ 128 │ │ │ aligned_rts_predicted = params[0] │
│ 129 │ │ │ │
│ 130 │ │ │ abs_errors = np.abs(aligned_rts_predicted - observed_rts) │
│ │
│ /usr/local/lib/python3.8/site-packages/spectrum_fundamentals/metrics/percolator.py:466 in │
│ <lambda> │
│ │
│ 463 │ elif curve_fitting_method == "lowess": │
│ 464 │ │ return lambda x, y: (lowess.lowess_fit_and_predict(x, y, frac=0.5),) │
│ 465 │ elif curve_fitting_method == "spline": │
│ ❱ 466 │ │ return lambda x, y: spline(2, x, y) │
│ 467 │ else: │
│ 468 │ │ raise ValueError("curve_fitting_method should be one of the following strings: l │
│ 469 │
│ │
│ /usr/local/lib/python3.8/site-packages/spectrum_fundamentals/metrics/percolator.py:475 in spline │
│ │
│ 472 │ """Calculates spline fitting.""" │
│ 473 │ x_new = np.linspace(0, 1, knots + 2)[1:-1] │
│ 474 │ q_knots = np.quantile(x, x_new) │
│ ❱ 475 │ t, c, k = interpolate.splrep(x, y, t=q_knots, s=2) │
│ 476 │ yfit = interpolate.BSpline(t, c, k)(x) │
│ 477 │ return yfit, t, c, k │
│ 478 │
│ │
│ /usr/local/lib/python3.8/site-packages/scipy/interpolate/_fitpack_py.py:288 in splrep │
│ │
│ 285 │ >>> plt.show() │
│ 286 │ │
│ 287 │ """ │
│ ❱ 288 │ res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet) │
│ 289 │ return res │
│ 290 │
│ 291 │
│ │
│ /usr/local/lib/python3.8/site-packages/scipy/interpolate/_fitpack_impl.py:508 in splrep │
│ │
│ 505 │ │ │ warnings.warn(RuntimeWarning(_iermess[ier][0])) │
│ 506 │ │ else: │
│ 507 │ │ │ try: │
│ ❱ 508 │ │ │ │ raise _iermess[ier][1](_iermess[ier][0]) │
│ 509 │ │ │ except KeyError as e: │
│ 510 │ │ │ │ raise _iermess['unknown'][1](_iermess['unknown'][0]) from e │
│ 511 │ if full_output: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Error on input data
To Reproduce
Steps to reproduce the behavior:
- Run oktoberfest rescoring on an msms.txt file without correct PSMs
Expected behavior
The retention time alignment should fallback on a simple retention time alignment (e.g. linear regression) and run through to the end of the pipeline.
System [please complete the following information]:
- OS: Docker image
- Language Version: Python 3.8.12
- Virtual environment: Docker image
agree, I noticed this already. How about printing a warning instead and generally reduce the log output a bit at the info level and move the rest to debug level. Also this error log due to not found any PSMs below FDR cutoff should maybe be just a warning, then fallback to linear?
Yes, agreed on reducing the log levels in all places.
We probably want to keep the iterative increase of the FDR cutoff, but we should find out a good way to go to the linear fallback instead of running into the ValueError. Probably we need to check for a minimum number of input nodes to the regression?
I have investigate this issue more and it seems this happens when the [.33,.66] quantiles, which we use as knots for the spline fit are the actual outer points. I.e. spline only wants the two inner knots, then it adds the two outer knots to get three segments, for example [0, 30, 60, 100], where 30, 60 would be the calculated points for the quantiles given. Now if there are many predicted irts at a low point, that make up actually more than 33% of the values, the calculated lower knot is the same as the outer lower knot, you get sth. like [0,0,60,100]. This is not allowed!
I am currently testing a fix that adds a small epsilon 1e-6 to the lower knot, and removes that from the upper knot, this should prevent this. I tested it and while it takes a few more iterations to remove enough percentiles for the bottom to get resolved, the fit eventually gets better. Here you see an example, where the predicted irt is on the x-axis (this is how we fit, i.e. aligned-observed RT as a function of pred irt):
You can see that the knot on the bottom leads to a short segment, which then leads to a poor fit, however, a few iterations later and it is resolved by removing more and more dots using an increasing percentile that is removed at both ends:
This is pretty stable, as it removes points at either end until either 50% of points are left or the MAE is < 0.2.