Setting `fit_peaks` prominence below 0.2 results in upper bounds with lower value than lower bounds.
Closed this issue · 16 comments
I am testing the application of `fit_peaks` to my dataset, whose profile you can see below:
The behavior is as expected unless I set the peak prominence below 0.2, at which point I receive an error from `least_squares` informing me that at least one upper bound is less than the corresponding lower bound.
The error is below:
```
Traceback (most recent call last):
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 125, in <module>
    main()
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 122, in main
    HPLCPy(data)
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 103, in __init__
    peaks = chm.fit_peaks(
            ^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 801, in fit_peaks
    peak_props = self.deconvolve_peaks(verbose=verbose,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 671, in deconvolve_peaks
    popt, _ = scipy.optimize.curve_fit(self._fit_skewnorms, v['time_range'],
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_minpack_py.py", line 974, in curve_fit
    res = least_squares(func, p0, jac=jac, bounds=bounds, method=method,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_lsq/least_squares.py", line 814, in least_squares
    raise ValueError("Each lower bound must be strictly less than each "
ValueError: Each lower bound must be strictly less than each upper bound.
```
The bounds in question are below, where 'lb' is the lower bound and 'ub' is the upper bound:

| | lb | ub |
|---|---|---|
| 0 | 0.000378979 | 0.0378979 |
| 1 | 0 | 0 |
| 2 | 0.0004896 | 0 |
| 3 | -inf | inf |
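For reference, the error itself can be reproduced directly: `scipy.optimize.least_squares` validates the bounds before doing any fitting, and rows 1 and 2 above violate the requirement that each lower bound be strictly less than its upper bound. A minimal sketch using the tabulated bounds:

```python
import numpy as np
import scipy.optimize

# Bounds copied from the table above: rows 1 and 2 violate lb < ub.
lb = np.array([0.000378979, 0.0, 0.0004896, -np.inf])
ub = np.array([0.0378979, 0.0, 0.0, np.inf])

try:
    # A trivial residual function; the invalid bounds trigger the
    # ValueError before any optimisation takes place.
    scipy.optimize.least_squares(lambda p: p - 1.0, x0=np.zeros(4), bounds=(lb, ub))
except ValueError as err:
    message = str(err)
    print(message)  # Each lower bound must be strictly less than each upper bound.
```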
As I understand it, the prominence is normalized, i.e. a percentage of the signal maximum used as a threshold for inclusion of a given peak. For the same dataset I've been using a value of 0.05 to ignore the smallest peaks with no problem.
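If the prominence is indeed interpreted as a fraction of the signal maximum (my assumption about how `hplc-py` forwards it to `scipy.signal.find_peaks`), the filtering would behave like this sketch:

```python
import numpy as np
import scipy.signal

# Synthetic trace: one tall peak and one peak at ~10% of its height.
x = np.linspace(0, 10, 1000)
signal = np.exp(-((x - 3) ** 2) / 0.05) + 0.1 * np.exp(-((x - 7) ** 2) / 0.05)

# Assumption: a relative prominence of 0.2 means 20% of the signal maximum.
rel_prominence = 0.2
peaks, _ = scipy.signal.find_peaks(signal, prominence=rel_prominence * signal.max())
print(len(peaks))  # 1 -- only the tall peak survives this threshold
```

At a relative prominence of 0.05, as in the usage described above, both peaks would survive the threshold.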
Never mind; as always, I should have checked my input data more thoroughly. I was inputting an augmented signal of stacked samples, and I suspect the repeating nature of the signal column was causing the issue.
Thanks for looking into that @jonathanstathakis. Please let me know how well `hplc-py` works on your data!
Will do. I've made a fork, and I've had to make a few modifications to achieve my goal. Would you rather I make a pull request with my modifications, or raise them as issues?
Feel free to raise issues here -- if you think a PR is worthwhile (i.e. makes a substantial & general contribution to the functioning of `hplc-py`), go for it.
From my reading of your initial issue, are you trying to set bounds on peak prominence? If so, that's not quite how `deconvolve_peaks` works. It applies bounds on the amplitude, location, scale, and skew (in that order). Given the chromatogram you showed, it seems like the amplitude bounds applied are not consistent with the scale of the data.
Are you passing the bounds you provided in the table as `param_bounds` in `fit_peaks`?
Thank you, will do!
Not exactly; I was merely calling `fit_peaks`, and the bounds I presented earlier were those generated in `deconvolve_peaks` prior to being passed to `scipy.optimize.curve_fit`. I believe I had prominence set quite high to speed up runtime at that point. I was also alternately receiving either the bound error above or an error stating that `p0` was out of bounds. But again, to generate this error I was passing an array of signals stacked end to end, and on revisiting I am unable to recreate it.
What I am trying to achieve is a means of joining the peak table and the raw signal to get the intensities of the identified peaks for downstream analysis. The time precision calculation results in retention time values with two significant figures, which is too coarse for me to use time as the join key. I've removed rounding throughout the process, but the estimated time still does not appear exactly within the original signal. Would you be able to suggest a solution to this predicament? I am no expert in the methods used by your wonderful package, so I'm learning as I go.
Got it, thanks. That's interesting that the periodic signal messed things up. I'd like to figure out why.
The time precision comes from the sampling interval of your raw chromatogram (see lines 83-84 here: https://github.com/cremerlab/hplc-py/blob/8ce5223f2c1a4accbcafb3f883b8a692fac94337/hplc/quant.py#L83C13-L83C13). The chromatogram in your initial post looks like the time column is ordinal with the index (so your precision would be integer). Do you have a raw time for the trace, or just the indices? Either way, this could be resolved through a PR where the number of significant figures is user supplied.
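One plausible way to derive a decimal precision from the sampling interval (a sketch of the idea, not necessarily the exact computation in `quant.py`):

```python
import numpy as np

# Hypothetical time axis: sampled every 2 s, expressed in minutes.
time = np.arange(0, 5, 2 / 60)

# Infer a rounding precision from the mean sampling interval:
# dt ~ 0.0333 min needs two decimal places to distinguish samples.
dt = np.mean(np.diff(time))
precision = int(np.abs(np.floor(np.log10(dt))))
print(precision)  # 2
```

This would explain retention times carrying only two figures when the trace is sampled every couple of seconds.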
Also, note that when you run `.fit_peaks`, the resulting peak table should give you the parameters for each detected peak in the chromatogram. What would you need for your downstream analysis? If you can give more context there, I'll be able to provide more help.
As there's not an open issue here, I'm going to close this (for now), but I'm happy to keep the thread going.
Thanks for keeping this going.
TBH, after digging into your code I'm not sure why the bug happened either; further testing showed that the peak finding was simply restricted to the sample with the highest prominence, and business proceeded as usual. I'll try to recreate it somewhere down the track.
So the overall goal is to join the peak table with the source signal dataframe, primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table. From there I am looking to automate variance analyses between sample classes, and multiclass ML modeling.
The simplest solution for me would be to output the index of each peak corresponding to its location in the input signal, alongside the retention time. Is that a reasonable expectation, or does `deconvolve_peaks` do some trickery I haven't anticipated?
The displayed signal was a quick visualisation; the actual input has a time axis sampled every 2 seconds. As you might be able to tell, I've already performed smoothing and ASLS baseline subtraction.
Can you elaborate on what you mean by this?

> primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.
The peak table reports the best-fit values of the signals whose mixture reconstitutes the observed chromatogram. If you know what compounds correspond to what retention times, you can add that as a dictionary in the `map_peaks` method of the `Chromatogram` object. That would give you labels. Sorry if I'm not understanding something obvious here.
If you've already done ASLS background subtraction, make sure you set `correct_baseline=False` when calling `fit_peaks`. This package uses SNIP baseline correction by default.
What was your rationale for using ASLS? If interested, that could be included as an alternative baseline correction algorithm.
> Can you elaborate on what you mean by this?
>
> > primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.
For the first part, I intend to compare the results of an external peak profiling software against the results of `fit_peaks` for external validation. For the second, I mean to use the `retention_time` columns of both the original signal and the peak table output by `fit_peaks` to join the two tables together, i.e. a SQL join. In that context the `retention_time` column acts as a primary key where observation time is a unique identifier. The result would be a table with the raw signal in one column and all the columns from the peak table stretched out to form sparse columns, with values where the peak maxima were in the original signal and empty values everywhere else.
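The sparse join described above can be sketched in pandas (the column names here are hypothetical stand-ins for the real tables):

```python
import pandas as pd

# Hypothetical raw signal and peak table sharing a retention_time key.
signal = pd.DataFrame({
    "retention_time": [0.00, 0.05, 0.10, 0.15, 0.20],
    "intensity": [1.0, 5.0, 20.0, 6.0, 1.5],
})
peaks = pd.DataFrame({
    "retention_time": [0.10],
    "peak_idx": [1],
    "amplitude": [20.0],
})

# Left join: peak columns are populated only at the peak maximum and
# NaN (empty) everywhere else -- the sparse layout described above.
joined = signal.merge(peaks, on="retention_time", how="left")
print(int(joined["peak_idx"].notna().sum()))  # 1
```

When the fitted retention time does not land exactly on a sampled time point, `pandas.merge_asof` with a `tolerance` is a common way to join on the nearest time instead of requiring exact key equality.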
> If you've already done ASLS background subtraction, make sure you set `correct_baseline=False` when calling `fit_peaks`
Done
> What was your rationale for using ASLS?
Nothing beyond the fact that a fair amount of the literature I have seen uses Whittaker baselines for chromatographic signals, and it performed well enough without over- or underfitting. I used pybaselines, which provides more intricate approaches, but I thought it best to keep things simple if the result was acceptable. I am no expert in signal correction, though, and I had not heard of SNIP before.
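For context, ASLS is a Whittaker-type smoother with asymmetric weights. A minimal sketch of the idea (following the Eilers & Boelens formulation; this is not the `pybaselines` implementation):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline.

    lam: smoothness penalty on the second difference of the baseline.
    p: asymmetry; points above the baseline get weight p, below get 1 - p.
    """
    n = len(y)
    # Second-difference operator for the Whittaker smoothness penalty.
    D = sparse.diags([1, -2, 1], [0, 1, 2], shape=(n - 2, n))
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + lam * D.T @ D).tocsc(), w * y)
        # Re-weight: points above the baseline (peaks) are down-weighted.
        w = np.where(y > z, p, 1 - p)
    return z

# Synthetic example: a peak riding on a sloping baseline.
x = np.linspace(0, 10, 500)
y = 0.5 * x + 10 * np.exp(-((x - 5) ** 2) / 0.1)
corrected = y - asls_baseline(y)
```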
I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible at all to expect the raw signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?
> I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible at all to expect the raw signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?
Never mind, I just discovered the very thorough "How it Works" pages! I see now that the peak time does change after deconvolution, or at least has the capacity to, and therefore expecting the peak times to be the same is incorrect.
I am still wondering if it's practical to expose the peak indices from the first `find_peaks` call to provide a link between the raw signal and the fitted peaks; call it `orig_peak_idx` or something.
Cool! Glad the "How it Works" was useful. You should be able to get the peak indices from a private attribute, `chrom._peak_indices`. That returns the indices of each peak on the cropped time domain.
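Those indices index directly into the cropped time and signal arrays, so mapping detected peaks back to the raw trace is straightforward. A sketch using `scipy.signal.find_peaks`, which as I understand it is where those indices ultimately come from:

```python
import numpy as np
import scipy.signal

# Synthetic trace with a single peak at t = 4.
time = np.arange(0, 10, 0.02)
trace = np.exp(-((time - 4) ** 2) / 0.1)

peak_indices, _ = scipy.signal.find_peaks(trace, prominence=0.1)
peak_times = time[peak_indices]     # retention times of detected maxima
peak_heights = trace[peak_indices]  # raw intensities at those maxima
```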
Also, if it is helpful, I have a preprint outlining the major functionality of this software. The short answer is that with skew-normal distributions (which is what `hplc-py` is fitting under the hood), the retention time (location parameter) does not always overlap with the peak signal maximum; that depends on the skew parameter. This plot shows an amplitude-weighted skew-normal distribution with the same location and scale parameters but varying skew parameters (alpha). Note that the peak location shifts away from 0 under high skewness.
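The mode shift is easy to check numerically with `scipy.stats.skewnorm` (a sketch with loc = 0, scale = 1):

```python
import numpy as np
from scipy.stats import skewnorm

x = np.linspace(-3, 3, 6001)
loc, scale = 0.0, 1.0

# Locate the mode (pdf argmax on a fine grid) for two skew values.
modes = {}
for alpha in (0, 5):
    modes[alpha] = x[np.argmax(skewnorm.pdf(x, alpha, loc=loc, scale=scale))]
print(round(modes[0], 6))  # 0.0 -- with no skew, the mode sits at loc
# With alpha = 5 the mode shifts noticeably away from loc = 0.
```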
Thank you for linking the preprint, much appreciated.
Unfortunately, I have encountered issues with the default bounds again; they appear to be peak-specific problems. The first issue was an initial guess for width lower than the lower bound (which means the initial guess is a width smaller than the time step). I rectified that by setting the lower bound to be 10% less than the time step, but then I encountered an error with a later peak whose initial guess for time range was larger than the upper bound for that same parameter.
I've been troubleshooting this for a while now and getting nowhere. Do you have any recommendations on how to proceed from here?
As a follow up, here is more information about my current settings, and the window and peak in question:
`fit_peaks` parameters

| | 0 |
|---|---|
| tbl_name | fit_peak_input_params |
| tolerance | 0.5 |
| prominence | 0.01 |
| rel_height | 1 |
| approx_peak_width | 0.7 |
| buffer | 1 |
| max_iter | 1000000 |
| precision | 9 |
Window Parameters
| | tbl_name | window_idx | num_peaks | signal_area |
|---|---|---|---|---|
| 0 | window_info | 2 | 4 | 2735.48 |
Window Peak Fitting Parameters
| | tbl_name | window_idx | peak_idx | location | amplitude | width | params | lb | guess | ub | delta_lb | delta_ub | is_oob |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | window_peak_fit_info | 2 | 1 | 2.6 | 327.845 | 0.157637 | amplitude | 32.7845 | 327.845 | 3278.45 | 295.061 | 2950.61 | 0 |
| 1 | window_peak_fit_info | 2 | 1 | 2.6 | 327.845 | 0.157637 | location | 1.9 | 2.6 | 4.43333 | 0.7 | 1.83333 | 0 |
| 2 | window_peak_fit_info | 2 | 1 | 2.6 | 327.845 | 0.157637 | scale | 0.0333333 | 0.0788184 | 1.26667 | 0.0454851 | 1.18785 | 0 |
| 3 | window_peak_fit_info | 2 | 1 | 2.6 | 327.845 | 0.157637 | skew | -inf | 0 | inf | inf | inf | 0 |
| 4 | window_peak_fit_info | 2 | 2 | 2.76667 | 165.211 | 0.0602344 | amplitude | 16.5211 | 165.211 | 1652.11 | 148.69 | 1486.9 | 0 |
| 5 | window_peak_fit_info | 2 | 2 | 2.76667 | 165.211 | 0.0602344 | location | 1.9 | 2.76667 | 4.43333 | 0.866667 | 1.66667 | 0 |
| 6 | window_peak_fit_info | 2 | 2 | 2.76667 | 165.211 | 0.0602344 | scale | 0.0333333 | 0.0301172 | 1.26667 | -0.00321612 | 1.23655 | 1 |
| 7 | window_peak_fit_info | 2 | 2 | 2.76667 | 165.211 | 0.0602344 | skew | -inf | 0 | inf | inf | inf | 0 |
| 8 | window_peak_fit_info | 2 | 3 | 3.3 | 12.5149 | 0.169807 | amplitude | 1.25149 | 12.5149 | 125.149 | 11.2635 | 112.635 | 0 |
| 9 | window_peak_fit_info | 2 | 3 | 3.3 | 12.5149 | 0.169807 | location | 1.9 | 3.3 | 4.43333 | 1.4 | 1.13333 | 0 |
| 10 | window_peak_fit_info | 2 | 3 | 3.3 | 12.5149 | 0.169807 | scale | 0.0333333 | 0.0849036 | 1.26667 | 0.0515703 | 1.18176 | 0 |
| 11 | window_peak_fit_info | 2 | 3 | 3.3 | 12.5149 | 0.169807 | skew | -inf | 0 | inf | inf | inf | 0 |
| 12 | window_peak_fit_info | 2 | 4 | 3.8 | 23.4608 | 0.235757 | amplitude | 2.34608 | 23.4608 | 234.608 | 21.1147 | 211.147 | 0 |
| 13 | window_peak_fit_info | 2 | 4 | 3.8 | 23.4608 | 0.235757 | location | 1.9 | 3.8 | 4.43333 | 1.9 | 0.633333 | 0 |
| 14 | window_peak_fit_info | 2 | 4 | 3.8 | 23.4608 | 0.235757 | scale | 0.0333333 | 0.117879 | 1.26667 | 0.0845453 | 1.14879 | 0 |
| 15 | window_peak_fit_info | 2 | 4 | 3.8 | 23.4608 | 0.235757 | skew | -inf | 0 | inf | inf | inf | 0 |
Error Peak
Focusing on the erroneous peak:
| | tbl_name | window_idx | peak_idx | location | amplitude | width | params | lb | guess | ub | delta_lb | delta_ub | is_oob |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | window_peak_fit_info | 2 | 2 | 2.76667 | 165.211 | 0.0602344 | scale | 0.0333333 | 0.0301172 | 1.26667 | -0.00321612 | 1.23655 | 1 |
As I said before, the initial guess for the peak's scale is less than the lower bound. The following is a visualisation of the window with the peaks and calculated widths displayed:
So it's fairly evident from the display that `scipy.signal.peak_widths` is not accurately gauging the width of peak 2, causing the guess to fall below the lower bound. I'll continue my investigation there, but I hope this information is useful to you. I would be keen to solve this so I can move on with my analysis.
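For what it's worth, the tabulated numbers are consistent with the scale guess being half the measured peak width while the lower bound is the sampling interval (2 s = 0.0333 min), so an undersized `peak_widths` estimate immediately violates the bound. A sketch of that check using the values from the table (the half-width relationship is my inference from the numbers, not confirmed from the source):

```python
# Values taken from the peak 2 rows of the table above.
dt = 2 / 60              # sampling interval: 2 s expressed in minutes
width = 0.0602344        # width reported for peak 2
scale_guess = width / 2  # matches the tabulated guess of 0.0301172
scale_lb = dt            # matches the tabulated lower bound of 0.0333333

# The guess falls below the lower bound, reproducing is_oob = 1.
print(scale_guess < scale_lb)  # True
```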