Should input data be pre- or post- log-transformation?
Closed this issue · 4 comments
Given a data set with large numbers, GPfit runs into numerical overflow issues with exp().
x = [1200, 13000, 15000, 16000, 17000, 18000, 19000, 30000, 32000, 34000]
y = [325000, 250000, 750000, 2E6, 7E6, 750000, 8E6, 6E6, 2E6, 13E6]
Gives results like:
/Users/philippekirschen/Documents/MIT/Research/GPfit/gpfit/gpfit/fit.py:127: RuntimeWarning: overflow encountered in exp
w_SMA = exp(y_SMA)
/Users/philippekirschen/Documents/MIT/Research/GPfit/gpfit/gpfit/fit.py:130: RuntimeWarning: overflow encountered in exp
w = (exp(ydata)).T[0]
w**0.1 = 0 * (u_1)**34.9
+ inf * (u_1)**-5.28
+ 0 * (u_1)**198
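The overflow itself is unsurprising: a float64 tops out around 1.8e308, so exp() returns inf for arguments above roughly 709, and these data are in the tens of thousands. A quick check (NumPy here only for illustration):

```python
import numpy as np

np.exp(709.0)    # ~8.2e307, still representable as float64
np.exp(710.0)    # inf, with "RuntimeWarning: overflow encountered in exp"
np.exp(15000.0)  # inf -- the magnitude of the data above
```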
Wondering if anything clever can be done here.
In the context of GP fitting, x and y are in logspace -- they're the logs of the (positive) engineering quantities. Do you really have a quantity whose log is 15000? Or are you using GPfit for something other than GP fitting?
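In other words, the data from this issue would need to be log-transformed before being passed in. A minimal sketch (assuming the raw values above and that fit() is the routine in gpfit/fit.py; the exact argument names and order may differ between GPfit versions):

```python
import numpy as np
from gpfit.fit import fit

# raw (positive) engineering quantities from the example above
u = np.array([1200, 13000, 15000, 16000, 17000, 18000, 19000, 30000, 32000, 34000])
w = np.array([325000, 250000, 750000, 2e6, 7e6, 750000, 8e6, 6e6, 2e6, 13e6])

# GPfit expects log-space data, so take logs first
x = np.log(u)
y = np.log(w)

K = 2  # number of terms in the fitted function
fit(x, y, K, "SMA")  # hypothetical call -- check the signature of your GPfit version
```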
Wow... this was really careless. I was so convinced that GPfit was log-transforming the data for me... why isn't GPfit transforming the data for me? I feel like it should. Do you agree?
It's a design decision that's certainly still open for discussion.
The core fitting algorithms work on log-transformed data, so it might make sense to leave them as is, and create a trivial user-facing wrapper that does the input and output log transforms.
The core fitting routines could be called "lsefit" (or isma, sma, etc.), with all references to "GP" fitting living in the wrapper.
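A minimal sketch of what that wrapper might look like (the names gpfit and core_fit are illustrative placeholders, not the actual GPfit API; the existing log-space routine, whatever it ends up being called, is passed in as core_fit):

```python
import numpy as np

def gpfit(u, w, K, core_fit, ftype="SMA"):
    """User-facing wrapper: accept raw positive data, do the log transforms,
    and hand off to the existing log-space fitting routine."""
    x = np.log(np.asarray(u, dtype=float))  # inputs to log space
    y = np.log(np.asarray(w, dtype=float))  # outputs to log space
    return core_fit(x, y, K, ftype)         # core fit works entirely in log space
```

The output side of the transform (exponentiating the fitted function when it's evaluated on raw data) would live in the same wrapper or in whatever object the core fit returns.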