convexengineering/gpfit

Should input data be pre- or post- log-transformation?

Closed this issue · 4 comments

Given a data set with large numbers, GPfit runs into numerical overflow issues with exp().

x = [1200,
     13000,
     15000,
     16000,
     17000,
     18000,
     19000,
     30000,
     32000,
     34000]

y = [325000,
     250000,
     750000,
     2E6,
     7E6,
     750000,
     8E6,
     6E6,
     2E6,
     13E6]

Gives results like:

/Users/philippekirschen/Documents/MIT/Research/GPfit/gpfit/gpfit/fit.py:127: RuntimeWarning: overflow encountered in exp
  w_SMA = exp(y_SMA)
/Users/philippekirschen/Documents/MIT/Research/GPfit/gpfit/gpfit/fit.py:130: RuntimeWarning: overflow encountered in exp
  w = (exp(ydata)).T[0]


w**0.1 = 0 * (u_1)**34.9
    + inf * (u_1)**-5.28
    + 0 * (u_1)**198

Wondering if anything clever can be done here.

In the context of GP fitting, x and y are in log space: they are the logs of the (positive) engineering quantities. Do you really have a quantity whose log is 15000? Or are you using GPfit for something other than GP fitting?
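To make the point concrete, here is a minimal sketch using only NumPy (no GPfit calls). The names `u` and `w` for the raw, untransformed quantities are my own, chosen to match the fit output above. It shows why raw values overflow when treated as logs, and what the transformed inputs look like.

```python
import numpy as np

# Raw (positive) engineering data from the report above.
u = np.array([1200, 13000, 15000, 16000, 17000,
              18000, 19000, 30000, 32000, 34000], dtype=float)
w = np.array([325000, 250000, 750000, 2e6, 7e6,
              750000, 8e6, 6e6, 2e6, 13e6])

# Treating raw values as logs means exp() is applied to numbers like 15000.
# Double precision overflows for arguments above roughly 709, so this is inf.
with np.errstate(over="ignore"):
    overflowed = np.exp(u)

# Log-transform first: these are the x and y the fitting code expects.
x = np.log(u)
y = np.log(w)

# exp() of the transformed data round-trips to the original values.
recovered = np.exp(y)
```

After the transform, every entry of `x` is below 11, well inside the range where `exp()` is safe.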

Wow, that was careless of me. I was convinced that GPfit was log-transforming the data for me. Why doesn't it? I feel like it should. Do you agree?

It's a design decision that's certainly still open for discussion.

The core fitting algorithms work on log-transformed data, so it might make sense to leave them as is, and create a trivial user-facing wrapper that does the input and output log transforms.

The core fitting routines could be called "lsefit" (or isma, sma, etc.), and all references to "GP" fitting could live in the wrapper.
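A rough sketch of what such a wrapper might look like is below. Everything here is hypothetical: `fit_in_log_space` is an invented name, and `loglinear_fit` is only a stand-in for a core routine (a plain least-squares line in log space, i.e. a single monomial), not GPfit's actual algorithm or signature.

```python
import numpy as np

def fit_in_log_space(core_fit, u, w):
    """Hypothetical wrapper: log-transform positive data, then call core_fit."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    # Logs are only defined for strictly positive quantities.
    if np.any(u <= 0) or np.any(w <= 0):
        raise ValueError("GP fitting requires strictly positive data")
    return core_fit(np.log(u), np.log(w))

def loglinear_fit(x, y):
    """Stand-in core fit (assumption): least-squares line y = a*x + b."""
    a, b = np.polyfit(x, y, 1)
    return a, b

# Usage: data generated from the monomial w = 3 * u**2.
u = np.array([1.0, 2.0, 4.0, 8.0])
w = 3.0 * u**2
a, b = fit_in_log_space(loglinear_fit, u, w)
c = np.exp(b)  # output transform: back from log space to the coefficient
```

Passing the core routine in as an argument keeps the log-space algorithms untouched, which is the separation suggested above; the wrapper only owns the input check and the transforms.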