SteffenMoritz/imputeTS

Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, : L-BFGS-B needs finite values of 'fn'

englianhu opened this issue · 23 comments

> data_tm1
# A tibble: 1,151,978 x 15
   index               BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose  year  week
   <dttm>                <dbl>   <dbl>  <dbl>    <dbl>   <dbl>   <dbl>  <dbl>    <dbl> <dbl> <dbl>
 1 2014-12-29 00:01:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 2 2014-12-29 00:02:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 3 2014-12-29 00:03:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 4 2014-12-29 00:04:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 5 2014-12-29 00:05:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 6 2014-12-29 00:06:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 7 2014-12-29 00:07:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 8 2014-12-29 00:08:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
 9 2014-12-29 00:09:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
10 2014-12-29 00:10:00    120.    120.   120.     120.    120.    120.   120.     120.  2015    53
# ... with 1,151,968 more rows, and 4 more variables: bias.open <dbl>, bias.high <dbl>,
#   bias.low <dbl>, bias.close <dbl>
> data_tm1_NA <- data_tm1 %>% 
+   dplyr::select(BidOpen, BidHigh, BidLow, BidClose, 
+                 AskOpen, AskHigh, AskLow,  AskClose) %>% 
+   prodNA(noNA = 0.01) %>% 
+   cbind(data_tm1[1], .) %>% tbl_df
> 
> data_tm1_1_tidyr <- data_tm1_NA %>% 
+   fill(BidOpen, BidHigh, BidLow, BidClose, 
+        AskOpen, AskHigh, AskLow, AskClose) %>% #default direction down
+   fill(BidOpen, BidHigh, BidLow, BidClose, 
+        AskOpen, AskHigh, AskLow, AskClose, .direction = 'up')
> 
> data_tm1_1_tidyr %>% anyNA
[1] FALSE
> 
> data_tm1_1_tidyr %<>% mutate(
+   bias.open = if_else(AskOpen>AskHigh|AskOpen<AskLow, 1, 0), 
+   bias.high = if_else(AskHigh<AskOpen|AskHigh<AskLow|AskHigh<AskClose, 1, 0), 
+   bias.low = if_else(AskLow>AskOpen|AskLow>AskHigh|AskLow>AskClose, 1, 0), 
+   bias.close = if_else(AskClose>AskHigh|AskClose<AskLow, 1, 0))
> 
> data_tm1_1_tidyr %>% 
+   dplyr::filter(bias.open==1|bias.high==1|bias.low==1|bias.close==1)
> 
> data_tm1_1_tidyr %<>% 
+   summarise(
+     AskOpen = mean((AskOpen - data_m1$AskOpen)^2), 
+     AskHigh = mean((AskHigh - data_m1$AskHigh)^2), 
+     AskLow = mean((AskLow - data_m1$AskLow)^2), 
+     AskClose = mean((AskClose - data_m1$AskClose)^2), 
+     Mean.HLC = (AskHigh + AskLow + AskClose)/3, 
+     Mean.OHLC = (AskOpen + AskHigh + AskLow + AskClose)/4, 
+     bias.open = sum(bias.open)/length(bias.open), 
+     bias.high = sum(bias.high)/length(bias.high), 
+     bias.low = sum(bias.low)/length(bias.low), 
+     bias.close = sum(bias.close)/length(bias.close)) %>% tbl_df
> 
> data_tm1_1_tidyr %>% 
+   kable(caption = 'MSE') %>% 
+   kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>%
+   scroll_box(width = '100%')#, height = '400px')
> data_m1_NA <- data_m1 %>% prodNA(noNA = 0.1)
> data_m1_10_impTS <- llply(algo, function(x) {
+   data_m1_NA %>% 
+     dplyr::select(starts_with('Ask'), starts_with('Bid')) %>% 
+     map(na.seadec, algorithm = x) %>% as.tibble
+   })
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
  L-BFGS-B needs finite values of 'fn'

I noticed that this error sometimes appears when I call na.seadec(x, algorithm = x).

Hello englianhu, thanks a lot for opening an issue! 👍
I'll take a deeper look at it. I don't know if I can fix the problem completely, since it seems to originate in the StructTS function, which is used internally by na.seadec. But maybe I can alter the input passed to StructTS so that it does not run into this error, or at least give the end user a more meaningful error message with instructions on what to do differently.

@englianhu do you have a small reproducible dataset that you know for sure runs into this error?
I may already have a good fix for the problem; I just need some data to test whether it really works.

data_m1_NA <- data_m1 %>% 
  dplyr::select(BidOpen, BidHigh, BidLow, BidClose, 
                AskOpen, AskHigh, AskLow,  AskClose) %>% 
  prodNA(noNA = 0.01) %>% 
  cbind(data_m1[1], .) %>% tbl_df

data_m1_1_impTS <- llply(algo, function(x) {
  data_m1_NA %>% 
    dplyr::select(starts_with('Ask'), starts_with('Bid')) %>% 
    map(na.seadec, algorithm = x) %>% as.tibble
  })
names(data_m1_1_impTS) <- algo
data_m1_1_impTS %<>% ldply %>% tbl_df

data_m1_1_impTS %<>% mutate(
  bias.open = if_else(AskOpen>AskHigh|AskOpen<AskLow, 1, 0), 
  bias.high = if_else(AskHigh<AskOpen|AskHigh<AskLow|AskHigh<AskClose, 1, 0), 
  bias.low = if_else(AskLow>AskOpen|AskLow>AskHigh|AskLow>AskClose, 1, 0), 
  bias.close = if_else(AskClose>AskHigh|AskClose<AskLow, 1, 0))

data_m1_1_impTS %>% 
  dplyr::filter(bias.open==1|bias.high==1|bias.low==1|bias.close==1)

data_m1_1_impTS %<>% 
  ddply(.(.id), summarise, 
        AskOpen = mean((AskOpen - data_m1$AskOpen)^2), 
        AskHigh = mean((AskHigh - data_m1$AskHigh)^2), 
        AskLow = mean((AskLow - data_m1$AskLow)^2), 
        AskClose = mean((AskClose - data_m1$AskClose)^2), 
        Mean.HLC = (AskHigh + AskLow + AskClose)/3, 
        Mean.OHLC = (AskOpen + AskHigh + AskLow + AskClose)/4, 
        bias.open = sum(bias.open)/length(bias.open), 
        bias.high = sum(bias.high)/length(bias.high), 
        bias.low = sum(bias.low)/length(bias.low), 
        bias.close = sum(bias.close)/length(bias.close)) %>% tbl_df

data_m1_1_impTS %>% 
  kable(caption = 'MSE') %>% 
  kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>%
  scroll_box(width = '100%')#, height = '400px')
- Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
-  L-BFGS-B needs finite values of 'fn'
- Calls: <Anonymous> ... apply.base.algorithm -> na.kalman -> StructTS -> optim
- In addition: There were 39 warnings (use warnings() to see them)

data1 : data_m1.zip
data2 : data_tm1.zip

I believed this was because the initial row of the dataset contains a value, but the output below shows that this is not the cause.

> data_tm1_1_impTS <- llply(algo, function(x) {
+   data_tm1_NA %>% 
+     dplyr::select(starts_with('Ask'), starts_with('Bid')) %>% 
+     map(na.seadec, algorithm = x) %>% as.tibble
+   })
- Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
-   L-BFGS-B needs finite values of 'fn'
> data_tm1_NA
# A tibble: 28,737 x 9
   index               BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose
   <dttm>                <dbl>   <dbl>  <dbl>    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 2015-01-12 00:01:00    118.    118.   118.     118.    118.    118.   118.     118.
 2 2015-01-12 00:02:00    118.    118.   118.     118.    118.    118.   118.     118.
 3 2015-01-12 00:03:00    118.    118.   118.     118.    118.    118.   118.     118.
 4 2015-01-12 00:04:00    118.    118.   118.     118.    118.    118.   118.     118.
 5 2015-01-12 00:05:00     NA     118.   118.     118.    118.    118.   118.     118.
 6 2015-01-12 00:06:00    118.    118.   118.     118.    118.    118.   118.     118.
 7 2015-01-12 00:07:00    118.    118.   118.     118.    118.    118.   118.     118.
 8 2015-01-12 00:08:00    118.    118.   118.     118.    118.    118.   118.     118.
 9 2015-01-12 00:09:00    118.    118.   118.     118.    118.    118.   118.     118.
10 2015-01-12 00:10:00    118.    118.   118.     118.    118.    118.    NA      118.
# ... with 28,727 more rows
> data_tm1_NA %>% md.pattern
      index BidClose BidHigh AskOpen AskLow AskHigh BidLow BidOpen AskClose     
26520     1        1       1       1      1       1      1       1        1    0
292       1        1       1       1      1       1      1       1        0    1
287       1        1       1       1      1       1      1       0        1    1
4         1        1       1       1      1       1      1       0        0    2
276       1        1       1       1      1       1      0       1        1    1
5         1        1       1       1      1       1      0       1        0    2
1         1        1       1       1      1       1      0       0        1    2
272       1        1       1       1      1       0      1       1        1    1
2         1        1       1       1      1       0      1       1        0    2
4         1        1       1       1      1       0      1       0        1    2
1         1        1       1       1      1       0      0       1        1    2
267       1        1       1       1      0       1      1       1        1    1
1         1        1       1       1      0       1      1       1        0    2
3         1        1       1       1      0       1      0       1        1    2
3         1        1       1       1      0       0      1       1        1    2
253       1        1       1       0      1       1      1       1        1    1
5         1        1       1       0      1       1      1       1        0    2
3         1        1       1       0      1       1      1       0        1    2
3         1        1       1       0      1       1      0       1        1    2
2         1        1       1       0      1       0      1       1        1    2
6         1        1       1       0      0       1      1       1        1    2
248       1        1       0       1      1       1      1       1        1    1
3         1        1       0       1      1       1      1       1        0    2
2         1        1       0       1      1       1      1       0        1    2
3         1        1       0       1      1       1      0       1        1    2
1         1        1       0       1      1       0      1       1        1    2
3         1        1       0       1      0       1      1       1        1    2
5         1        1       0       0      1       1      1       1        1    2
241       1        0       1       1      1       1      1       1        1    1
1         1        0       1       1      1       1      1       1        0    2
2         1        0       1       1      1       1      1       0        1    2
5         1        0       1       1      1       1      0       1        1    2
6         1        0       1       1      1       0      1       1        1    2
2         1        0       1       1      0       1      1       1        1    2
4         1        0       1       0      1       1      1       1        1    2
1         1        0       0       1      1       1      1       1        1    2
          0      262     266     281    285     291    297     303      313 2298

I was now able to replicate the problem. Thanks for the data and code :)
But have no fix yet - hopefully I will have time for this in the next days.

Ok...I took a deeper look into this now:
The problem occurs for na.seadec(x, algorithm ="kalman") and na.kalman().

The root cause lies in an internal call to stats::StructTS(), which itself calls stats::optim(), where the actual error occurs. optim() minimizes the function passed as its 'fn' parameter, and with L-BFGS-B that function must return a finite value at every evaluation.
Somehow this specific dataset leads to an Inf value in the call coming from StructTS.

I added a dataset here, which is just the time series needed to provoke the error.
With na.kalman(errorData) the error can be provoked.

I really do not understand why the error comes up for exactly this dataset.
(Since it comes from underlying packages I depend upon, it is also hard to fix.)

But a quick workaround is to add an additional parameter, type = "level", which is passed on to StructTS:
na.kalman(errorData, type="level")
na.seadec(errorData, algorithm ="kalman", type="level")

With this type="level" parameter the error does not occur any more.
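The workaround can also be wrapped in a fallback chain, so that a batch run does not abort on the first failing series. The following is only a sketch: `impute_with_fallback` is a hypothetical helper (not part of imputeTS), it uses the newer underscore function names (`na_kalman`, `na_interpolation`) that also appear later in this thread, and the choice of linear interpolation as the last resort is an assumption.

```r
library(imputeTS)

# Hypothetical helper (not part of imputeTS): try progressively simpler models.
impute_with_fallback <- function(x) {
  tryCatch(
    na_kalman(x),                                # default structural model
    error = function(e) tryCatch(
      na_kalman(x, type = "level"),              # workaround: local level model only
      error = function(e2) na_interpolation(x)   # assumed last resort: linear interpolation
    )
  )
}

x <- c(5, 5, 5, NA, 5, 5, NA, 5, 5, 5)
result <- impute_with_fallback(x)
```

Each step only runs if the previous one threw an error, so series that work with the default model are not affected.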

errorTS.RDA.zip

To sum up, if somebody has the same issue:

A quick workaround is to add an additional parameter, type = "level", which is passed on to StructTS:
na.kalman(errorData, type="level")
na.seadec(errorData, algorithm ="kalman", type="level")

Please also drop me a mail or open an issue, so that I can see how often people run into this.

I have this problem when trying to impute missing values in large time series with the na.kalman function. It seems that somewhere a very high value is produced and treated as infinite. The solution proposed by SteffenMoritz in #26 (comment) can be a quick fix for this problem. However, sometimes the problem persists. When that happens, you can try to rescale the time series to avoid such high values. See the following example with a large time series (86400 values).

sum(is.na(ts))
[1] 154

ts_kalman <- na.kalman(ts)
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'

ts_kalman <- na.kalman(ts, type = "level")
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'

ts_scaled <- scales::rescale(ts, c(0, 1))
ts_kalman <- na.kalman(ts_scaled)

Warning message:
In StructTS(data, ...) :
possible convergence problem: 'optim' gave code = 52 and message ‘ERROR: ABNORMAL_TERMINATION_IN_LNSRCH’

ts_kalman <- na.kalman(ts_scaled, type = "level")

sum(is.na(ts_kalman))
[1] 0

ts_kalman <- scales::rescale(ts_kalman, c(min(ts, na.rm = TRUE), max(ts, na.rm = TRUE)))

all.equal(ts[!is.na(ts)], ts_kalman[!is.na(ts)])
[1] TRUE
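The same round trip can be sketched in base R without the scales package. The helpers `rescale01` and `unscale01` are hypothetical names, and the example data is made up for illustration. They store the original minimum and maximum so that the back-transformation does not depend on the range of the imputed series.

```r
# Hypothetical helpers; base R only, no 'scales' dependency.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) stop("constant series: min-max scaling is undefined")
  list(values = (x - rng[1]) / diff(rng), min = rng[1], max = rng[2])
}

# Invert the scaling using the stored minimum and maximum.
unscale01 <- function(values, min, max) values * (max - min) + min

x <- c(1e6, 1.2e6, NA, 1.5e6, 1.4e6, NA, 1.1e6)   # made-up example data
s <- rescale01(x)
# ... impute s$values here, e.g. with na.kalman(s$values, type = "level") ...
restored <- unscale01(s$values, s$min, s$max)
all.equal(restored[!is.na(x)], x[!is.na(x)])
```

One design note: rescaling the imputed series back with its own range, as in the transcript above, only restores the original values if no imputed value falls outside the observed minimum and maximum. Keeping the original range avoids that assumption.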

👍 Many thanks for your solution @kevinv21

tbs17 commented

Hi guys, I'm having the same issue. Even with Kevin's solution, I'm not able to get what I need. After I apply the rescale function, R complains that it can't rescale a time series object. @kevinv21, can you show the steps before sum(is.na(ts))? Is your ts a time series object in this sum() call? I can't seem to work around it...

Thanks for reporting the problem @tbs17. What kind of input object do you have?
(imputeTS accepts all kinds of inputs: vector, ts, data.frame, zoo, tsibble)

I think the workaround of @kevinv21 only works with vector input
(scales::rescale needs a vector).

Just transform your input to a vector and then try to run the workaround code again.

new_input <- as.vector(your_ts)

This will not affect the imputation; the timestamps are not important (since your time series is hopefully equidistant).
Afterwards you can transform the result into a ts object again. For a ts object this would work like this (coredata() comes from the zoo package):

coredata(your_ts) <- imputed_vector
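For reference, the vector round trip can also be done with base R only. This is a sketch with made-up data; the linear interpolation via `approx()` is just a placeholder for whichever imputation or workaround you actually run on the vector.

```r
# Made-up monthly series with two gaps.
your_ts <- ts(c(10, NA, 12, 13, NA, 15), start = c(2020, 1), frequency = 12)

# Plain numeric vector; the time attributes are dropped.
new_input <- as.vector(your_ts)

# Placeholder imputation (linear interpolation); swap in the real workaround here.
idx <- seq_along(new_input)
obs <- !is.na(new_input)
imputed_vector <- approx(idx[obs], new_input[obs], xout = idx)$y

# Rebuild the ts object with the original start and frequency.
imputed_ts <- ts(imputed_vector, start = start(your_ts), frequency = frequency(your_ts))
```

Note that `approx()` with default settings leaves leading or trailing NAs in place, which does not matter for this example but is another reason to treat it as a placeholder only.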

As @SteffenMoritz says, ts is a large time series of 86400 points represented as a vector of real values. I scaled it with the rescale function from the scales package. However, you can rescale your data with other functions or packages, such as the scale function from the timeSeries package, or rescale it manually using any approach for data standardization or normalization (https://stackoverflow.com/questions/20256028/understanding-scale-in-r , https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range). Or, as @SteffenMoritz proposes, you can transform your data into a vector first.

tbs17 commented

Yeah, it looks like a non-changing value also produces the error. I tried the following two time series and got the error with the first one (ts1) but not with the second one (ts2):

ts1 <- c(5,5,5,5,5,5,5,5,NA,NA,5,5,5,5,5,5,5,5,5,5,NA,NA,5,5,5,NA)
na_kalman(ts1, type = "level")
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
  L-BFGS-B needs finite values of 'fn'

ts2 <- c(5,5,5,5,5,5,5,5,NA,NA,5,5,5,4,5,5,5,5,5,5,NA,NA,5,5,5,NA)
na_kalman(ts2, type = "level")
5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 4.952381 4.952381 5.000000 5.000000 5.000000 5.000000 4.000000 5.000000 5.000000 5.000000 5.000000 5.000000 4.952381 4.952381 5.000000 5.000000 5.000000 4.952381

Anyway, if your time series contains only repeated values, you can also try other techniques such as linear interpolation or last observation carried forward.

tbs17 commented

@tbs, can you provide an example time series you want to impute?

"But what I have is a time series data that have 5 NAs
and 25 same values."

This sounds like e.g.
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, NA, 4, 4, 4, NA, 4, 4, 4, NA, NA, 4, 4

Why would you expect an algorithm to impute increasing values there?
I mean, there has to be at least an increasing trend visible somewhere for it to make sense to impute increasing values.

If all values are 4, except for the NAs, imputing 4 probably makes the most sense
(unless you have prior knowledge that indicates something else).

In general, if you strictly want your imputations to follow a trend, the na.kalman method would be a good choice (but of course this only works if the fitted model finds a trend in the data).
Is the '25 same values' series just one of many time series you want to impute? Then use another algorithm for this one (e.g. na.interpolation) and use na.kalman for all the other series.

You could also try na.ma() (moving average) with a high k parameter, e.g. k = 7 or something like that. If there really is a strong trend in the data, small noise shouldn't lead to decreasing values with this setup.

If you really have only a series with '25 same values' and expect the NAs to be increasing, despite no indication of this in the data, you have to model this on your own. All algorithms can only extract information from the data; they won't magically impute a trend that is not shown in the data. (Yet there are some transformations you can make to force imputed values into a certain range.)

tbs17 commented

Oh, I probably didn't see this one.
About using your own model, and this question: "Can I
just make an easy linear model based on the time series data"

You'd just define the model like this.

usermodel <- arima(tsAirgap, order = c(1, 0, 1))$model
na_kalman(tsAirgap, model = usermodel)

So first you specify your own ARIMA model and then you pass it as a parameter to na_kalman.
ARIMA stands for AR (autoregressive), I (integrated), MA (moving average).

So if you just want a simple linear model you might want to specify ARIMA(1,0,0).
Then your model would look like this:
usermodel <- arima(tsAirgap, order = c(1, 0, 0))$model

Some more information about the "Error in optim ... " issue. Just had a new mail from a user that had this issue.

Turns out the problem here was also caused by a series that had only NAs and one non-changing value.

This is what Sigve Sørensen (thx for reporting!) wrote me:
"The error comes when there are series that contain only nans AND zero values (nan, nan, 0, 0, 0, nan … nan)"

So this seems similar to what @kevinv21 wrote above. But the type = "level" workaround as described by Kevin did not seem to work here.

So in general, if anybody else has this error, look out for time series with only one repeated value. Sorry that I don't have a fix yet, since the error comes from an underlying package. But you can probably impute these series quite easily anyway, since you would just replace all NAs with the one repeated value. After all, if all values of a time series are e.g. 2, you'd probably also expect the NAs to be 2.
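Until a fix is available, such a check can be sketched on the user side. `impute_guard_constant` is a hypothetical wrapper (not part of imputeTS) in base R; the `imputer` argument stands for whatever function you would normally call, e.g. na_kalman.

```r
# Hypothetical wrapper: fill constant series directly, delegate everything else.
impute_guard_constant <- function(x, imputer, ...) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0) stop("series contains only NAs")
  if (all(observed == observed[1])) {
    x[is.na(x)] <- observed[1]   # constant series: use the single repeated value
    return(x)
  }
  imputer(x, ...)                # otherwise call the real algorithm
}

ts1 <- c(5, 5, 5, NA, NA, 5, 5, NA)
# The imputer is never reached for a constant series.
filled <- impute_guard_constant(ts1, imputer = function(x, ...) stop("not reached"))
```

With imputeTS loaded, a call like `impute_guard_constant(x, na_kalman, type = "level")` would forward the extra argument to the imputer for all non-constant series.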

Just got another mail with not exactly the same but a related issue.
(still have to check further details there)

The problem there is also:

possible convergence problem: 'optim' gave code = 52 and message ‘ERROR: ABNORMAL_TERMINATION_IN_LNSRCH’

Dear @SteffenMoritz, thanks a lot for your contribution. I am using imputeTS to fill in missing values in panel data, and I find that some time series (such as 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, NA, 4, 4, 4, NA, 4, 4, 4, NA, NA, 4, 4) lead to Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, : L-BFGS-B needs finite values of 'fn'. So I have to divide my panel data into two subsets and apply different algorithms to each subset. Thanks a lot for your answers above.
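Splitting a panel into the two subsets can be sketched as follows. The data frame and its column names are made up for illustration, and base R is assumed; only the constant columns are filled here, while the varying ones are left for the regular algorithm.

```r
# Made-up panel: columns 'a' and 'c' are constant apart from NAs, 'b' varies.
panel <- data.frame(
  a = c(4, 4, NA, 4),
  b = c(1, NA, 3, 4),
  c = c(NA, 0, 0, NA)
)

# A column counts as constant if it has at most one distinct non-NA value.
is_constant <- vapply(
  panel,
  function(col) length(unique(col[!is.na(col)])) <= 1,
  logical(1)
)
constant_cols <- names(panel)[is_constant]
varying_cols  <- names(panel)[!is_constant]

# Fill constant columns with their single observed value.
for (nm in constant_cols) {
  value <- panel[[nm]][!is.na(panel[[nm]])][1]
  panel[[nm]][is.na(panel[[nm]])] <- value
}
# The columns in varying_cols can then be passed to na.kalman or another algorithm.
```

Note that a column containing only NAs also counts as constant under this test and stays all NA, so such columns need separate handling.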

Oh, sorry for answering so late.
Thanks @hezhichao1991 for reporting.

It is great having you all contribute to make the package better :)

Found the time to do some further checks:

Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, : L-BFGS-B needs finite values of 'fn'.

It always appears when the series has no variation in its values, e.g. 4,4,4,4,NA,4,NA as you describe.
As soon as there is even one different value in the series, everything works as expected.

Even c(4, 4, 4, 4, NA, 4.000001, 4, 4) works. The error appears only if all values are exactly the same.

The cause lies in functions I am calling, which I can't change.
But the solution might be obvious: I think I will just insert a check for whether the series consists of constant values only.

Looking at 4,4,4,4,NA,4,NA, it is quite certain that the correct imputed value should also be 4.

Fix will come with the next update!