sryza/spark-timeseries

Validating ARIMA fitting

witzatom opened this issue · 0 comments

I noticed a part of the log likelihood code that I didn't entirely trust to be true to the theoretical background, so I wanted to validate that ARIMA fitModel returns reasonably good results.

I generated an ARMA time series in Python using the statsmodels package:

import numpy as np
import statsmodels.api as sm

np.random.seed(12345)
# see http://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_process.arma_generate_sample.html
arparams = np.array([1.25, -.25])
maparams = np.array([-.55, .15])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
sigma = 1
y = sm.tsa.arma_generate_sample(ar, ma, 100, sigma=sigma)

model = sm.tsa.ARMA(y, (2, 2)).fit(trend='nc', disp=0)
print(model.params)
print(repr(y))

I then used this series to fit a model in spark-ts:

val data = linalg.Vectors.dense(Array[Double](-0.20470766,  0.33564798, ... ))
val model = ARIMA.fitModel(2,0,2,data,false,"css-bobyqa")
println(model.coefficients.toList)

The resulting fit was:
List(0.3198554448811223, 0.628404313480156, 0.4955309624717912, 0.023362165310726737)

Or, slightly formatted:
c = 0, ar = [0.32, 0.63], ma = [0.50, 0.02], sigma = ??

As far as I know, sigma can be estimated from the rest of the parameters, but even so the AR and MA parameters are very far from what they should be.
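For reference, here is how sigma2 can be recovered from a set of fitted coefficients in the same way the Scala code does it, as css / n over the one-step residuals. This is a minimal sketch, not code from either library: it assumes a zero-mean ARMA and treats residuals as zero before enough lags exist (the conditional-sum-of-squares convention), and the helper name is mine.

```python
import numpy as np

def estimate_sigma2(y, ar, ma):
    """sigma2 = css / n for a zero-mean ARMA(p, q) with given coefficients.
    Residuals at times before enough lags exist are treated as zero,
    i.e. the 'conditional sum of squares' convention."""
    n = len(y)
    eps = np.zeros(n)
    for t in range(n):
        pred = 0.0
        for i, phi in enumerate(ar):      # AR part: phi_i * y_{t-1-i}
            if t - 1 - i >= 0:
                pred += phi * y[t - 1 - i]
        for j, theta in enumerate(ma):    # MA part: theta_j * eps_{t-1-j}
            if t - 1 - j >= 0:
                pred += theta * eps[t - 1 - j]
        eps[t] = y[t] - pred              # one-step residual
    css = float(np.dot(eps, eps))         # conditional sum of squares
    return css / n
```

Plugging the true generating coefficients into this should land close to the sigma^2 the series was generated with.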

The part of the code that I found that prompted this investigation was:
https://github.com/sryza/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/models/ARIMA.scala#L443
val sigma2 = css / n

In theory, isn't sigma part of the coefficients to be fitted, or am I wrong? Some papers assume sigma to be 1, and I used such a sigma in the Python code snippet as well. There might be a bug in the log likelihood function in the code. Because of how the variance (sigma2) is calculated in the log likelihood function, the objective function
(-n / 2) * math.log(2 * math.Pi * sigma2) - css / (2 * sigma2)
can, as far as the optimization is concerned, be simplified quite a bit: substituting sigma2 = css / n turns the second term into the constant -n / 2, and dropping constants and monotone transformations gives
(-n / 2) * math.log(2 * math.Pi * sigma2) - css / (2 * sigma2) ~ (-n / 2) * math.log(2 * math.Pi * sigma2) ~ -math.log(sigma2) ~ -css
(the last step because log is monotone).
So intuitively this should still optimize into a fit with low css. I wouldn't really mind if it returned decent results, but judging by my test it doesn't seem quite right. How does this work? Why does the optimization not return parameters closer to the ones the series was generated from? Am I doing something wrong?
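To make the comparison concrete, here is a rough Python sketch of the CSS log likelihood that I believe the Scala code computes (zero intercept, residuals assumed zero before the series starts). The helper itself is mine, not from either library; with sigma2=None it plugs in sigma2 = css / n like the `val sigma2 = css / n` line above, and passing sigma2=1.0 evaluates at the sigma the data was generated with.

```python
import math
import numpy as np

def css_loglik(y, ar, ma, sigma2=None):
    """Conditional-sum-of-squares log likelihood of a zero-mean ARMA(p, q).
    sigma2=None plugs in the MLE sigma2 = css / n; otherwise the supplied
    value (e.g. sigma2=1.0, the value used to generate the test series)."""
    n = len(y)
    eps = np.zeros(n)
    for t in range(n):
        pred = 0.0
        for i, phi in enumerate(ar):      # AR part: phi_i * y_{t-1-i}
            if t - 1 - i >= 0:
                pred += phi * y[t - 1 - i]
        for j, theta in enumerate(ma):    # MA part: theta_j * eps_{t-1-j}
            if t - 1 - j >= 0:
                pred += theta * eps[t - 1 - j]
        eps[t] = y[t] - pred
    css = float(np.dot(eps, eps))
    if sigma2 is None:
        sigma2 = css / n                  # same plug-in as ARIMA.scala
    return (-n / 2) * math.log(2 * math.pi * sigma2) - css / (2 * sigma2)
```

Running both the statsmodels coefficients and the spark-ts coefficients through a function like this (with sigma2 = 1.0) shows whether the spark-ts fit is actually close in likelihood even though the coefficient vectors differ.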

EDIT: The vector distance is high, but the log likelihood of the parameters that spark-ts optimizes is pretty good, assuming sigma = 1. I would still appreciate it if the estimated sigma or sigma2 were exposed as a property.