bkelly-lab/ipca

Further Changes Roadmap

Closed this issue · 4 comments

I wanted to create a separate issue from #3 to discuss some potential further changes:

  1. In order to fully align with sklearn we need to distinguish the characteristics in X from the indices (time periods/assets).

    1.a Maybe the most natural way is to add an input called something like indices or groups. Then X would just be the chars, y would just be the returns, and indices would hold the time/asset labels. What I'm imagining here then is:

    fit(X=None, y=None, indices=None, ...)

    1.b This might provide a good route to direct integration with pandas. Our indices would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandas X and y with a MultiIndex.
    1.c sklearn also has a series of "multipleoutput" methods which somewhat align with what we're doing here. I need to read into this more but this might be one way to go.

  2. Ultimately, we should add an IPCARegressorCV class instead of the current fit_path method. I'll take care of this once we handle the indices issue. This is also where I can add the "hot-start" approach for cross-validation (which should be faster on most machines).

  3. It might make more sense to name the main class something like InstrumentedPCA instead of IPCARegressor. Two reasons for this:

    3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
    3.b Regressor doesn't seem to add much information to the name.

One other thought here on the indices, we could set things up such that fit can either take a single panel matrix (comparable to what we were doing before) or a series of matrices with all the info broken out.

If fit receives the full panel, we'll handle the splitting internally, while sklearn could use the X, y, indices variant to align with their formatting.

  1. In order to fully align with sklearn we need to distinguish the characteristics in X from the indices (time periods/assets).
    1.a Maybe the most natural way is to add an input called something like indices or groups. Then X would just be the chars, y would just be the returns, and indices would hold the time/asset labels. What I'm imagining here then is:
    fit(X=None, y=None, indices=None, ...)

Looks good to me.

1.b This might provide a good route to direct integration with pandas. Our indices would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandas X and y with a MultiIndex.

Indeed, pandas integration would be great. This would allow to pass X as a multi-index dataframe and y as a multi-index dataframe / series. This could also come in handy with regards to handling missing obsevations in the predict & predictOOS methods when called externally.

  1. It might make more sense to name the main class something like InstrumentedPCA instead of IPCARegressor. Two reasons for this:
    3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
    3.b Regressor doesn't seem to add much information to the name.

Agreed, IPCARegressor would not match well with the sklearn nomenclature.

Cool, I'll put this together then at some point in the near future. That way we can hopefully lock down the fit signature.

Closed by #5