Further Changes Roadmap
Closed this issue · 4 comments
I wanted to create a separate issue from #3 to discuss some potential further changes:
-
In order to fully align with sklearn we need to distinguish the characteristics in
X
from the indices (time periods/assets).1.a Maybe the most natural way is to add an input called something like
indices
orgroups
. ThenX
would just be the chars,y
would just be the returns, andindices
would hold the time/asset labels. What I'm imagining here then is:fit(X=None, y=None, indices=None, ...)
1.b This might provide a good route to direct integration with pandas. Our
indices
would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandasX
andy
with a MultiIndex.
1.c sklearn also has a series of "multipleoutput" methods which somewhat align with what we're doing here. I need to read into this more but this might be one way to go. -
Ultimately, we should add an
IPCARegressorCV
class instead of the currentfit_path
method. I'll take care of this once we handle the indices issue. This is also where I can add the "hot-start" approach for cross-validation (which should be faster on most machines). -
It might make more sense to name the main class something like
InstrumentedPCA
instead ofIPCARegressor
. Two reasons for this:3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
3.bRegressor
doesn't seem to add much information to the name.
One other thought here on the indices, we could set things up such that fit
can either take a single panel matrix (comparable to what we were doing before) or a series of matrices with all the info broken out.
If fit receives the full panel, we'll handle the splitting internally, while sklearn could use the X
, y
, indices
variant to align with their formatting.
- In order to fully align with sklearn we need to distinguish the characteristics in
X
from the indices (time periods/assets).
1.a Maybe the most natural way is to add an input called something likeindices
orgroups
. ThenX
would just be the chars,y
would just be the returns, andindices
would hold the time/asset labels. What I'm imagining here then is:
fit(X=None, y=None, indices=None, ...)
Looks good to me.
1.b This might provide a good route to direct integration with pandas. Our
indices
would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandasX
andy
with a MultiIndex.
Indeed, pandas integration would be great. This would allow to pass X
as a multi-index dataframe and y
as a multi-index dataframe / series. This could also come in handy with regards to handling missing obsevations in the predict
& predictOOS
methods when called externally.
- It might make more sense to name the main class something like
InstrumentedPCA
instead ofIPCARegressor
. Two reasons for this:
3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
3.bRegressor
doesn't seem to add much information to the name.
Agreed, IPCARegressor
would not match well with the sklearn nomenclature.
Cool, I'll put this together then at some point in the near future. That way we can hopefully lock down the fit
signature.