matthewwardrop/formulaic

ENH: Preserve variable order as they appear in formulas

Closed this issue · 5 comments

formulaic does not appear to reliably preserve variable order, so that the formulas

f1 = 'y ~ x1 + x2'
f2 = 'y ~ x2 + x1'

generate the same DataFrame when run through model_matrix.

import pandas as pd
import numpy as np
from formulaic import model_matrix
from formulaic.model_spec import NAAction

rg = np.random.default_rng(0)
df = pd.DataFrame(rg.standard_normal((10,3)), columns=["y","x1","x2"])

f1 = 'y ~ x1 + x2'
f2 = 'y ~ x2 + x1'

mm1 = model_matrix(f1, df, na_action=NAAction("ignore"))
mm2 = model_matrix(f2, df, na_action=NAAction("ignore"))

print(mm1.rhs.columns)
print(mm2.rhs.columns)

Hi @bashtage ! Thanks for reaching out!

Formulaic made a design decision way-back-when to always sort terms and factors such that equivalent formulae behave identically and always generate the same results. You can read more about this here: https://matthewwardrop.github.io/formulaic/guides/formulae/#formula .

I can definitely see why this might be a bit annoying, though, if you are manually staring at regression reports, and are used to being able to search for features in the order you wrote them rather than lexically (though arguably it is easier to find lexically).

I'm willing to add support for disabling this, and perhaps even default to disabling this feature. I presume that the factors within a term could still be sorted?

Hi Matthew,

First of all, I want to express my gratitude for developing the 'formulaic' package. It has been incredibly convenient to use in conjunction with the linearmodels package developed by Kevin.

I do have one suggestion for your package. Currently, when running linearmodels with the 'formulaic' package, the independent variables are ordered alphabetically by default. This differs from the 'statsmodels' package, which orders the variables as specified by the user in the formula. It would be much more user-friendly if the 'formulaic' package ordered the variables the same way as 'statsmodels', since alphabetical ordering is unintuitive and can make it confusing to loop over and extract specific parameters and statistics.

I hope you find this suggestion helpful, and I appreciate your consideration in implementing this improvement in your package.

Thank you again for your hard work in developing the 'formulaic' package.

Best,
Dong Gil Kim

Hi @matthewwardrop,

I do think this would be a useful enhancement. For example, it is common to write a specification in which:

(a) The first few variables are actually of interest
(b) The remaining variables are controls that are not of specific interest.

I also think that expanding a specification can be tricky when the order is not preserved, since one has to figure out the position in the output to get the relevant coefficients.

Hi @bashtage and @DongGilKim ,

Thanks for your patience. I never have as much time as I'd like to work on my projects :).

I'm sold. I think this is just a case of me having a momentary idea years ago and never revisiting it. Keeping the terms in the same order as input (grouped by interaction order, like R and patsy) makes sense to me, and there is basically no benefit to having a guaranteed order to the terms. I'll fix this shortly, just in time for the 1.0.0 release. Thanks for catching this in time :).
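To make "same order as input, grouped by interaction order" concrete, here is a minimal sketch in plain Python. It assumes terms are represented as tuples of factor names and uses a hypothetical helper `order_terms`; it is an illustration of the idea, not formulaic's actual implementation. The lexical variant is likewise only an approximation of the earlier sorted behaviour.

```python
# Hypothetical representation: each term is a tuple of factor names.
# This is NOT formulaic's internal data structure.
def order_terms(terms):
    """Group terms by interaction order, keeping input order within each group."""
    # sorted() is stable, so terms with the same number of factors
    # keep the relative order in which they were written.
    return sorted(terms, key=len)


def order_terms_lexically(terms):
    """Approximation of the earlier behaviour: sort by order, then by name."""
    return sorted(terms, key=lambda t: (len(t), sorted(t)))


terms = [("x2",), ("x1", "x2"), ("x1",), ("a", "b", "c"), ("a",)]

preserved = order_terms(terms)
lexical = order_terms_lexically(terms)

print(preserved)
print(lexical)
```

With the stable sort, `x2` stays ahead of `x1` because it was written first; only the grouping by interaction order moves terms around.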

fwiw: When looking up indices in a library, I recommend using the model spec rather than relying on input order, since terms may expand to multiple columns (which in turn may be reduced in cardinality to keep things full rank).
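As a rough sketch of why name-based lookup is safer than positional lookup: the mapping below is hand-written and purely illustrative (the term names, column names, and the helper `indices_for_term` are assumptions, not formulaic's API). In practice this information would come from the model spec.

```python
# Hand-written, hypothetical mapping from each term to the columns it
# expands to. A categorical term here expands to two columns because one
# level is dropped to keep the matrix full rank.
term_columns = {
    "Intercept": ["Intercept"],
    "x1": ["x1"],
    "C(group)": ["C(group)[T.b]", "C(group)[T.c]"],
    "x2": ["x2"],
}

# Flatten to the column order of the (hypothetical) model matrix.
column_names = [name for cols in term_columns.values() for name in cols]
column_index = {name: i for i, name in enumerate(column_names)}


def indices_for_term(term):
    """Return the column positions that belong to a given term."""
    return [column_index[name] for name in term_columns[term]]


print(indices_for_term("C(group)"))
print(indices_for_term("x2"))
```

Here `x2` is the fourth term written but sits in the fifth column, so counting terms in the formula would point at the wrong coefficient.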

I should have this done in a week or so.

Okay... so it wasn't too hard, and I had a bit of time to work on it... so I did. I'd love your thoughts about the remaining differences between patsy and formulaic ordering in #139 .