matthewwardrop/formulaic

Handling individual columns that can expand into multiple columns

Opened this issue · 7 comments

hi, thanks for the library!

I was wondering how best to handle a particular use-case, where a single column can "expand" into multiple categorical factors. The specific context is when analyzing datasets containing information about the effect of genetic mutations on the function of a protein (or other genetic targets). When there's a single starting sequence (sometimes called a "parent" or "wild-type"), it makes sense to represent these mutations as being relative to that parent sequence. So for a single mutation you might have a string like "A11K" which means changing the parent sequences alanine residue ("A") at site 11 to a lysine ("K"). Then if you have multiple mutations for a single protein, it might look something like "A11K;S28T" for that version of the protein.

So an example, slightly modified from the published data of this paper, would look like this (subset of the full table):

mutations activity
A108T:I150V 2.63384474
A108T:K129E:S145C:E170G:T184S 1.31862971
A108T:K156R:F163L:V174A:Q202L:K207E 1.30103208
A108T:Q202R :M216L:V222A 1.30103059

The most typical way to use this information in machine learning is a one-hot encoding for each possible mutation. But I'd like to be able to leverage formulaic to:

  • have more control on these encodings beyond a simple one-hot encoding
  • more easily integrate mutational encoding in larger formulae with other terms
  • take advantage of model_spec resuse on new datasets to have a consistent encoding
  • leverage the ability of formulaic to generate higher-order interactions between different factors, as this is often a goal of statistical analyses of these data

So I'm trying to figure out how best fit this into a pipeline relying on formulaic. There are essentially two steps that have to be done in sequence:

  1. Expand the condensed representation of individual mutations into a per-position representation, essentially a different categorical column for each possible mutation position
  2. Build a formula that categorically encodes each individual site.

I could already do this "by hand" by transforming the input data for step 1 and then feeding this new matrix into a model_matrix call. E.g. something like

model_matrix("C(site108) + C(site129) + ... ", transformed(data))

but the challenge here is somewhat already apparent: there's a large number of categorical factors (one for each site) that will be generated (it's not uncommon to have hundreds or thousands of sites mutated in a given dataset). So, the bookkeeping on these sites, and generating an equation that represents each one, can be somewhat unruly. There's two potential solutions I had in mind, and wanted to know what makes the most sense.

Option 1 (preferred): Make it possible to implement a stateful transform that expand into multiple factors

My ideal API here would be to define a stateful transform that looks something like this:

model_matrix("M(mutations)", data)

where M would both do the per-site encoding as well as generate a set of individual categorical factors. I looked into the internals of how C works and it doesn't seem like stateful transforms can operate in this way? Or more generally is it not possible for an individual formula factor to expand into multiple factors?

The reason I'd prefer this option as it would simplify usage in more complicated analysis where other terms might be included in the formula. It also could potentially make re-use of model_spec on a new dataset more convenient (for example when a new dataset is missing mutations at a particular site).

Option 2: Build a formula programmatically for a transformed version of the dataset

I'm still digging into how you can accomplish this, but if I understand correctly formulaic already supports programmatic generation of formulas, so I could do the site expansion upstream of processing with formulaic and then generate the categorical terms for each individual site. This would be a reasonable solution, but I just wanted to double check something like option 1 couldn't work as I think it would simplify use for end-users.

Hi @ptonner ! Thanks for reaching out.

This is an interesting use-case! Formulaic does support nested "explosions", but doesn't implement categorical encoding for them. For example:

>>> def d():
>>>     return {"a": {"b": [1,2,3], "c": [1,2,3]}, "b": {"b": [1,2,3], "c": [1,2,3]}}

>>> formulaic.model_matrix("d()", pandas.DataFrame({"x": [1,2,3]}))
   Intercept  d()[a][b]  d()[a][c]  d()[b][b]  d()[b][c]
0        1.0          1          1          1          1
1        1.0          2          2          2          2
2        1.0          3          3          3          3

You could manually encoded the nested values in M() stateful transform if you were so inclined, but it is a bit of work.

It would be relatively straightforward, I think, to add support for automatic categorical codings on on top of this as well. If I did this would it meet your use-case? It would look something like:

>>> formulaic.model_matrix('M(mutations)', data)
   Intercept  M()[site102][AT]  M()[site102][AK] ...

where M() is returning a dictionary of site features to be categorically encoded.

interesting, I didn't know that was possible! I this this could work, but there's a couple details I'm not clear on.

For one thing, it looks like d() is treated as a single term? One reason this might be an issue is I was hoping to rely on formulaic's capability to easily expand interaction terms (e.g. something like (site1 + site2 + ...)^2), since looking into higher-order mutation interactions is a common need. If d ends up being a single term, then I think things like d()^2 won't work in a similar way? Maybe if there was support for categorical factors then this would work? As a fall back I would probably be fine writing and interface like M(..., order=2) to make these interactions explicit

Also just want to be sure - the transform M supported with the change would be extracting mutations from the dataset provided to model_matrix? E.g. model_matrix('M(mutations)', data) means there's a column mutations in data that is used to parse the mutation strings (e.g. of the form 'A108T:I150V') and build the nested dictionary?

Also, how would this work with re-using the generated ModelSpec for new datasets? So for example a test dataset might have only a subset of the mutations observed in training, and a generation of the new design matrix should follow the same encoding used in building the training data. I think this would be handled by support for categorical coding, b/c they have state memory. Just want to be sure

Would there be an easy way to test out a proof of concept implementation of categorical factors for these expanding terms? if I could play around with that aspect it would probably be easier for me to know if this would work for me

d() is indeed treated as a single term, but products of d() with other features including itself will have almost the same Cartesian product representation that you would expect... if you can get it to look "different" enough. For example:

In [8]: formulaic.model_matrix("d()*d( )", pandas.DataFrame({"x": [1,2,3]}))  # note the space
Out[8]:
   Intercept  d()[a][b]  d()[a][c]  d()[b][b]  ...  d()[a][b]:d( )[b][c]  d()[a][c]:d( )[b][c]  d()[b][b]:d( )[b][c]  d()[b][c]:d( )[b][c]
0        1.0          1          1          1  ...                     1                     1                     1                     1
1        1.0          2          2          2  ...                     4                     4                     4                     4
2        1.0          3          3          3  ...                     9                     9                     9                     9

[3 rows x 25 columns]

But it's not exactly the same (there are squares in there), and it's definitely not very satisfying... I'll see if I can think of alternative ways to get this kind of unnesting to work more nicely; but it might end up requiring the parameter driven approach you suggested above.

Also just want to be sure - the transform M supported with the change would be extracting mutations from the dataset provided to model_matrix? E.g. model_matrix('M(mutations)', data) means there's a column mutations in data that is used to parse the mutation strings (e.g. of the form 'A108T:I150V') and build the nested dictionary?

That's correct.

Also, how would this work with re-using the generated ModelSpec for new datasets? So for example a test dataset might have only a subset of the mutations observed in training, and a generation of the new design matrix should follow the same encoding used in building the training data. I think this would be handled by support for categorical coding, b/c they have state memory. Just want to be sure

That's correct also.

Would there be an easy way to test out a proof of concept implementation of categorical factors for these expanding terms? if I could play around with that aspect it would probably be easier for me to know if this would work for me

You could get started today by just manually generating the nested dictionary structure, and calling pandas.get_dummies yourself on these. You would have to keep track of the the stateful transform state doing this, but that should give you a decent sense of how things will work. I'll see if I can implement the auto categorical coding for you over the weekend (no promises... my life is pretty full these days).

Okay thanks for taking the time on this already! I appreciate all the feedback already

I played around a bit to see about a nested dictionary mapping onto mutations

def M(_):
     intermediate = pd.DataFrame(                                                                                                                  
          {"site1": ["", "a", "", "a"], "site2": ["", "", "b", "b"]}                                                                                
     )                                                                                                           
     return {                                                                                                    
         s: pd.get_dummies(c).drop(columns="").to_dict(orient="list")                                                                              
         for s, c in intermediate.items()
     }      

The intermediate representation would come from a separate utility function parsing out the individual mutations into separate site columns, but I've hard coded it here for brevity. I also drop the "" values to emulate the effect of dropping a single factor level for categorical data.

This gives:

mut = pd.Series(["", "a1b", "b2c", "a1b;b2c"])
mm = formulaic.model_matrix("M(mut)", mut.to_frame("mut"))
Intercept M(mut)[site1][a] M(mut)[site2][b]
0 1 0 0
1 1 1 0
2 1 0 1
3 1 1 1

which looks good!

Expanding this to interactions is correct, although the notation isn't ideal (sorry for different formatting)

formulaic.model_matrix("M(mut):M( mut)", mut.to_frame("mut"))
   Intercept M(mut)[site1][a]:M( mut)[site1][a] M(mut)[site2][b]:M( mut)[site1][a] M(mut)[site1][a]:M( mut)[site2][b] M(mut)[site2][b]:M( mut)[site2][b]  
0  1.0       0                                  0                                  0                                  0 
1  1.0       1                                  0                                  0                                  0                                   
2  1.0       0                                  0                                  0                                  1 
3  1.0       1                                  1                                  1                                  1                                   

It would be interesting to try this with support for categorical factors.

But at the moment, I'm leaning towards pushing the logic of generating a representation like intermediate upstream of processing by formulaic. That way I can still rely on the built-in support for categorical factors more directly and only have to:

  • added site-like columns to look like intermediate, using a function I'll define
  • build a formula based on these extracted sites (e.g. site1 + site2 + ...)

For building the formula, I was thinking some form of aliasing could be used like described here. But I'm not sure this can work for something that essentially represents multiple individual terms? E.g. something like this

mutations = C(intermediate["site1"]) + C(intermediate["site2"])                                          
formulaic.model_matrix("mutations", intermediate)

doesn't work b/c C is just wrapping the series and the resulting mutations variable is a series representing the string adding of the values. Maybe I could go about implementing something to how C works internally but instead of returning a FactorValues object with an encoder dealing with the multiple site columns?

Actually playing around with FactorValues maybe yields something useful?

Doing this

def M(data: pandas.Series):                                        
    def encoder(*args, **kwargs):                                  
        return pandas.get_dummies(intermediate, drop_first=True) 
                                                                   
    return FactorValues(data, kind="categorical", encoder=encoder) 

means I can run

formulaic.model_matrix("m.M(mut)", mut.to_frame("mut"))
Intercept m.M(mut)[site1_b] m.M(mut)[site2_c]
1 0 0
1 1 0
1 0 1
1 1 1

which looks promising

Although interactions doesn't quite work as intended (I had to use aliasing for this, b/c I was getting an error Unable to evaluate factor m.Mut(mut ) when trying to use a " " to create two different terms)

m1 = M(mut)
m2 = M(mut)

formulaic.model_matrix("m1:m2", mut.to_frame("mut"))
    Intercept m1[site1_b]:m2 m1[site2_c]:m2[ m1[site1_b]:m2[ m1[site2_c]:m2 
    1.0       0              0               0               0 
    1.0       1              0               0               0              
    1.0       0              0               0               1 
    1.0       1              1               1               1              

Do you think this is something that could addressed by doing some more digging in how contrasts are currently built for single categorical factors? I can look around and try some more things if this seems like it would work

@ptonner Apologies for the delay. Life has been busy.

I was getting an error Unable to evaluate factor m.Mut(mut )

Huh... interesting. What was the actual error message?

Do you think this is something that could addressed by doing some more digging in how contrasts are currently built for single categorical factors? I can look around and try some more things if this seems like it would work

Hmm... I'm not sure I understand this. What is the error you are talking about? The above one? Or is the output matrix not what you expected? The biggest problem I see is that you get squared terms like m1[x]:m2[x].

I've thought about this a bit more, and am thinking about introducing an "unnesting" syntax like:
y ~ &D:&D

The & would expand the first level of a dictionary in an evaluated factor into a set of top-level terms during term scoping (where we decide whether to include the reduced form of a term). This would be able to be deepened using &&D, for example, for unnesting dictionaries of dictionaries. Using & would also set a bit on token at the parser-level that indicates that this should have its own unique hash, and so self-products would not collapse. Somewhere we would need to deal with the fact that self-products of the unnested terms may now exist... but I don't think that will be too hard, and can also be dealt with during scoping.

I'm a little torn, because this does add complexity, but I'm curious what you think about this. I'm pretty sure it would solve your use-case... do you think it worthwhile?

No worries on the delay!

I'll have to get back to you about the errors, I need to recreate the environment I was testing in.

But for your suggestion, it's a bit hard for me to know if this would address my issue. It seems like it would? But I also agree that it might be a bit complex if this is the only use case.

As another alternative, would it be possible to support the idea of expanding a regular expression matching multiple columns into separate terms? E.g. something like #site\d+ matches all columns like site001, site002, etc. Then, if I've pre-expanded into the separate siteXXX columns and marked them as categorical then I could likely get the behavior I'd like.

Also, just FYI I have got a workable pipeline for my purposes similar to this second idea. Essentially:

  1. Expand a mutations column into individual siteXXX columns, marking as categorical. This is done completely outside of formulaic. It would be nice if somehow could be incorporated into the formula expansion and reification. But since this seems like it will be difficult to incorporate into formulaic, I'm starting to think this might make more sense as I have it now as an upstream data preprocessing step.
  2. Expanded a formula in a similar fashion to the regular expression idea outlined above, but instead of doing inside formulaic I just do this as a string substitution when building the formula string.

So all this to say: if this feature would be pretty complex (seems like it would?) then I'm fine closing this for now rather than adding the complexity. I have a workable solution for my needs.