ENH: Use pointers to make target columns available at source level
Closed this issue · 10 comments
Is your feature request related to a problem?
Currently, pointers allow us to aggregate variables from source individuals (e.g. children) to target individuals (e.g. parents). Sometimes it is necessary to make target columns available on the source level (e.g. in the case of Unterhaltsvorschuss).
For more details, see the discussion we had here.
Once this is implemented, we might be able to get rid of the conditions in kinderzuschl_bruttoeink_eltern_m
and similar functions (see this comment).
Should this be yet another addition to the complex name convention for parameters, like <foreign_key_column>_<target_column> or how do you picture a general solution for this? Or should this be solved on a case-by-case basis?
Also, can't this be solved by a simple DataFrame.join
where this connection is needed?
In principle this is a simple join, pinky that we don't have dataframes but numpy/Jax arrays.
The naming convention you propose sounds great!
The naming convention you propose sounds great!
Ok, this would also need
- A distinct separator between
foreign_key_column
andtarget_column
(not an underscore), - A distinct keyword to indicate the desired interpretation of this parameter.
Both requirements could be fulfilled at once by having a unique infix, like <foreign_key_column>_to_<target_column>
. Though we should probably pick something less likely to occur in other parameter names than _to_
.
That said, I find the current name convention already quite hard to work with due to all the "magic" it encodes (time conversion, aggregation, renaming via dates_active
, ...). Because of this, I'd rather suggest adding a utility function that takes three inputs:
- Array for the foreign keys.
- Array for the primary keys.
- Array for the targets.
This function can then return another array with the targets that belong to the respective foreign keys. Where needed, this function can be called explicitly. This is essentially the same as specifying a parameter using some name convention to implicitly invoke this function. If caching of the result is desired, the call can be wrapped into another function like _unterhaltsvors_empf_einkommen_m
.
I don't follow all the way yet, I am afraid. How would that work with the DAG? How would somebody who wanted to extend GETTSIM call that function?
I'd also be fine with doing something provisional right now and then re-consider when re-designing the interface after the move to a hierarchical scheme.
Let's say we go with some name convention and add a parameter p_id_unterhaltsvors_empf_to_einkommen_m
to some policy function:
- This would implicitly add dependencies to the columns
p_id_unterhaltsvors
(foreign key),p_id
(primary key), andeinkommen_m
(target). This does not seem obvious to users and needs special treatment on the implementation side. - No such function exists anywhere, so it needs to be created on-the-fly. The created function will internally call something like
join(p_id_unterhaltsvors, p_id, einkommen_m)
.
So what I'm asking is why we don't make the dependency to the three columns and the join call explicit:
def p_id_unterhaltsvors_empf_to_einkommen_m(
p_id_unterhaltsvors,
p_id,
einkommen_m,
):
return join(p_id_unterhaltsvors, p_id, einkommen_m)
The function p_id_unterhaltsvors_empf_to_einkommen_m
becomes part of the DAG as usual and can be "called" by adding a parameter with the same name to another function. Likewise, it can be overwritten by data columns with the same name. Now, this function can also have any name, so we could also call it, say, _unterhaltsvors_empf_einkommen_m
.
In this case, we would only provide the join
function but not augment the name convention.
That is fine, but we would need a mechanism of not calling jnp.vectorize
(or whatever it is called) on these functions, right? So far, all functions in GETTSIM (except for those special ones like aggregation etc.) are built on the premise that they are written for individual rows of the data. Hence my confusion.
But again,
I'd also be fine with doing something provisional right now and then re-consider when re-designing the interface after the move to a hierarchical scheme.
We could introduce another decorator to skip vectorization.
Fine, if you prefer that route, let's take it. When redesigning the interface, we should re-visit the whole structure and see what makes most sense across all these different functions.