iza-institute-of-labor-economics/gettsim

BUG: `join_numpy` return value if foreign key is < 0

Closed this issue · 5 comments

In the Unterhaltsvorschuss module, we use the join_numpy function to determine whether the parent that receives Kindergeld for a specific child is a single parent:

def parent_alleinerz(
    p_id_kindergeld_empf: np.ndarray[int],
    p_id: np.ndarray[int],
    alleinerz: np.ndarray[bool],
):
    return join_numpy(p_id_kindergeld_empf, p_id, alleinerz)

This returns True if p_id_kindergeld_empf is not determined, i.e. set to -1. True might be a bad default in this case.

Proposed solution

Allow for a default in the join_numpy function that is returned if the foreign key is not determined.
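A rough sketch of what such a default could look like. This is not the existing `join_numpy` implementation; the function name `join_with_default` and the parameter `value_if_foreign_key_is_missing` are made up for illustration, and the sketch assumes `primary_key` is non-empty and unique.

```python
import numpy as np


def join_with_default(foreign_key, primary_key, target, value_if_foreign_key_is_missing):
    """Hypothetical join that returns a default for unmatched foreign keys."""
    foreign_key = np.asarray(foreign_key)
    primary_key = np.asarray(primary_key)
    target = np.asarray(target)

    # Sort the primary keys so that searchsorted can locate each foreign key.
    order = np.argsort(primary_key)
    sorted_keys = primary_key[order]
    pos = np.searchsorted(sorted_keys, foreign_key)

    # Keys larger than every primary key land one past the end; clip them
    # back into bounds. They are filtered out by the equality check below.
    pos = np.clip(pos, 0, len(sorted_keys) - 1)

    # A foreign key counts as matched only if it is non-negative and
    # actually present among the primary keys.
    found = (foreign_key >= 0) & (sorted_keys[pos] == foreign_key)

    return np.where(found, target[order[pos]], value_if_foreign_key_is_missing)


p_id = np.array([0, 1, 2])
alleinerz = np.array([True, False, True])
p_id_kindergeld_empf = np.array([-1, 0, 2, 1])

result = join_with_default(p_id_kindergeld_empf, p_id, alleinerz, False)
```

With `False` as the default, the row whose `p_id_kindergeld_empf` is -1 no longer picks up a value from an unrelated person.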

IIUC, it does not return True in all cases; it returns the last value of the target array (i.e. the element at index -1).
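To illustrate, assuming the join boils down to positional indexing into the target array (I have not checked the actual implementation), plain NumPy treats -1 as "last element":

```python
import numpy as np

# Toy data: the last person happens to be a single parent.
alleinerz = np.array([False, False, True])

# First child has a known recipient (position 0), second one is undetermined (-1).
p_id_kindergeld_empf = np.array([0, -1])

# Fancy indexing with -1 silently wraps around to the last element.
looked_up = alleinerz[p_id_kindergeld_empf]
```

So the result for the undetermined row depends entirely on whoever is last in the array.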

I think the behaviour should special-case foreign keys below zero and return our default for missing values (-1 in case of ints, np.nan in case of floats, error (?) for bool).

We could also achieve that via an extra argument as @MImmesberger suggested, I'd be fine with both.
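For the dtype-based variant, something along these lines could work. This is only a sketch: `default_for_missing` is a hypothetical helper, not part of the code base, and raising for bools is just one of the options under discussion.

```python
import numpy as np


def default_for_missing(target):
    """Hypothetical helper: pick a missing-value default from the target dtype."""
    dtype = np.asarray(target).dtype

    # Bools have no natural missing value, so fail loudly instead of guessing.
    if np.issubdtype(dtype, np.bool_):
        raise TypeError("No implicit default for bool targets; pass one explicitly.")
    if np.issubdtype(dtype, np.integer):
        return -1
    if np.issubdtype(dtype, np.floating):
        return np.nan
    raise TypeError(f"No default defined for dtype {dtype}.")


int_default = default_for_missing(np.array([3, 4]))
float_default = default_for_missing(np.array([1.5, 2.5]))
```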

Inferring the default value from the target column data type sounds good to me. However, we don't want errors for bools (in my case, I use the parent ID as a foreign key, so there will always be missings). Maybe we can set the default explicitly for bools only? The correct default probably depends on the context.

If we need missings, we'll need to convert bools to int at the column/function level so long as there is no Jax support for missings.

Just in case there is a misunderstanding: The missings that I was referring to are the -1s of p_id_elternteil_x. In my mind, the new column won't have missings because we set defaults if the foreign key is -1.

Fair enough, the best thing is to be explicit indeed.