Allow specification of non-boolean expressions
bettinardi opened this issue · 1 comments
An enhancement to PopulationSim has been requested to allow non-Boolean controls to be specified. For example, a user could set the total income for a zone rather than the number of households in each income category (a distribution). In this case, a user might have a better sense of the average income than of the distribution. The user could multiply the average income by the number of households in the zone and set a control that looks like this:
controls.csv
target,geography,seed_table,importance,control_field,expression
SumIncome,MAZ,households,1000,SumIncome,households.income
control_totals_MAZ.csv
MAZ,SumIncome
1,5000000
The user requesting this change found that commenting out line 140 of populationsim/populationsim/integerizer.py allows the desired type of control (here's the removed line):
assert (max_incidence_value[control_is_hh_based] <= 1).all()
However, after further discussion, PopulationSim developers are concerned that removing line 140 might affect the priority in which controls are met. Some Boolean controls might be greatly underweighted compared to controls like the one in this example. Further, if several different styles of controls are introduced, their numeric size might have more or less influence on the outcome than the user expects, which could make setting the "importance" of each control difficult and hard to understand.
Therefore, this enhancement request is to allow enough flexibility in the controls that the average income example can be specified by the user, while internally normalizing the importance of each control so that the importance factors set by the user still make intuitive sense.
Lastly, I would just note (at the risk of making the scope of this issue too large) that the importance settings on the controls have never been very intuitive. It would be great to revisit how a user sets the importance of a variable and to implement it so that the outcome is easier to anticipate (increasing the importance by X results in a reasonable response in the result).
An improved capture of the issue from Binny Paul:
PopulationSim is designed to work with Logical or Boolean expressions (equality, inequality, and range) to specify marginal controls. Some examples of Boolean expressions are:
- household.income <= 25000
- person.age < 18
The user requesting the enhancement would like to specify a non-Boolean expression for a use-case described below. Currently, specification of a non-Boolean expression triggers the following assertion error:
assert (max_incidence_value[control_is_hh_based] <= 1).all()
The developers added this assertion as a sanity check on users' expressions. The user is requesting that the check be removed to allow specification of non-Boolean expressions.
Use-Case:
To control for average household income in the absence of a marginal income distribution.
If only target average income is available, the marginal control can be specified as total income for the zone computed as:
SumIncome = Total_households * target_average_income
The controls specification will look as follows:
controls.csv
target,geography,seed_table,importance,control_field,expression
SumIncome,MAZ,households,1000,SumIncome,households.income
control_totals_MAZ.csv
MAZ,SumIncome
1,5000000
The expression for this use case is "households.income", which is not a Boolean expression.
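As a rough illustration of how the control total above could be derived, here is a minimal Python sketch. The household count (50) and target average income (100,000) are assumed values chosen only so that the product matches the 5000000 in the example; the function name `sum_income_controls` is hypothetical and not part of PopulationSim.

```python
import csv
import io

# Hypothetical per-zone inputs (assumed values, not from PopulationSim):
# 50 households * 100,000 average income reproduces the example total.
zones = [{"MAZ": 1, "total_households": 50, "target_average_income": 100_000}]


def sum_income_controls(zones):
    """Return one row per zone with SumIncome = households * average income."""
    return [
        {
            "MAZ": z["MAZ"],
            "SumIncome": z["total_households"] * z["target_average_income"],
        }
        for z in zones
    ]


rows = sum_income_controls(zones)

# Write the rows in the same layout as control_totals_MAZ.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["MAZ", "SumIncome"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```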
Potential implementation issue:
This change could introduce large numbers into the incidence table. Currently, with Boolean expressions, the incidence table consists of only small integers. Mixing large and small integers can throw off the list balancing and integerization (linear programming) optimization. Further testing will be needed to evaluate the impact of this change.
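To make the magnitude concern concrete, the sketch below builds a toy incidence table in plain Python. It is a stand-in for illustration only, not PopulationSim's actual incidence code: the seed households, the control names, and the dictionary layout are all assumptions. The final check only mirrors the spirit of the assertion quoted above (household-based incidence values must not exceed 1).

```python
# Hypothetical seed households (assumed values for illustration).
households = [
    {"income": 18_000},
    {"income": 42_000},
    {"income": 67_000},
    {"income": 120_000},
    {"income": 210_000},
]

# One Boolean control and one non-Boolean control, written as plain functions.
controls = {
    "low_income": lambda hh: int(hh["income"] <= 25_000),  # yields 0 or 1
    "SumIncome": lambda hh: hh["income"],  # yields the raw income value
}

# Incidence table: one column of per-household values for each control.
incidence = {
    name: [expr(hh) for hh in households] for name, expr in controls.items()
}

# Largest incidence value per control; Boolean controls never exceed 1.
max_incidence = {name: max(column) for name, column in incidence.items()}

# Controls that would fail a "max incidence <= 1" sanity check.
violations = [name for name, value in max_incidence.items() if value > 1]

print(max_incidence)
print(violations)
```

In this toy table the non-Boolean column is several orders of magnitude larger than the Boolean one, which is the kind of mix the optimization would have to handle.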
Work-around with existing features:
In the absence of this feature, the users can test this workaround. The steps are as follows:
- Use the target average income to scale the household income in the seed population and save in temporary fields.
  Scaling factor = (target_average_income * total_households)/sum(hh_incomes)
- Use the temporary scaled income to generate income distribution using some thresholds (e.g., $25K, $50K, $100K, $150K+). For a closer match with the target average, use more thresholds.
- Use the distribution generated from the scaled income fields as a regular household income control on the original income field.
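The workaround steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the seed incomes and weights are made up, the bin labels are hypothetical, and sum(hh_incomes) is interpreted here as the weighted income sum of seed households whose weights add up to total_households. The resulting counts would then serve as targets for ordinary Boolean income-range controls on the original income field.

```python
from bisect import bisect_right

# Hypothetical seed households as (income, weight) pairs (assumed values).
seed = [
    (18_000, 10.0),
    (42_000, 10.0),
    (67_000, 10.0),
    (120_000, 10.0),
    (210_000, 10.0),
]

target_average_income = 100_000  # assumed known for the zone
total_households = sum(weight for _, weight in seed)  # 50 in this example

# Step 1: scaling factor so the weighted mean of scaled income hits the target.
# Assumption: sum(hh_incomes) is the weighted income sum over the seed.
weighted_income_sum = sum(income * weight for income, weight in seed)
scaling_factor = (target_average_income * total_households) / weighted_income_sum

# Step 2: bin the temporary scaled incomes using the example thresholds.
thresholds = [25_000, 50_000, 100_000, 150_000]
labels = ["lt_25k", "25k_to_50k", "50k_to_100k", "100k_to_150k", "150k_plus"]

control_totals = dict.fromkeys(labels, 0.0)
for income, weight in seed:
    scaled_income = income * scaling_factor  # temporary field, not saved back
    # bisect_right places a value equal to a threshold in the upper bin.
    control_totals[labels[bisect_right(thresholds, scaled_income)]] += weight

# Step 3: these weighted counts become the household income control targets.
print(control_totals)
```

Adding more thresholds makes the binned distribution track the target average more closely, at the cost of more controls to balance.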