# Detecting clinician implicit biases in diagnoses using proximal causal inference

[paper without appendix] [full paper]
We provide a user-friendly tool to detect implicit biases in observational datasets. This repository provides example applications to synthetic data as Jupyter notebooks under `notebooks/`. Application to real data should follow similarly, although feature curation and analysis is on a case-by-case basis.
The main class of our method is `ProximalDE` (found in `proximalde/proximal.py`), which calculates the implicit bias direct effect and provides access to all the auxiliary tests used to validate the result.
The repository can be installed by running the following commands:

```shell
git clone https://github.com/syrgkanislab/hidden_mediators
cd hidden_mediators
pip install -r requirements.txt
python setup.py install
```
Our Jupyter notebooks also include the required script for running on Google Colab. First-time setup (if `hidden_mediators` isn't installed manually or using code in your Colab space) requires running, from the Colab environment:

```python
from google.colab import drive
drive.mount('/content/drive')  # or wherever your drive is mounted
%cd drive/MyDrive/Colab\ Notebooks  # or whatever path where notebooks are stored
! git clone https://github.com/syrgkanislab/hidden_mediators
%cd hidden_mediators
! pip install -r requirements.txt
! python setup.py install
```
Again, you can also manually upload the `hidden_mediators` repository directly to the appropriate Colab directory.
After this initial installation, you only need to run:

```python
from google.colab import drive
drive.mount('/content/drive')  # or wherever your drive is mounted
%cd drive/MyDrive/Colab\ Notebooks/hidden_mediators  # adjust to where the repository is stored
```
The bulk of our method for estimating and evaluating implicit bias effects is in the `ProximalDE` class, which can be instantiated as follows:

```python
estimator = ProximalDE(model_regression, model_classification, binary_D, binary_Z, binary_X, binary_Y,
                       ivreg_type, alpha_multipliers, alpha_exponent, cv, semi, n_jobs, verbose, random_state)
```

where all the arguments are optional (see `proximalde/proximal.py` for further description of the arguments and defaults). Briefly, `ProximalDE` initializes the specifics of the models used for residualizing out the covariates $W$.
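To build intuition for what the residualization step does, here is a standalone sketch (not the repository's actual implementation) using scikit-learn's `LinearRegression` and cross-fitted out-of-fold predictions; all variable names and the data-generating process below are illustrative:

```python
# Illustrative sketch of residualizing a variable on the covariates W:
# predict the variable from W out-of-fold, then subtract the prediction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 500
W = rng.normal(size=(n, 5))                       # observed covariates
D = W @ rng.normal(size=5) + rng.normal(size=n)   # a decision that depends on W

# Out-of-fold prediction of D from W, then residualize.
D_hat = cross_val_predict(LinearRegression(), W, D, cv=5)
D_res = D - D_hat   # the part of D not explained by W; mean is near zero
```

The same operation is applied to each variable that must be purged of $W$ before the downstream estimation.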
The estimator is then fit over a dataset to calculate the implicit bias effect:

```python
estimator.fit(W, D, Z, X, Y)
```

where `W` denotes the observed covariates, `D` the decision variable, `Z` and `X` the two sets of proxies, and `Y` the outcome.
Results can then be computed and accessed via:

```python
estimator.summary()  # displays the tables
```

which will display three tables: (1) the table containing the point estimate, standard error, and confidence interval; (2) the table containing the R2 of each of the four models; and (3) the table containing the validation test results. Each table can also be accessed directly:

```python
sm = estimator.summary()
point_table = sm.tables[0]
r2_table = sm.tables[1]
test_table = sm.tables[2]
```
To run the fifth test, the proxy covariance rank test, we can run:

```python
svalues, svalues_crit = estimator.covariance_rank_test(calculate_critical=True)
```
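The idea behind a covariance rank check can be sketched in plain numpy (this is a conceptual illustration, not the repository's exact statistic or critical values): when the two proxy blocks share a small number of hidden factors, the sample cross-covariance between them is approximately low-rank, and its singular values reveal that rank.

```python
# Conceptual sketch: singular values of the sample cross-covariance
# between proxy blocks X and Z, generated from pm shared hidden factors.
import numpy as np

rng = np.random.default_rng(1)
n, pm = 2000, 2
M = rng.normal(size=(n, pm))                              # hidden mediators
X = M @ rng.normal(size=(pm, 6)) + 0.1 * rng.normal(size=(n, 6))
Z = M @ rng.normal(size=(pm, 4)) + 0.1 * rng.normal(size=(n, 4))

C = (X - X.mean(0)).T @ (Z - Z.mean(0)) / n               # cross-covariance
svals = np.linalg.svd(C, compute_uv=False)
# Roughly pm singular values dominate; the rest are near zero.
```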
Additional analysis tests are detailed in the notebooks and include:

- Running the weak-IV confidence interval (`estimator.robust_conf_int(alpha=0.05)`)
- Analyzing influence scores. We calculate the influence score of the estimate (as described in the paper) as well as other metrics like Cook's distance (`inf_diag = estimator.run_diagnostics()`). After running `run_diagnostics`, the size and effect of high-influence sets can be analyzed.
- Bootstrapped estimation by resampling and re-estimating at various stages (`estimator.bootstrap_inference(stage=stage, n_subsamples=n_subsamples, fraction=0.5)`)
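The mechanics of subsample bootstrap inference can be illustrated with a generic example (the repository's `bootstrap_inference` resamples at chosen estimation stages of the full pipeline; here we simply bootstrap a mean, with made-up data, to show the pattern of drawing subsamples without replacement and collecting percentile intervals):

```python
# Generic subsample bootstrap: re-estimate on many half-size subsamples
# drawn without replacement, then take percentile confidence bounds.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, size=1000)
n_subsamples, fraction = 500, 0.5
m = int(fraction * len(data))

estimates = np.array([
    rng.choice(data, size=m, replace=False).mean()
    for _ in range(n_subsamples)
])
lo, hi = np.percentile(estimates, [2.5, 97.5])   # 95% percentile interval
```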
Finally, we provide two additional algorithms we developed to aid in implicit bias estimation: semi-synthetic data generation and the proxy selection algorithm. We describe these further below.
We provide several notebooks with example computations and analyses:

This notebook presents the basics of how to use the `ProximalDE` class and how to run the tests we provide to validate the results. This includes a very simple influence analysis, as well as subsample bootstrap experiments.
- Running on synthetic data - Purely synthetic data can be generated for experimentation by specifying parameters into:

```python
W, X, Z, D, Y = gen_data(a, b, c, d, e, f, g, pm, pz, px, pw, n, sm=sm, seed=seed)
```

Knowing the implicit bias effect used to generate the data, we can validate the estimator's output.
- Semi-synthetic generator - Given a real dataset, we can compute a semi-synthetic dataset with known implicit bias effect $c = \theta$, as detailed in our paper (although in this notebook, we naively use purely synthetic data as the input dataset). We detail this process in the notebook.
- Custom regression models - If a user wants to try models other than the available options for residualizing $W$, we provide a notebook walking you through this. The main requirement is that the model inherits the `BaseEstimator` and `RegressorMixin` classes (from sklearn). We provide an example `XGBRegressorWrapper` custom model that can then be passed into the `ProximalDE` class to use for residualizing $W$ (i.e., `ProximalDE(model_regression=XGBRegressorWrapper(), semi=False)`).
- Proxy selection algorithm - In our work, we found that using the entire set of available proxies $X$ and $Z$ leads to both the dual and primal violation tests failing. In this notebook, we essentially work backwards by simulating synthetic data that fails both the dual and primal tests. Importantly, this notebook also introduces how to call and use the proposed proxy selection algorithm.
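The custom-model bullet above can be sketched without the repository: any regressor inheriting sklearn's `BaseEstimator` and `RegressorMixin` can be plugged in. The wrapper below is a hypothetical analog of the notebook's `XGBRegressorWrapper`, substituting sklearn's `GradientBoostingRegressor` so the sketch has no extra dependency; the class name and hyperparameters are illustrative:

```python
# Hypothetical custom model following sklearn's estimator conventions,
# analogous to passing ProximalDE(model_regression=MyGBRWrapper(), semi=False).
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import GradientBoostingRegressor

class MyGBRWrapper(BaseEstimator, RegressorMixin):
    """Minimal wrapper around a gradient-boosting regressor."""

    def __init__(self, n_estimators=100, max_depth=3):
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        self.model_ = GradientBoostingRegressor(
            n_estimators=self.n_estimators, max_depth=self.max_depth)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)

# Quick smoke test on synthetic data.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
model = MyGBRWrapper().fit(X, y)
```

Because `RegressorMixin` supplies a default `score` (R-squared), the wrapper also works anywhere sklearn scoring is expected.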
This notebook walks through each of the five tests we propose (dual violation, primal violation, the two strength identification tests, and the covariance rank test) and generates synthetic-data examples in which we expect, and see, that each test fails when its corresponding assumption is violated.
Applying this method to real data requires, first and foremost, careful selection of which variables you are designating as $W$, $D$, $Z$, $X$, and $Y$. After variables have been selected, missingness should be analyzed and handled appropriately. Finally, the data can be grouped into these variable sets and passed to the estimator.
- If $W, Z, X$ are high-dimensional (i.e., >50), repeatedly residualizing $W$ might be expensive, and it might be wise to save the residuals once and load them automatically for later usage.
- We recommend running `ProximalDE` over all the data before assuming the proxy selection algorithm should be used. If any of the tests fail, there likely needs to be a modification of variables (i.e., see the paper and `Tests.ipynb` for better intuition on how a test failure could inform how variables should be updated). Only if both the dual and primal tests fail should the proxy selection algorithm be run (and again, it should be done on a separate split of the data from the split used to evaluate the proxy subset with `ProximalDE`).
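The residual-caching tip above can be implemented with a few lines of numpy; the helper name and file path below are illustrative, not part of the repository:

```python
# Compute expensive residuals once, save to disk, and reload on later runs.
import os
import tempfile
import numpy as np

def get_residuals(path, compute_fn):
    """Load cached residuals if present; otherwise compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    res = compute_fn()
    np.save(path, res)
    return res

cache = os.path.join(tempfile.mkdtemp(), 'Dres.npy')
res1 = get_residuals(cache, lambda: np.arange(5.0))  # computes and saves
res2 = get_residuals(cache, lambda: None)            # loads from the cache
```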
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or feedback, please contact:
- Name: Kara Liu
- Email: karaliu [at] stanford . edu