Humans make tens of thousands of decisions every day. Project Causica aims to develop machine learning solutions for efficient decision making that demonstrate human expert-level performance across all domains. We develop advanced machine learning solutions in causal discovery, causal inference, and Bayesian experimental design using modern (probabilistic) deep learning methods. Our conceptual framework is to divide decisions into two types: "best next question" and "best next action".
In daily life, one type of decision we make relates to information gathering for "get to know" decisions; for example, a medical doctor takes a medical test to decide the correct diagnosis for a patient. Humans are very efficient at gathering information and drawing the correct conclusion, while most deep learning methods require significant amounts of training data. Thus, the first part of project Causica focuses on enabling machine learning solutions to gather personalized information, allowing the machine to know the "best next question" and make a final judgment efficiently [1,2,6]. Our technology for "best next question" decisions is driven by state-of-the-art algorithms for Bayesian experimental design and active learning.
The second type of decision made in many domains is intervention, which can be summarised as selecting the "best next action ". For example, a business owner needs to decide which action will lead to greater customer satisfaction; a teacher needs to decide which exercise to give students to help them learn most effectively. Humans are very efficient at making such decisions, primarily due to our causal reasoning abilities. Thus, in project Causica, we develop end-to-end causal inference methods which use existing data to perform causal discovery and compute causal inference quantities such as (conditional) average treatment effect [8,3]. We not only provide our deep learning-based framework, but we also connect other existing methods such as DoWhy and EconML as alternatives, allowing users to choose the most suitable methods for their applications.
With these decision-making goals, one can use our codebase in an end-to-end way for decision-making. We also provide the flexibility to use any core functionalities such as missing value prediction, best next question, causal discovery, causal inference, etc, separately depending on the users' needs.
Our technology has enabled personalized decision-making in real-world systems, combining multiple advanced research methodologies in simple APIs suitable for research development in the research community, and commercial use by data scientists and developers. For commercial applications, please reach out to us at causica-request@microsoft.com if you are interested in using our technology as a service.
This project splits causal end to end code from the Azua repo found here Azua.
Real-world data-driven decision making requires causal inference to ensure the validity of drawn conclusions. However, it is very uncommon to have a-priori perfect knowledge of the causal relationships underlying relevant variables. DECI allows the end user to perform causal inference without having complete knowledge of the causal graph. This is done by combining the causal discovery and causal inference steps in a single model. DECI takes in observational data and outputs ATE and CATE estimates.
For more information, please refer to the paper.
Model Description
DECI is a generative model that employs an additive noise structural equation model (ANM-SEM) to capture the functional relationships among variables and exogenous noise, while simultaneously learning a variational distribution over causal graphs. Specifically, the relationships among variables are captured with flexible neural networks while the exogenous noise is modelled as either a Gaussian or spline-flow noise model. The SEM is reversible, meaning that we can generate an observation vector from an exogenous noise vector through forward simulation and given a observation vector we can recover a unique corresponding exogenous noise vector. In this sense, the DECI SEM can be seen as a flow from exogenous noise to observations. We employ a mean-field approximate posterior distribution over graphs, which is learnt together with the functional relationships among variables by optimising an evidence lower bound (ELBO). Additionally, DECI supports learning under partially observed data.
Simulation-based Causal Inference
DECI estimates causal quantities (ATE / CATE) by applying the relevant interventions to its learnt causal graph (i.e. mutilating incoming edges to intervened variables) and then sampling from the generative model. This process involves first sampling a vector of exogenous noise from the learnt noise distribution and then forward simulating the SEM until an observation vector is obtained. ATE can be computed via estimating an expectation over the effect variable of interest using MonteCarlo samples of the intervened distribution of observations. DECI can not simulate samples from conditional intervened distributions directly. However, we can learn an estimator of the conditional intervened distribution from samples of the joint distribution of conditioning variables and effect variables. Specifically, we use radial-basis-function kernel regression to evaluate expectations in the CATE formula.
Specifying a prior over graphs
If some a-prior knowledge of the causal graph is available, it can be incorporated as a prior for DECI by using the informed_deci
model. The a-prior graph should is passed to DECI as a part of the CausalDataset
. The hyperparameters used to specify strength of belief in the prior graph are discussed below.
Specifying a noise model
If the exogenous noise is known to be Gaussian, we recommend employing the deci_gaussian
model. This model has its Gaussian exogenous noise distribution mean set to 0 while its variance is learnt.
In the more general setting, we recommend using deci_spline
. Here the exogenous noise distribution is a flexible spline flow that is learnt from the data. This model provides most gains in heavy-tailed noise settings, where the Gaussian model is at risk of overfitting to outliers.
The command to train this model with the csuite_symprod_simpson dataset and evaluate causal discovery and (C)ATE estimation performance is:
python run_experiment.py csuite_symprod_simpson --model_type deci_spline --model_config configs/deci/true_graph_deci_spline.json -dc configs/dataset_config_causal_dataset.json -c -te
We use different data for each of our 3 experiments: large scale evaluation data, csuite synthetic data and Twins/IHDP
Large scale evaluation
There's three different types of data we use:
- Synthetic. Samples a random graph with N nodes and E edges and generates the data using gaussian processes for the functions in the SEM. Go to
causica/data_generation/large_synthetic
and runpython data_generation.py
. This will store the datasets generated in thecausica/data_generation/large_synthetic/data/
directory. - Pseudo-real. Syntren, a simulator that creates synthetic transcriptional regulatory networks and produces simulated gene expression data that approximates experimental data. To generate the data open the file
causica/data_generation/syntren/generate.py
and follow the instructions within. - Real. Protein cells datasets, from Sachs et. al. 2005. The data measures the expression level of different proteins and phospholipids in human cells. To generate the data open the file
causica/data_generation/protein_cells/generate.py
and follow the instructions within.
CSuite
The CSuite datasets can generated by running causica/data_generation/csuite/simulate.py
from the top-level directory of the package.
The datasets are created inside data/
.
Twins & IHDP
We also include two causal inference benchmark datasets for ATE evaluation: the twins
(twin birth datasets in the US)
dataset and IHDP
dataset (Health and Development Program data). Both are semi-synthetic datasets that are processed based
on similar procedures as described in (Louizos et al., 2017), as detailed below.
-
IHDP
dataset. This can be generated by simply callingresearch_experiments/load_ihdp_to_generate_CATE_dataset.py
. This dataset contains measurements of both infants and their mother during real-life data collected in a randomized experiment. The main task is to estimate the effect of home visits by specialists on future cognitive test scores of infants. The outcomes of treatments are simulated artificially as in (Hill et al., 2011) hence both the factual and counterfactual outcomes of both treatments (home visits or not) on each subject are known (counterfactuals for evaluation only). To make the task more challenging, additional confoundings are manually introduced by removing a subset (non-white mothers) of the treated children population. In this way we can construct theIHDP
dataset of 747 individuals with 6 continuous covariates and 19 binary covariates. All continuous covariates are normalized. -
Twins
dataset. This can be generated by callingresearch_experiments/load_twins_to_generate_CATE_dataset.py
. This dataset consists of twin births in the US between 1989-1991. Only twins which with the same sex born weighing less than 2kg are considered. The treatment is defined as being born as the heavier one in each twins pair, and the outcome is defined as the mortality of each twins in their first year of life. Therefore, by definition for each pair of twins, we both factual and counterfactual outcomes of the treatments (counterfactuals for evaluation only). Following (Louizos et al, 2017), we also introduce artificial confounding using the categoricalGESTAT10
variable. All continuous covariates are normalized.
To reproduce all results (with the pipeline) you will need to generate all the data, as described above, and run all simulations as described here.
Methods are specified using a combination of the --model_type
and -model_config
arguments to the run_experiment.py
script.
See the section for IHDP and Twins Benchmarks for illustrative examples of the commands that are used.
Once runs are finished, all experimental results can be downloaded by pointing the script research_experiments/e2e_causal_inference/notebooks/download_results.py
at the experiment name of interest on Azure ML. This will download the results to a local folder of choice. These results can then be loaded using functionality in research_experiments/e2e_causal_inference/notebooks/utils.py
. We provide a number of notebooks which load this data and generate the plots in the paper "Deep End to End Causal Inference" as described below.
To run DECI on IHDP
and twins
datasets, simply run one of the following, using IHDP ATE as an example (for CATE estimation,
or twins dataset, simply change ihdp_norm
to ihdp_norm_cate
, twins
, or twins_cate
):
- DECI Gaussian on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_gaussian --model_config configs/deci/deci_gaussian.json -te -dc configs/dataset_config_causal_dataset.json
- DECI Spline on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_spline --model_config configs/deci/deci_spline.json -te -dc configs/dataset_config_causal_dataset.json
- True graph DECI Gassian on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_gaussian --model_config configs/deci/true_graph_deci_gaussian.json -te -dc configs/dataset_config_causal_dataset.json
- True graph DECI Spline on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_spline --model_config configs/deci/true_graph_deci_spline.json -te -dc configs/dataset_config_causal_dataset.json
- DECI Gaussian DoWhy linear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_dowhy --model_config configs/deci/deci_gaussian_dowhy_linear.json -te -dc configs/dataset_config_causal_dataset.json
- DECI Spline DoWhy linear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_dowhy --model_config configs/deci/deci_spline_dowhy_linear.json -te -dc configs/dataset_config_causal_dataset.json
- True graph DoWhy linear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type true_graph_dowhy --model_config configs/deci/dowhy_linear.json -te -dc configs/dataset_config_causal_dataset.json
- DECI Gaussian DoWhy nonlinear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_dowhy --model_config configs/deci/deci_gaussian_dowhy_nonlinear.json -te -dc configs/dataset_config_causal_dataset.json
- DECI Spline DoWhy nonlinear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type deci_dowhy --model_config configs/deci/deci_spline_dowhy_nonlinear.json -te -dc configs/dataset_config_causal_dataset.json
- True graph DoWhy nonlinear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type true_graph_dowhy --model_config configs/deci/dowhy_nonlinear.json -te -dc configs/dataset_config_causal_dataset.json
- PC DoWhy linear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type pc_dowhy --model_config configs/deci/pc_dowhy_linear.json -te -dc configs/dataset_config_causal_dataset.json
- PC DoWhy nonlinear on IHDP for ATE estimation:
python run_experiment.py ihdp_norm --model_type pc_dowhy --model_config configs/deci/pc_dowhy_nonlinear.json -te -dc configs/dataset_config_causal_dataset.json
To launch continuous CSuite experiments, use a command of the form
python run_experiment.py <dataset> --model_type <model_type> --model_config <model_config> -c -te -dc configs/dataset_config_causal_dataset.json
where <dataset>
should be one of
csuite_lingauss
,csuite_linexp
,csuite_nonlingauss
,csuite_nonlin_simpson
,csuite_symprod_simpson
,csuite_large_backdoor
,csuite_weak_arrows
. The--model_type
and--model_config
arguments can be specified as in the section for IHDP and Twins Benchmarks, but note the use of the-c
flag for CSuite datasets to evaluate causal discovery.
To launch discrete CSuite experiments, the -el
argument should also be passed, giving
python run_experiment.py <dataset> --model_type <model_type> --model_config <model_config> -c -te -el -dc configs/dataset_config_causal_dataset.json
where <dataset>
should be one of
csuite_cat_to_cts
,csuite_cts_to_cat
,csuite_cat_collider
,csuite_cat_chain
,csuite_mixed_simpson
,csuite_large_backdoor_binary_t
,csuite_weak_arrows_binary_t
,csuite_mixed_confounding
. The--model_type
and--model_config
arguments can again be specified as in the section for IHDP and Twins Benchmarks.
The illustrative plots from the paper can be generated using research_experiments/e2e_causal_inference/notebooks/CATE_plots.ipynb
. However, this requires DECI and true_graph_DECI to have been run locally or the complete run output folders to have been downloaded from Azure ML.
To launch large-scale graph experiments, use a command of the form
python run_experiment.py <dataset> --model_type <model_type> --model_config <model_config> -c <-te> -dc configs/dataset_config_causal_dataset.json
where <dataset>
should either be {dataset}_seed_{seed}_std
, with
{dataset}
: eitherprotein_cells
orsyntren
{seed}
: the seed used for dataset splitting, in[1, 5]
- A
miss_
prefix can be appended for experiments with partial observability
or follow this convention for ER and SF graphs {graph_type}_{N}_{E}_spline_sem_{noise_type}_noise_{seed}_seed
, with:
{graph_type}
: eitherER
orSF
{N}
: the number of nodes; and{E}
: the number of edges - we considered{N}
in[16, 64, 128]
and{E}
in[N, 4N]
{noise_type}
: eitherfixed
(for Gaussian noise) ormlp
(for non-Gaussian noise){seed}
: the seed used for the dataset generation - we used the range[0, 4]
- An additional
_partial
postfix for experiments with partial observability
The --model_type
and --model_config
arguments can be specified as in the section for IHDP and Twins Benchmarks, but
note the use of the -c
and -te
flags for causal discovery evaluation and treatment effect evaluation, respectively. Note, the -te
flat is not used for the protein
or syntron
datasets.
A full example would be:
python run_experiment.py SF_64_64_spline_sem_mlp_noise_4_seed --model_type deci_dowhy --model_config configs/deci/deci_spline_dowhy_nonlinear.json -c -te -dc configs/dataset_config_causal_dataset.json
Experiments are launched by calling python evaluation_pipeline/runs/run_icml_inform.py
Once results are downloaded locally, they can be viewed with research_experiments/e2e_causal_inference/notebooks/prior_plots.ipynb
. Due to a slightly different experiment structure, this notebook uses helper functions stored in research_experiments/e2e_causal_inference/notebooks/prior_notebook_utils.py
.
Model configs:
lambda_dag
: Coefficient for the prior term that enforces the learnt graph to be a DAG.lambda_sparse
: Coefficient for the prior term that enforces similarity to the prior matrix W_0 or sparsity in the case of an empty prior matrix.tau_gumbel
: Temperature for the gumbel softmax trick.spline_bins
: How many bins to use for spline flow base distribution if the 'spline' choice is made.layers_imputer
: Number and size of hidden layers for imputer NN (e.g. [20,10,20] for a 3 hidden layer MLP with 20,10 and 20 neurons per hidden layer).var_dist_A_mode
: Variational distribution for adjacency matrix. Admits {"simple", "enco", "true"}. "simple" parameterizes each edge (including orientation) separately. "enco" parameterizes existence of an edge and orientation separately. "true" uses the true graph.norm_layers
: bool indicating whether all MLPs should use layer normres_connection
: bool indicating whether all MLPs should use layer normlayers_encoder
: Optional list indicating width of layers in GNN encoder MLP. If unspecified, defaults to a size dependent on size of observation vector.layers_decoder
: Optional list indicating width of layers in GNN decoder MLP. If unspecified, defaults to a size dependent on size of observation vector.cate_rff_n_features
: number of random features to use in function approximation when estimating CATE.cate_rff_lengthscale
: lengthscale of RBF kernel used when estimating CATE.prior_A
: prior adjacency matrix. Does not need to be a DAG.prior_A_confidence
: degree of confidence in prior adjacency matrix enabled edges between 0 and 1. e.g. a value of 0.5 enforces a belief that the entries in prior_A are correct 50% of the time.
Training hyperparameters:
learning_rate
: Learning rate used at the beginning of each augmented Lag steplikelihoods_learning_rate
: Learning rate used for noise modelsbatch_size
: Batch size usedstardardize_data_mean
: Whether to center data or notstardardize_data_std
: Whether to standardize data to have unit variance or notrho
: Initial rho for the auglag procedure. This is the value multiplying the square of the dag penaltyalpha
: Initial alpha for the auglag procedure. This is the term multiplying the linear dag penaltysafety_rho
: Max value of rho allowedsafety_alpha
: Max value of alpha allowedmax_steps_auglag
: Max number of auglag steps allowedmax_p_train_dropout
: Max probability for artificially dropping data during training to regularize imputation networkanneal_entropy
: Admits "linear" and "noanneal". With the latter the entropy term in the ELBO (of q(A)) is left without any changes, with the former it is annealed during training. More specifically, for the "auglag" optimization mode, it annealed with a factor 1/step after the 5th auglag step.progress_rate
: minimum ratio of relative decrease in DAG penalty from one outer loop step to the next of augmented Lag optimisation. If the progress_rate is not met, then the strength of the quadratic DAG penalty is increased.max_auglag_inner_epochs
: Number of optimisation steps to take in inner loop of augmented Lag optimisationmax_epochs_per_step
: Max number of epochs allowed per auglag stepreconstruction_loss_factor
: Multiplier for reconstruction loss in partially observed data setting
If you have used the models in our code base, please consider to cite the corresponding papers:
[1], (VISL) Pablo Morales-Alvarez, Wenbo Gong, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Nick Pawlowski, Miltiadis Allamanis, Cheng Zhang, "Simultaneous Missing Value Imputation and Structure Learning with Groups", ArXiv preprint
[2], (DECI) Tomas Geffner, Javier Antoran, Adam Foster, Wenbo Gong, Chao Ma, Emre Kiciman, Amit Sharma, Angus Lamb, Martin Kukla, Nick Pawlowski, Miltiadis Allamanis, Cheng Zhang. Deep End-to-end Causal Inference. Arxiv preprint (2022)
Other references:
- Louizos, Christos, et al. "Causal effect inference with deep latent-variable models." Advances in neural information processing systems 30 (2017).
- Hill, Jennifer L. "Bayesian nonparametric modeling for causal inference." Journal of Computational and Graphical Statistics 20.1 (2011): 217-240.
We use Poetry to manage the project dependencies, they're specified in the pyproject.toml. To install poetry run:
curl -sSL https://install.python-poetry.org | python3 -
To install the environment run poetry install
, this will create a virtualenv that you can use by running either poetry shell
or poetry run {command}
. It's also a virtualenv that you can interact with in the normal way too.
More information about poetry can be found here
We use mlflow for logging metrics and artifacts. By default it will run locally and store results in ./mlruns
.
It is called inside run_experiment_on_parsed_args
and the target (SQL database etc) can be changed as required.