/intdag

Estimation and inference of a directed acyclic graph with unspecified interventions.

Primary LanguageRMIT LicenseMIT

intdag: An R Package For Interventional DAG Estimation and Inference

This repository contains an implementation of the paper

Installation

R Packages

First, install the package glmtlp which implements the DC projection algorithm in the paper.

devtools::install_github("chunlinli/glmtlp")

Then install the package intdag which offers the peeling causal discovery method and data-perturbation/asymptotic inference.

devtools::install_github("chunlinli/intdag/intdag")

R Scripts of utilities

Before proceeding, source the following R scripts for utility functions required in the following illustration.

Assume the working directory is the cloned repository https://github.com/chunlinli/intdag.

source("utility.R")
set.seed(1110)

A Quick Example

First, we generate a random graph with p=10 and q=20.

p <- 10
q <- 20
graph <- graph_generation(p = p, q = q, graph_type = "random", iv_sufficient = FALSE)

Note that we use the option iv_sufficient = FALSE, which means a crucial assumption in the paper -- Assumption 1C is violated. Let's print the $U$ matrix.

graph$u

This is a p by p adjacency matrix of the DAG that we want to recover and/or make inference. Note that $U_{jk} \neq 0$ means an edge from $Y_{j}$ to $Y_{k}$.

Then print the $W$ matrix.

graph$w

This is a q by p matrix, indicating the interventional relations of an intervention varibale $X_{l}$ and a primary variable $Y_{j}$, where $W_{lj} \neq 0$ means $X_{l}$ directly intervenes on $Y_{j}$. In above, w corresponds to the simulation Setup B in the paper.

Next, generate a random sample of size n=200 based on the graph. According to the analysis in the paper, the distribution of intervention variables $X$ does not matter too much. Here we generate $X$ so that they have an AR(1) correlation structure.

n <- 200
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
rho <- 0.5
if (rho != 0) {
    for (j in 2:q) {
        x[, j] <- sqrt(1 - rho^2) * x[, j] + rho * x[, j - 1]
    }
}

Note that x is an n by q matrix.

Then we generate the $Y$ variables, the ones of primary interest.

y <- (x %*% graph$w + rmvnorm(n, sigma = diag(seq(from = 0.5, to = 1, length.out = p), p, p))) %*% solve(diag(p) - graph$u)

Here y is an n by p matrix.

Now fit the model using intdag. Note that intdag calls glmtlp for solving multi-response regression in its first step.

v_out <- v_estimation(y = y, x = x, model_selection = "bic")

Second, use peeling algorithm to recover the topological layers.

top_out <- topological_order(v_out$v)
an_mat <- top_out$an_mat
in_mat <- top_out$in_mat

Third, refit to estimate $U$ and $W$.

discovery_out <- causal_discovery(y = y, x = x, an_mat = an_mat, in_mat = in_mat)

The final estimate for $U$ is:

discovery_out$u

We compare the final estimate with the true graph in the structural Hamming distance.

sum(abs((abs(discovery_out$u) > 0.05) - (graph$u != 0)))

Here we use a truncation threshold 0.05 to screen the small (noisy) coefficients.

Contents

The R package is placed in directory ./intdag/.

The extensive simulation code is placed in directory ./simulation/.

Citing information

If you find the code useful, please consider citing

@article{li2023inference,
    author = {Chunlin Li, Xiaotong Shen, Wei Pan},
    title = {Inference for a large directed acyclic graph with unspecified interventions},
    journal = {Journal of Machine Learning Research},
    year = {2023}
}

Implementing these algorithms is error-prone and this code project is still in development. Please file an issue if you encounter any error when using the code. I will be grateful to be informed.