Welcome to the RandomCausalGraphs repository! This guide will introduce you to the functionalities of the CausalGraph
class, which is designed for creating synthetic causal graphs and simulating both observational and interventional data using various structural equation models (SEMs).
Check out this blog post for a deeper explanation of this package!
CausalGraph is a Python class that allows users to generate random Directed Acyclic Graphs (DAGs) based on specified parameters. It supports various graph generation methods and SEM types, making it versatile for simulating complex causal structures in computational experiments.
- Erdos-Renyi (ER): Randomly create edges between nodes with a fixed probability.
- Barabasi-Albert (SF): Nodes are added sequentially and connected to existing nodes via preferential attachment. The parameter m is calculated from the edge probability input parameter; see lines 72-78 in synthetic_causal_graphs.py for details.
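As an illustration of the Erdos-Renyi scheme (a minimal sketch, not the repository's implementation), a DAG can be drawn by including each edge i → j with i < j independently with probability p; restricting edges to the strict upper triangle guarantees acyclicity:

```python
import numpy as np

def random_er_dag(n_nodes: int, p: float, seed: int = 0) -> np.ndarray:
    """Sketch of an Erdos-Renyi DAG: each edge i -> j with i < j
    is included independently with probability p."""
    rng = np.random.default_rng(seed)
    # A strictly upper-triangular mask guarantees acyclicity
    # under the topological order 0, 1, ..., n_nodes - 1.
    upper = np.triu(np.ones((n_nodes, n_nodes), dtype=bool), k=1)
    edges = rng.random((n_nodes, n_nodes)) < p
    return (upper & edges).astype(int)

A = random_er_dag(n_nodes=5, p=0.5, seed=42)
print(A)  # adjacency matrix; A[i, j] == 1 means edge i -> j
```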
The class supports a wide range of SEM types with additive and non-additive noise; the corresponding sem_type parameter value is shown in quotes:
- Linear SEMs: A linear model with additive noise. The noise variable can be sampled from the Gaussian ('gauss'), exponential ('exp'), Gumbel ('gumbel'), or uniform ('uniform') distribution.
- Non-linear SEMs: A multi-layer perceptron or multiple-interaction model, each available with additive noise ('mlp' or 'mim') or non-additive noise ('mlp-non-add' or 'mim-non-add').
- Discrete and other models: 'logistic', 'poisson'
- Gaussian processes: 'gp' or 'gp-add', depending on whether the parent nodes are modelled jointly in a multi-dimensional GP ('gp') or independently, by applying a GP to each parent node and summing the results ('gp-add').
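To make the linear-SEM case concrete, here is a minimal numpy sketch (not the package's code) that propagates additive Gaussian noise through a weighted DAG in topological order:

```python
import numpy as np

def simulate_linear_gauss(W: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Linear SEM with additive Gaussian noise:
    X_j = sum_i W[i, j] * X_i + eps_j,  eps_j ~ N(0, 1).
    Assumes W is the weighted adjacency matrix of a DAG whose
    topological order is 0, 1, ..., d - 1."""
    rng = np.random.default_rng(seed)
    d = W.shape[0]
    X = np.zeros((n_samples, d))
    for j in range(d):  # visit nodes in topological order
        X[:, j] = X @ W[:, j] + rng.normal(0.0, 1.0, size=n_samples)
    return X

# Chain 0 -> 1 -> 2 with weights 2.0 and -1.5
W = np.array([[0.0, 2.0,  0.0],
              [0.0, 0.0, -1.5],
              [0.0, 0.0,  0.0]])
X = simulate_linear_gauss(W, n_samples=500, seed=42)
```

Because the topological order is 0, 1, ..., d-1, each node only reads columns that have already been filled in.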
The w_ranges parameter specifies the ranges of weights for the causal links between nodes:
- Default range: ((-2.0, -0.5), (0.5, 2.0))
- Note that weights are sampled from disjoint uniform distributions over these specified ranges.
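A sketch of how such disjoint-uniform sampling can work (an illustration under the default ranges above, not the repository's implementation): pick one of the two ranges at random for each weight, then draw uniformly within it, so no weight lands in the excluded interval (-0.5, 0.5).

```python
import numpy as np

def sample_weights(n: int,
                   w_ranges=((-2.0, -0.5), (0.5, 2.0)),
                   seed: int = 0) -> np.ndarray:
    """Sample n edge weights from disjoint uniform ranges:
    choose one of the ranges uniformly at random for each weight,
    then draw uniformly within that range."""
    rng = np.random.default_rng(seed)
    ranges = np.asarray(w_ranges)              # shape (n_ranges, 2)
    idx = rng.integers(len(ranges), size=n)    # which range each weight uses
    low, high = ranges[idx, 0], ranges[idx, 1]
    return rng.uniform(low, high)

w = sample_weights(10, seed=42)
```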
To install the required libraries, refer to the requirements.txt file.
from src.synthetic_causal_graphs import CausalGraph
# Initialize a CausalGraph with 100 nodes using Erdos-Renyi model and Gaussian SEM type
graph = CausalGraph(n_nodes=100, p=0.1, graph_type='ER', sem_type='gauss', seed=42)
To sample observational data from the graph:
# Simulate observational data from the graph
# Shape of X is (n_samples, n_nodes). The columns follow the topological order of the nodes, that is 0, 1, 2,...n_nodes-1.
X = graph.simulate_sem(n_samples=500)
To simulate data under an intervention (e.g., setting a node to a fixed value):
# Simulate data with an intervention on node 10, setting its value to 0
# Shape of X_intervened is (n_samples, n_nodes). The columns follow the topological order of the nodes, that is 0, 1, 2,...n_nodes-1.
X_intervened = graph.simulate_sem(n_samples=500, intervened_node=10, intervened_value=0)
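The semantics of such a do() intervention can be sketched with a toy linear-Gaussian SEM (an illustration, not the package's code): the intervened node is clamped to a fixed value, its ancestors are unaffected, and only its descendants change. Drawing (and discarding) the noise for the clamped node keeps the random stream aligned with the observational run, so ancestors take identical values under the same seed.

```python
import numpy as np

def simulate_with_intervention(W, n_samples, intervened_node=None,
                               intervened_value=0.0, seed=0):
    """Linear-Gaussian SEM with an optional do() intervention.
    W is a weighted DAG adjacency with topological order 0..d-1."""
    rng = np.random.default_rng(seed)
    d = W.shape[0]
    X = np.zeros((n_samples, d))
    for j in range(d):
        # Draw noise for every node, even a clamped one, so the
        # random stream matches the observational run exactly.
        noise = rng.normal(0.0, 1.0, size=n_samples)
        if j == intervened_node:
            X[:, j] = intervened_value  # clamp: ignore parents and noise
        else:
            X[:, j] = X @ W[:, j] + noise
    return X

# Chain 0 -> 1 -> 2: intervening on node 1 changes node 2 only
W = np.array([[0.0, 2.0,  0.0],
              [0.0, 0.0, -1.5],
              [0.0, 0.0,  0.0]])
X_obs = simulate_with_intervention(W, 100, seed=42)
X_int = simulate_with_intervention(W, 100, intervened_node=1,
                                   intervened_value=0.0, seed=42)
```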
The fitness function allows you to calculate an outcome Y from the samples by computing the weighted mean of selected variables and adding Gaussian noise. The variables are selected according to a specified strategy, either midpoint or last_few.
Mathematically, the fitness function can be described as:
Y = ( X ⊙ M )⋅θ + noise
where:
- X is the matrix of samples with shape (n_samples, n_nodes).
- M is a sparse mask matrix with the same shape as X; it selects specific columns (variables) according to the sampling strategy.
- ⊙ denotes element-wise multiplication.
- θ is the vector of weights for the selected variables.
- noise is Gaussian noise added to the weighted sum.
The possible strategies used to sample the parent nodes (variables) are:
- Midpoint: Selects variables around the midpoint of the total number of nodes.
- Last Few: Selects variables from the last few nodes.
The weights θ are sampled from disjoint uniform distributions with default ranges [-2.0, -0.5] and [0.5, 2.0].
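The formula Y = (X ⊙ M)⋅θ + noise can be sketched as follows (an illustration of the idea, not the package's fitness implementation; the contiguous index windows for the two strategies are an assumption of this sketch):

```python
import numpy as np

def fitness_sketch(X, proportion=0.1, noise_std=0.1,
                   strategy='last_few', seed=0):
    """Y = (X restricted to selected columns) @ theta + noise.
    Selects a proportion of the columns either around the midpoint
    of the nodes or from the last few nodes."""
    rng = np.random.default_rng(seed)
    n_samples, n_nodes = X.shape
    k = max(1, int(proportion * n_nodes))
    if strategy == 'midpoint':
        start = (n_nodes - k) // 2
        idx = np.arange(start, start + k)
    else:  # 'last_few'
        idx = np.arange(n_nodes - k, n_nodes)
    # Weights from disjoint uniform ranges, mirroring the default w_ranges
    sign = rng.choice([-1.0, 1.0], size=k)
    theta = sign * rng.uniform(0.5, 2.0, size=k)
    Y = X[:, idx] @ theta + rng.normal(0.0, noise_std, size=n_samples)
    return Y, idx, theta
```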
Here is an example of how to use the fitness function:
import numpy as np
from src.synthetic_causal_graphs import CausalGraph
from src.utils import fitness
# Generate samples using CausalGraph
n_nodes = 1000
p = 0.2
n_samples = 100
seed = 42
graph_type = 'ER'
sem_type = 'mlp'
model = CausalGraph(n_nodes, p, graph_type, sem_type, seed=seed)
X = model.simulate_sem(n_samples, noise_scale=1.0)
# Calculate fitness
fitness_values, sampled_indices, theta = fitness(X, noise_std=0.1, proportion=0.1, seed=seed, strategy='last_few')
print("Fitness Values:", fitness_values)
print("Sampled Indices (parent nodes):", sampled_indices)
print("Theta (Weights):", theta)
Using a fixed seed ensures that the generated graphs, sampled data, and fitness calculations are reproducible. Here's how the seed impacts different aspects of the simulation:
With a fixed seed, the structure of the generated graph remains the same across multiple class instantiations with the same input parameters. This includes the same set of nodes and edges.
When sampling observational data, using the same seed ensures that the sampled data is identical across different runs of simulate_sem.
Similarly, using the same seed for interventional data ensures that the interventional samples are consistent across runs, provided the intervention is applied to the same node with the same value. Note that the ancestors of the intervened node will then take the same values as when sampling observational data with the same seed. This is useful for calculating the counterfactual!
Using the same seed in the fitness function ensures that the same indices (parent nodes) and weights are sampled, resulting in consistent fitness values across runs. The mask matrix M and the weights θ will be the same for repeated runs, ensuring that the computed fitness values are reproducible.
By setting the seed parameter consistently, you can ensure the repeatability of your experiments, making your simulations and results reliable and reproducible.
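The underlying mechanism is simply that a pseudo-random generator initialised with the same seed produces the same stream of values, as this small numpy check shows:

```python
import numpy as np

# Two generators seeded identically produce identical streams,
# so any simulation driven by them is bit-for-bit reproducible.
a = np.random.default_rng(42).normal(size=5)
b = np.random.default_rng(42).normal(size=5)
print(np.array_equal(a, b))  # True
```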