scmopy is a composite package for causal discovery and analysis using several novel Structural Causal Model (SCM) optimization algorithms.
scmopy provides distribution-agnostic methods for identifying causality: users do not need to satisfy any specific distributional assumption about the dataset at any stage of the process, from causal estimation to hypothesis testing.
The package is mainly structured in three parts:
- ESA-2SCM (Elastic Segment Allocation-based Two-Stage Least Squares SCM)
  - ESA-2SCM is a new method for detecting causality that builds synthetic instrumental variables through Elastic Segment Allocation and applies 2SLS to estimate structural causal models.
  - For details and documentation, please refer to my original article:
    - Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230>
- Gradient Non-Gaussian SCM
  - Gradient Non-Gaussian SCM uses the information in higher-order moment structures, assuming non-Gaussianity, to determine the true causal direction.
  - Gradient Non-Gaussian SCM is a customized implementation of the nnSEM proposed by S. Shimizu and Y. Kano. Specifically, the quadratic objective function based on the difference between the sample moments and the theoretical moments is optimized with a gradient method (BFGS by default) instead of GLS.
  - For details regarding the concepts of the original nnSEM, please refer to:
    - S. Shimizu and Y. Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
- Auto-SCM Selector for Optimization
  - The SCM Selector automatically determines the optimal model by pre-inspecting the dataset.
  - Internally, it combines voting with multiple hypothesis tests on the data's original distribution to make the model determination more precise: ESA-2SCM is selected as the base model if the pre-inspection suggests Gaussianity; otherwise Gradient Non-Gaussian SCM is selected.
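The voting idea behind the selector can be illustrated with a minimal sketch. This is not the package's internal code: the choice of tests (Shapiro-Wilk, D'Agostino-Pearson, Jarque-Bera), the function name `vote_gaussianity`, and the meaning given to the `"strict"` and `"majority"` strategies are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def vote_gaussianity(sample, alpha=0.05, strategy="majority"):
    """Hypothetical helper: do several normality tests, taken together, suggest Gaussianity?"""
    p_values = [
        stats.shapiro(sample)[1],      # Shapiro-Wilk
        stats.normaltest(sample)[1],   # D'Agostino-Pearson
        stats.jarque_bera(sample)[1],  # Jarque-Bera
    ]
    # A test "votes Gaussian" when it fails to reject normality at level alpha.
    votes = [p > alpha for p in p_values]
    if strategy == "strict":
        # ASSUMPTION: "strict" is read here as "every test must agree".
        return all(votes)
    return sum(votes) > len(votes) / 2  # simple majority


rng = np.random.default_rng(0)
gaussian_sample = rng.normal(size=2000)
skewed_sample = rng.gamma(shape=0.5, scale=0.5, size=2000)

for name, sample in [("normal", gaussian_sample), ("gamma", skewed_sample)]:
    model = "ESA-2SCM" if vote_gaussianity(sample) else "Gradient Non-Gaussian SCM"
    print(f"{name} sample -> {model}")
```

Requiring agreement across several tests makes the decision less sensitive to the weaknesses of any single test, at the cost of a more conservative call on borderline data.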
For further details on each model's algorithm, refer to the Models Overview section below.
- Python3
- numpy
- pandas
- scipy
To install the scmopy package, use pip as follows:
pip install scmopy
from scmopy.nongaussian_scm import GradientNonGaussianScm
from scmopy.gaussian_scm import Esa2Scm
from scmopy.model_selection import ScmSelector
import numpy as np # to generate sample data for demonstration
Gradient Non-Gaussian SCM
# Generate sample data for demonstration
N = 10000
np.random.seed(11)
x2 = np.random.gamma(shape=0.5, scale=0.5, size=N) # non-gaussian sample
noise = np.random.random(size=N) # non-gaussian noise
b12 = 1.8 # True Causal Coefficient set as 1.8
x1 = b12 * x2 + noise # True Causal Direction set as x2 -> x1
# Initialize GradientNonGaussianScm with no prior knowledge of the causal direction
scm = GradientNonGaussianScm(x1, x2, prior_knowledge=None, unit_var=False)
# Fit the model
scm.fit(alpha=0.1) # Set alpha for chi2 test using the test statistic T2 for determination of causal direction
# To confirm the estimated True Causal Direction
print(scm.causal_direction)
# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)
# To confirm the test statistic (T2) and p-value for hypothesis testing on the Causal Direction
print(scm.test_statistic)
print(scm.p_value)
# To confirm the fit score
print(scm.score)
# For result summary:
scm.summary()
| | x2->x1 | x1->x2 |
|---|---|---|
| Causal Direction | Decisive True | Decisive False |
| Causal Coefficient | 1.8010473505277451 | 0.3574865003509795 |
| Test Statistic | 3.4739238216348447 | 646.2378958215542 |
| P-value | 0.6273369079064182 | 0.0 |
| Reject H0 | False | True |
| Goodness of Fit | 0.82038 | - |
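To make the link between the test statistic, the p-value, and the `alpha` passed to `fit()` concrete, the sketch below recomputes the p-value of the x2->x1 row as a chi-square upper-tail probability. The degrees of freedom are not stated in this README; `df = 5` is an assumption, chosen because it is consistent with the reported value.

```python
from scipy import stats

alpha = 0.1                          # significance level used in scm.fit(alpha=0.1)
t2_forward = 3.4739238216348447      # Test Statistic for x2->x1 from the summary
t2_reverse = 646.2378958215542       # Test Statistic for x1->x2 from the summary
df = 5                               # ASSUMPTION: degrees of freedom of the chi2 test

p_forward = stats.chi2.sf(t2_forward, df)   # upper-tail probability
p_reverse = stats.chi2.sf(t2_reverse, df)

# H0 (the model fits) is rejected when the p-value falls below alpha.
print(f"x2->x1: p = {p_forward:.4f}, reject H0: {p_forward < alpha}")
print(f"x1->x2: p = {p_reverse:.4g}, reject H0: {p_reverse < alpha}")
```

Only the direction whose model is not rejected is reported as the true causal direction, which matches the "Reject H0" row of the summary.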
ESA-2SCM
# Initialize ESA-2SCM with no prior knowledge on the causal direction
scm = Esa2Scm(x1, x2, prior_knowledge=None)
# Fit the model, using Synthetic IV generation method(syniv_method, default='esa') to estimate causality
# Adjust the parameter M(default=2) to manually manage the degree of correlation between the Synthetic IVs (2SLS-converted) and the respective endogenous variables
scm.fit(syniv_method="esa", M=3)
# To confirm the estimated True Causal Direction
print(scm.causal_direction)
# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)
# To check the degree of correlation between the generated Synthetic IVs and the endogenous variables (x1 and x2, respectively):
print(scm.corr_x1_to_slsiv)
print(scm.corr_x2_to_slsiv)
# To confirm the true goodness of fit of the ESA-2SCM for determination of the causal direction:
print(scm.esa2scm_score)
# With causal direction determined via ESA-2SCM, to confirm the posthoc goodness of fit of the Regression Model using original variables:
print(scm.posthoc_score)
# For result summary:
scm.summary()
| | x2->x1 | x1->x2 |
|---|---|---|
| Causal Direction | True | False |
| Causal Coefficient | 1.804851 | 0.385192 |
| Goodness of Fit | 0.39664 | 0.35318 |
| Corr (2SLS_IV-Explanatory) | 0.694676 | 0.774985 |
| Posthoc Goodness of Fit | 0.82038 | - |
Auto SCM Selector for Optimal SCM Selection
# Initialize Auto SCM Selector
selector = ScmSelector(x1, x2)
# Fit the selector
selector.fit(alpha=0.15, voting_strategy='strict')
# Confirm optimal model for the given dataset x1 and x2.
best_scm = selector.selected_scm
# Fit using the selected model
best_scm.fit()
# Confirm the estimated True Causal Direction
print(best_scm.causal_direction)
# Confirm the estimated True Causal Coefficient
print(best_scm.causal_coef)
# For result summary:
best_scm.summary()
| | x2->x1 | x1->x2 |
|---|---|---|
| Causal Direction | Decisive True | Decisive False |
| Causal Coefficient | 1.8010473505277451 | 0.3574865003509795 |
| Test Statistic | 3.4739238216348447 | 646.2378958215542 |
| P-value | 0.6273369079064182 | 0.0 |
| Reject H0 | False | True |
| Goodness of Fit | 0.82038 | - |
Gradient Non-Gaussian SCM handles the case where the exogenous variable or the noise follows a non-Gaussian distribution.
ESA-2SCM, on the other hand, handles the case where the noise follows a Gaussian distribution.
In scmopy, the two models complement each other, which enables distribution-agnostic SCM optimization for causal discovery.
Gradient Non-Gaussian SCM
Gradient Non-Gaussian SCM is a customized implementation of the nnSEM proposed by S. Shimizu and Y. Kano (2008). More specifically:
- The quadratic objective function, based on the difference between the sample moments and the theoretical moments, is optimized with a gradient method (BFGS by default) instead of GLS.
- The weight matrix is defined as $\hat{\Sigma}$; the pseudo-inverse $\hat{\Sigma}^+$ is used instead if the inverse of $\hat{\Sigma}$ cannot be obtained.
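The moment-matching idea can be sketched in a deliberately simplified setting. The example below is not the package's implementation: it fits only second-order moments of the model x1 = b * x2 + e, uses an identity weight matrix, and introduces the hypothetical helpers `sigma` and `objective`. It only shows how a quadratic discrepancy between sample and model-implied moments can be minimized with BFGS via `scipy.optimize.minimize`.

```python
import numpy as np
from scipy.optimize import minimize

# Same kind of data as in the usage example: x2 -> x1 with coefficient 1.8
rng = np.random.default_rng(11)
n = 10_000
x2 = rng.gamma(shape=0.5, scale=0.5, size=n)
x1 = 1.8 * x2 + rng.random(size=n)

# Sample second-order moments of the centered data: [var(x1), cov(x1, x2), var(x2)]
c1, c2 = x1 - x1.mean(), x2 - x2.mean()
m = np.array([np.mean(c1 * c1), np.mean(c1 * c2), np.mean(c2 * c2)])


def sigma(theta):
    """Moments implied by x1 = b * x2 + e, with var(x2) = v2 and var(e) = ve."""
    b, v2, ve = theta
    return np.array([b * b * v2 + ve, b * v2, v2])


def objective(theta, weight):
    """Quadratic form (m - sigma(theta))' W (m - sigma(theta))."""
    residual = m - sigma(theta)
    return residual @ weight @ residual


weight = np.eye(3)                    # ASSUMPTION: identity weight, for simplicity
theta0 = np.array([0.0, m[2], m[0]])  # start from b = 0 and the sample variances
result = minimize(objective, theta0, args=(weight,), method="BFGS")

b_hat = result.x[0]
print(f"estimated causal coefficient: {b_hat:.3f}")
```

With second-order moments alone the two candidate directions fit equally well, which is why the method described here relies on third- and fourth-order moments to tell them apart.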
With
Reduced form of the above with respect to
With
The assumption that the SEM is identifiable using moment structures is equivalent to:
Denote sample counterparts to the first and second to fourth theoretical moment structures above as:
Then with
With
S. Shimizu and Y. Kano (2008) obtain
Gradient Non-Gaussian SCM in scmopy adopts instead a gradient method (defaulting to BFGS) for solving the above with
and with
Assuming unit variance
so that
Applying chain rule,
Then, the basic form of gradient descent can be written as:
Generalizing,
Pseudo-inverse matrix of
The procedure for determining the true causal direction is identical to that of nnSEM (S. Shimizu and Y. Kano, 2008), as follows.
Suppose that we are interested in identifying the true causal direction between the two random variables
with,
The first- and second-order moment structures of
As there are as many parameters as the sample moments, models
Now, expanding up to third and fourth order moment structures, S.Shimizu and Y.Kano (2008) prove that under the satisfaction of the following three conditions:
(1) Either the exogenous variable
(2)
(3)
models
That is to say,
with
and,
to get the isolated quantity from the fourth order moment (
if (1)
so that the models
More specifically,
For
to hold,
should hold. Solving for the matrix,
A similar derivation applies to the fourth-order moment.
Thus, the two models
and with
Test statistic
or, in case of Gradient Non-Gaussian SCM in scmopy,
where
Reference and the original conceptualization of nnSEM by S.Shimizu and Y.Kano (2008):
- S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
ESA-2SCM
ESA-2SCM is a new method for detecting causality based on Elastic Segment Allocation-based synthetic instrumental variables with 2SLS application for estimating structural causal models.
Suppose that you are interested in discovering the causal relationship between
Estimation of the above equation under standard OLS is structurally biased and inconsistent due to endogeneity:
where
thus,
The estimators are also asymptotically inconsistent, as:
ESA-2SCM addresses this problem, enabling the determination of the true causal direction and the estimation of the true causal coefficient through the following procedure.
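The endogeneity problem itself can be seen in a small simulation. The data-generating process below is an assumption made for illustration (the regressor is built to be correlated with the structural error); it is not taken from the ESA-2SCM article. It shows that the OLS slope converges to the true coefficient plus Cov(x2, e) / Var(x2), not to the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b_true = 1.8

e = rng.normal(size=n)               # structural error of x1
x2 = 0.8 * e + rng.normal(size=n)    # ASSUMPTION: x2 is correlated with e
x1 = b_true * x2 + e

cov = np.cov(x1, x2)
b_ols = cov[0, 1] / cov[1, 1]        # OLS slope of x1 on x2

# Large-sample bias: Cov(x2, e) / Var(x2) = 0.8 / (0.8**2 + 1)
expected_bias = 0.8 / (0.8**2 + 1.0)
print(f"true b = {b_true}, OLS estimate = {b_ols:.3f}, expected bias = {expected_bias:.3f}")
```

Increasing `n` does not shrink the gap, which is what inconsistency means in practice.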
- Vector definition:
- Sorting:
- Set initial number of segments (M):
- Segment size allocation:
- Elastic adjustment algorithm for adjusting the number of segments:
- Grouping based on the adjusted sizes and number of segments:
- Segment value assignment:
- Apply 2SLS using the generated Synthetic IV vectors (Z):
- Get $z_1$ and $z_2$ via applying the process (1) to (7) for $x_1$ and $x_2$, then perform 2SLS to estimate for:
- Get
Compare fits to determine the true causal direction, and estimate the true causal coefficient from the correctly identified model.
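The overall flow of the steps above (sort, segment, assign segment values, then 2SLS) can be sketched roughly as follows. This is a heavily simplified illustration and not the ESA-2SCM algorithm: it assumes equal-size segments, uses the segment mean as the segment value, omits the elastic adjustment and the fit comparison entirely, and the helper names `synthetic_iv` and `two_stage_slope` are hypothetical. Please refer to the article cited below for the actual procedure.

```python
import numpy as np


def synthetic_iv(x, n_segments=3):
    """Sort x, cut it into equal-size segments, and give each point its segment mean."""
    order = np.argsort(x)
    z = np.empty(len(x), dtype=float)
    for segment in np.array_split(order, n_segments):
        z[segment] = x[segment].mean()   # ASSUMPTION: segment value = segment mean
    return z


def two_stage_slope(y, x, z):
    """2SLS slope of y on x, using z as the instrument (intercepts included)."""
    first_stage = np.polyfit(z, x, 1)        # stage 1: x ~ z
    x_hat = np.polyval(first_stage, z)       # fitted values of x
    return np.polyfit(x_hat, y, 1)[0]        # stage 2: y ~ x_hat


rng = np.random.default_rng(11)
n = 10_000
x2 = rng.gamma(shape=0.5, scale=0.5, size=n)
x1 = 1.8 * x2 + rng.random(size=n)

z2 = synthetic_iv(x2, n_segments=3)
b_hat = two_stage_slope(x1, x2, z2)
corr = np.corrcoef(x2, z2)[0, 1]
print(f"2SLS coefficient for x2->x1: {b_hat:.3f}, corr(x2, z2) = {corr:.3f}")
```

In this simplified version the number of segments controls how closely the synthetic instrument tracks the original variable: more segments mean a higher correlation between `z2` and `x2`.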
Reference and detailed documentation for the ESA-2SCM algorithm:
- Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230>
Examples of running scmopy in Jupyter Notebook are included in scmopy/examples
Should you use the scmopy package, please cite my original article and the original article by S.Shimizu and Y.Kano:
- Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230>
- S. Shimizu and Y. Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
Copyright 2024 Sanghoon Lee (DSsoli). All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.