TATTER (Two-sAmple TesT EstimatoR) is a tool for performing two-sample hypothesis tests. The two-sample hypothesis test asks whether two distributions p(x) and q(x) differ, on the basis of finite samples drawn from each of them. This ubiquitous problem appears in a wide range of applications, from data mining to data analysis and inference. This implementation can perform the Kolmogorov-Smirnov test (for one-dimensional data only), the Kullback-Leibler divergence test, and the Maximum Mean Discrepancy (MMD) test. The module performs a bootstrap algorithm to estimate the null distribution of the test statistic and computes the corresponding p-value.
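For intuition only (this is a minimal sketch, not TATTER's internal code), the MMD variant of such a test with a permutation-based estimate of the null distribution could look roughly as follows; the helper names mmd2_rbf and permutation_null are illustrative and not part of TATTER's API:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_rbf(X, Y, gamma=1.0):
    # Biased estimate of the squared MMD (Gretton et al. 2012):
    # mean(Kxx) + mean(Kyy) - 2 * mean(Kxy), with an RBF kernel of parameter gamma.
    return (rbf_kernel(X, X, gamma=gamma).mean()
            + rbf_kernel(Y, Y, gamma=gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma=gamma).mean())

def permutation_null(X, Y, iterations=1000, gamma=1.0, random_state=0):
    # Approximate the null distribution by shuffling the pooled sample and
    # re-splitting it into two groups of the original sizes.
    rng = np.random.default_rng(random_state)
    pooled = np.vstack([X, Y])
    n = len(X)
    null = np.empty(iterations)
    for i in range(iterations):
        perm = rng.permutation(len(pooled))
        null[i] = mmd2_rbf(pooled[perm[:n]], pooled[perm[n:]], gamma=gamma)
    return null

# Toy example: two slightly shifted 2-D Gaussian samples.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.3, 1.0, size=(120, 2))
observed = mmd2_rbf(X, Y, gamma=0.5)
null = permutation_null(X, Y, iterations=500, gamma=0.5)
p_value = (null >= observed).mean()  # share of shuffled statistics >= observed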
numpy, matplotlib, sklearn, joblib, tqdm, pathlib
- The employed implementation of the Kullback-Leibler divergence is slow; generating a few thousand bootstrap realizations is not practical when the sample sizes are large (n, m > 1000). A rough sketch of the underlying k-NN estimator is given after the testing notes below.
- The provided tests reproduce Figures X, X, and X in the paper. Running all of these tests takes ~30 minutes. If you are impatient to reproduce one of the figures, try mnist_digits_distance.py first.
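To illustrate the Kullback-Leibler caveat above, here is a rough sketch (an assumption for illustration, not TATTER's exact implementation) of the k-nearest-neighbor divergence estimator of Wang, Kulkarni, and Verdú [3]; every bootstrap realization repeats two nearest-neighbor searches of this kind, which is what makes large samples slow:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def kl_divergence_knn(X, Y, k=1):
    # Estimate D(P || Q) from X ~ P with shape (n, d) and Y ~ Q with shape (m, d).
    n, d = X.shape
    m = Y.shape[0]
    # rho: distance from each x_i to its k-th nearest neighbor among the other X points
    # (k + 1 neighbors are queried because the nearest neighbor of x_i in X is x_i itself).
    rho = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[0][:, -1]
    # nu: distance from each x_i to its k-th nearest neighbor in Y.
    nu = NearestNeighbors(n_neighbors=k).fit(Y).kneighbors(X)[0][:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))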
[1]. A. Farahi and Y. Chen, "TATTER: A hypothesis testing tool for multi-dimensional data." Astronomy and Computing 34 (2021).
[2]. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test." Journal of Machine Learning Research 13 (2012): 723-773.
[3]. Q. Wang, S. R. Kulkarni, and S. Verdú, "Divergence estimation for multidimensional densities via k-nearest-neighbor distances." IEEE Transactions on Information Theory 55, no. 5 (2009): 2392-2405.
[4]. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, "Numerical recipes." (1989).
Run the following to install:
pip install tatter
To start using TATTER, simply use from tatter import two_sample_test to
access the primary function. The exact requirements for the inputs are
listed in the docstring of the two_sample_test() function further below.
An example of using TATTER looks like this:
from tatter import two_sample_test

# X and Y are the two samples to compare (array-like, one row per data point);
# gamma sets the bandwidth parameter of the RBF kernel used by the MMD test.
test_value, test_null, p_value = \
    two_sample_test(X, Y,
                    model='MMD',
                    iterations=1000,
                    kernel_function='rbf',
                    gamma=gamma,
                    n_jobs=4,
                    verbose=True,
                    random_state=0)
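As a possible follow-up (not part of TATTER itself, and assuming test_null is returned as a one-dimensional array of bootstrap realizations of the statistic), the null distribution can be compared visually with the observed value:

import matplotlib.pyplot as plt

plt.hist(test_null, bins=40, alpha=0.7, label='bootstrap null distribution')
plt.axvline(test_value, color='k', linestyle='--',
            label='observed statistic (p = %.3f)' % p_value)
plt.xlabel('MMD test statistic')
plt.ylabel('count')
plt.legend()
plt.show()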