Welcome to the supplementary material for the paper:
- Edouard Fouché & Klemens Böhm. 2019. Monte Carlo Dependency Estimation. In 31st International Conference on Scientific and Statistical Database Management (SSDBM ’19), July 23–25, 2019, Santa Cruz, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3335783.3335795
This repository contains the original implementation of MCDE/MWP and the information to reproduce the experiments in the paper. For this reason, it is partially frozen at the time of publication. To use the most recent and maintained implementation of MCDE/MWP, please visit this repository. See also the experiments from our follow-up study.
This repository is released under the AGPLv3 license. Please see the LICENSE.md file.
Requirements : (Oracle JDK 8 or OpenJDK 8) and sbt
The project is built with sbt (version 1.2.8). You can compile, package or run the project as follows:
sbt compile
sbt package
sbt "run <arguments>"
You can also export a "fat" jar, including all dependencies and scala libraries using sbt-assembly
:
sbt assembly
This creates a jar in the folder target/scala-2.12/
named MCDE-experiments-<version>.jar
, which can be run from java
(no sbt/scala installation required). The version of the package at the time of the experiments is 1.0.
Once you have built the jar, you can run it as follows:
java -jar target/scala-2.12/MCDE-experiments-1.0.jar <arguments>
The application accepts various arguments. The first two (-t <task>
and -f <file>
) are mandatory. Arguments are not case sensitive and can be given in any order.
-t <task>
: Task to perform. Possible choices:EstimateDependency
: Estimates the dependency of a single subspace (with 2 or more dimensions).EstimateDependencyMatrix
: Estimates the 2-D dependency matrix, i.e., the dependency of each pair in the data.
-f <file>
: the path to the data source (a comma-separated text file with 1 line header) in your system.-a <approach>
: The approach to use for dependency estimation. Possible choices:- MCDE approaches:
MWP
(Default) - Benchmark approaches (see II: Related Work):
HiCS
,TC
,II
,MS
,UDS
,MAC
,CMI
- MCDE approaches:
-p <plevel>
: Level of parallelism to use. Possible choices:0
(Default, running single core)1
(The number of threads is set by the program automatically)- Any integer >1 (The user specifies the number of threads explicitly)
- Please note that, in the case of:
EstimateDependency
, the iterations ofMWP
will run in parallel (not supported for benchmark approaches).EstimateDependencyMatrix
, coefficients are estimated in parallel (thus, supported for all approches).
Additional argument for MWP
:
-m <M>
: the number of Monte Carlo simulations. More simulations lead to more accurate estimates (see Theorem 4), but also increases runtime linearly. Default: 50
Additional argument for EstimateDependency
:
-d <dimensions>
: Dimensions of the subspace on which the dependency should be estimated (a list of integers, comma-separated, starting from 0). For instance, if0,1
is selected only the first two columns of the data are taken into account, if0,2
is selected only column 1 and 3. If not specified, the dependency is estimated on the full space.
We provide a sample of the data used for the experiments in src/test/resources/data
for testing.
- Independent data (2-D)
fouchee@Modena:~/git/MCDE$ java -jar target/scala-2.12/MCDE-experiments-1.0.jar -t EstimateDependency -f src/test/resources/data/Independent-2-0.0.csv -a MWP -m 50 -d 0,1
08:05:41.627 [main] INFO Main$ - Working directory: /home/fouchee/git/MCDE
08:05:41.725 [main] INFO Main$ - Raw parameters given: ["-t", "EstimateDependency", "-f", "src/test/resources/data/Independent-2-0.0.csv", "-a", "MWP", "-m", "50", "-d", "0,1"]
08:05:41.809 [main] WARN Main$ - Parallelism level not specified, running on single core.
08:05:41.854 [main] INFO Main$ - Usage: -t EstimateDependency -f <file> -a <approach> -m <M> -d <dimensions> -p <plevel>
0.4753742163666849
Data Loading time: 76.138745 ms (cpu), 76.460393 ms (wall)
Preprocessing time: 32.325762 ms (cpu), 36.009404 ms (wall)
Computation time: 37.12567 ms (cpu), 37.323588 ms (wall)
- Linear dependent data (2-D)
fouchee@Modena:~/git/MCDE$ java -jar target/scala-2.12/MCDE-experiments-1.0.jar -t EstimateDependency -f src/test/resources/data/Linear-2-0.0.csv -a MWP -m 50 -d 0,1
08:06:10.943 [main] INFO Main$ - Working directory: /home/fouchee/git/MCDE
08:06:11.033 [main] INFO Main$ - Raw parameters given: ["-t", "EstimateDependency", "-f", "src/test/resources/data/Linear-2-0.0.csv", "-a", "MWP", "-m", "50", "-d", "0,1"]
08:06:11.112 [main] WARN Main$ - Parallelism level not specified, running on single core.
08:06:11.153 [main] INFO Main$ - Usage: -t EstimateDependency -f <file> -a <approach> -m <M> -d <dimensions> -p <plevel>
0.9999998904398533
Data Loading time: 72.585293 ms (cpu), 73.281843 ms (wall)
Preprocessing time: 31.995782 ms (cpu), 32.164805 ms (wall)
Computation time: 35.695044 ms (cpu), 35.812886 ms (wall)
- Independent data (2-D)
fouchee@Modena:~/git/MCDE$ java -jar target/scala-2.12/MCDE-experiments-1.0.jar -t EstimateDependencyMatrix -f src/test/resources/data/Independent-5-0.0.csv -a MWP -m 50 -p 0
08:06:46.343 [main] INFO Main$ - Working directory: /home/fouchee/git/MCDE
08:06:46.434 [main] INFO Main$ - Raw parameters given: ["-t", "EstimateDependencyMatrix", "-f", "src/test/resources/data/Independent-5-0.0.csv", "-a", "MWP", "-m", "50", "-p", "0"]
08:06:46.536 [main] WARN Main$ - Running with parallelism level: 0
08:06:46.580 [main] INFO Main$ - Usage: -t EstimateDependencyMatrix -f <file> -a <approach> -m <M> -p <plevel>
0.00 0.53 0.41 0.44 0.43
0.53 0.00 0.55 0.38 0.57
0.41 0.55 0.00 0.43 0.57
0.44 0.38 0.43 0.00 0.58
0.43 0.57 0.57 0.58 0.00
Data Loading time: 92.681802 ms (cpu), 92.88746 ms (wall)
Preprocessing time: 39.954535 ms (cpu), 39.999726 ms (wall)
Computation time: 85.453125 ms (cpu), 86.205151 ms (wall)
For larger tables (e.g., with 100 dimensions), parallelism provides a significant runtime improvement:
- Independent data (100-D, no parallelism)
fouchee@Modena:~/git/MCDE$ java -jar target/scala-2.12/MCDE-experiments-1.0.jar -t EstimateDependencyMatrix -f src/test/resources/data/Independent-100-0.0.csv -a MWP -m 50 -p 0
08:07:14.583 [main] INFO Main$ - Working directory: /home/fouchee/git/MCDE
08:07:14.670 [main] INFO Main$ - Raw parameters given: ["-t", "EstimateDependencyMatrix", "-f", "src/test/resources/data/Independent-100-0.0.csv", "-a", "MWP", "-m", "50", "-p", "0"]
08:07:14.961 [main] WARN Main$ - Running with parallelism level: 0
08:07:15.063 [main] INFO Main$ - Usage: -t EstimateDependencyMatrix -f <file> -a <approach> -m <M> -p <plevel>
0.00 0.50 0.44 0.48 0.43 0.51 0.57 0.59 0.58 0.53 ... (truncated)
0.50 0.00 0.50 0.55 0.52 0.59 0.59 0.55 0.55 0.50 ... (truncated)
0.44 0.50 0.00 0.45 0.44 0.54 0.49 0.44 0.46 0.66 ... (truncated)
0.48 0.55 0.45 0.00 0.51 0.43 0.54 0.47 0.55 0.35 ... (truncated)
0.43 0.52 0.44 0.51 0.00 0.51 0.47 0.44 0.53 0.55 ... (truncated)
0.51 0.59 0.54 0.43 0.51 0.00 0.60 0.56 0.46 0.56 ... (truncated)
0.57 0.59 0.49 0.54 0.47 0.60 0.00 0.57 0.40 0.43 ... (truncated)
0.59 0.55 0.44 0.47 0.44 0.56 0.57 0.00 0.45 0.60 ... (truncated)
0.58 0.55 0.46 0.55 0.53 0.46 0.40 0.45 0.00 0.53 ... (truncated)
0.53 0.50 0.66 0.35 0.55 0.56 0.43 0.60 0.53 0.00 ... (truncated)
... ... ... ... ... ... ... ... ... ...
(truncated)
Data Loading time: 269.652442 ms (cpu), 280.788656 ms (wall)
Preprocessing time: 96.618713 ms (cpu), 96.945092 ms (wall)
Computation time: 2596.539562 ms (cpu), 2643.743548 ms (wall)
- Independent data (100-D, with parallelism)
fouchee@Modena:~/git/MCDE$ java -jar target/scala-2.12/MCDE-experiments-1.0.jar -t EstimateDependencyMatrix -f src/test/resources/data/Independent-100-0.0.csv -a MWP -m 50 -p 1
08:07:34.170 [main] INFO Main$ - Working directory: /home/fouchee/git/MCDE
08:07:34.260 [main] INFO Main$ - Raw parameters given: ["-t", "EstimateDependencyMatrix", "-f", "src/test/resources/data/Independent-100-0.0.csv", "-a", "MWP", "-m", "50", "-p", "1"]
08:07:34.543 [main] WARN Main$ - Running with default parallelism level.
08:07:34.674 [main] INFO Main$ - Usage: -t EstimateDependencyMatrix -f <file> -a <approach> -m <M> -p <plevel>
0.00 0.47 0.48 0.41 0.46 0.42 0.49 0.51 0.57 0.50 ... (truncated)
0.47 0.00 0.62 0.53 0.53 0.51 0.52 0.48 0.52 0.63 ... (truncated)
0.48 0.62 0.00 0.35 0.42 0.57 0.48 0.49 0.48 0.64 ... (truncated)
0.41 0.53 0.35 0.00 0.41 0.45 0.50 0.39 0.57 0.42 ... (truncated)
0.46 0.53 0.42 0.41 0.00 0.53 0.42 0.39 0.51 0.49 ... (truncated)
0.42 0.51 0.57 0.45 0.53 0.00 0.47 0.49 0.46 0.52 ... (truncated)
0.49 0.52 0.48 0.50 0.42 0.47 0.00 0.63 0.45 0.44 ... (truncated)
0.51 0.48 0.49 0.39 0.39 0.49 0.63 0.00 0.55 0.59 ... (truncated)
0.57 0.52 0.48 0.57 0.51 0.46 0.45 0.55 0.00 0.58 ... (truncated)
0.50 0.63 0.64 0.42 0.49 0.52 0.44 0.59 0.58 0.00 ... (truncated)
... ... ... ... ... ... ... ... ... ...
(truncated)
Data Loading time: 263.604825 ms (cpu), 275.675212 ms (wall)
Preprocessing time: 123.729107 ms (cpu), 123.756542 ms (wall)
Computation time: 51.002643 ms (cpu), 1123.108306 ms (wall)
In this section, we explain how to reproduce the experiments from our paper.
The experiments create about 20MB of data and require about 16 hours
on a server with 20 cores at 2.2 Ghz and 32GB RAM, using Java Open-JDK 8 and Scala 2.12.8.
Results are saved in the folder experiments
as .csv
files, along with logs.
Evaluate the statistical power of MCDE/MWP and of the benchmark approaches against a panel of dependencies.
sbt "run com.edouardfouche.experiments.Power" # ~ 7 hours, 3MB data
Evaluate the influence of parameter M w.r.t. the statistical power of MCDE/MWP.
sbt "run com.edouardfouche.experiments.PowerM" # ~ 10 minutes, 1MB data
Evaluate the sensitivity of each approach w.r.t. n, the number data points.
sbt "run com.edouardfouche.experiments.PowerN" # ~ 2.5 hours, 4MB data
Evaluate the robustness of each approach against discrete data.
sbt "run com.edouardfouche.experiments.PowerDiscrete" # ~ 3 hours, 5MB data
Evaluate the scalability of each approach w.r.t. n, the number data points.
sbt "run com.edouardfouche.experiments.ScalabilityN" # ~ 45 minutes, 5MB data
Evaluate the scalability of each approach w.r.t. d, the number dimensions.
sbt "run com.edouardfouche.experiments.ScalabilityD" # ~ 2.5 hours, 2MB data
Then, you can use the Jupyter notebooks in folder visualize
to reproduce
the plots from the publication. By the time of the experiments, we use the following Python packages:
# Name Version
matplotlib 2.0.2
numpy 1.13.1
pandas 0.20.3
seaborn 0.8
We welcome contributions to the repository and bug reports on GitHub.
For questions and comments, please contact edouard.fouche@kit.edu
, or open an issue.
-
We propose a standalone, deployment-ready version of MCDE in this repository.
-
See also the experiments from our follow-up study.
-
We developed a data generator for these experiments, which we released independently here.
This repository features code from other projects. We would like to acknowledge the following contributions:
-
Hoang Vu Nguyen for providing the source for
HiCS
,TC
,II
,MS
,UDS
,MAC
,CMI
in the code dump for the paperNguyen, Hoang-Vu, Panagiotis Mandros, and Jilles Vreeken. "Universal dependency analysis." Proceedings of the 2016 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2016.
-
The ELKI project for the R*-Tree index structure, that we use to accelerate the kNN queries in
TC
andII
. -
Hendrik Braun, for the implementation of
TC
,II
,MS
, as a part of his Bachelor's thesis at KIT. -
This work was supported by the DFG Research Training Group 2153: ‘Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation’ and the German Federal Ministry of Education and Research (BMBF) via Software Campus (01IS17042).