/Reproducibility-EstimationOfCopulasViaMMD

Replication for the article 'Estimation of copulas via Maximum Mean Discrepancy'

Pierre Alquier, Badr-Eddine Chérief-Abdellatif, Alexis Derumigny, and Jean-David Fermanian


This file describes how to replicate the numerical results of the article 'Estimation of copulas via Maximum Mean Discrepancy'. The files are divided into two workflows. The main workflow runs the simulations and produces the figures for the MMD-based estimators, while the second workflow does the same for the MMD-based confidence intervals.

Main workflow

The main workflow for replicating the figures consists of the following files:

  1. Several simulation files, which run all the simulations and estimations (for all estimators):

    | File name | n | Family | Tau | Type contam | % contam | $\gamma$ | Init | Kernel |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 1-simus_gamma_tau_N.R | 1000 | N | all | top_left | 0%, 5% | all | both | all |
    | 1-simus_gamma_tau_fam.R | 1000 | C, G, F | all | top_left | 0%, 5% | all | both | G |
    | 1-simus_typeContam.R | 1000 | N | 0.5 | all | 0.25% to 5% | optimal | random | G |
    | 1-simus_contamFam.R | 1000 | C, G, F | 0.5 | top_left | 0% to 5% | optimal | random | G |
    | 1-simus_n.R | all | N | 0.5 | top_left | 0%, 5%, 10% | optimal | random | G |
    | 1-simus_MO_gamma_par.R | 1000 | MO | all | top_left | 0%, 5% | all | both | G |
    | 1-simus_MO_nOutliers.R | 1000 | MO | par = 0.5 | top_left | 0% to 5% | optimal | random | G |

    Each column of the table above describes one aspect of the simulation experiment.

    • n: the sample size used for the simulation.

    • Family: the parametric families of copulas from which the data are simulated.

      • "N": Normal (Gaussian)
      • "C": Clayton
      • "G": Gumbel
      • "F": Frank
      • "MO": Marshall-Olkin
    • Tau: the Kendall's tau between the two simulated variables.

      • "all": all Kendall's tau in the interval $(-1,1)$ for the Normal copula and $(0,1)$ for the others
      • "0.5": only $\tau = 0.5$
      • "par = 0.5": the parameter of the Marshall-Olkin copula is fixed at 0.5.
    • Type contam: the type of contamination used.

      • "top_left": contamination by outliers uniformly distributed in the upper-left corner of the square $[0,1]^2$.
      • "all": contamination by each of the 9 types of outliers considered in the article.
    • % contam: the percentage of observations that have been contaminated by outliers.

    • $\gamma$: the way of choosing the tuning parameter $\gamma$.

      • "all": the estimation is done for all considered values of the parameter $\gamma$.
      • "optimal": the estimation is only done using the optimal value of the parameter $\gamma$.
    • Init: the method of initialization of the algorithm.

      • "both": both random initialization and initialization by the empirical Kendall's tau are used.
      • "random": only the initialization with a random parameter is used.
    • Kernel: the kernel used.

      • "all": the following kernels are used: "gaussian", "gaussian.Phi", "exp-l2", "exp-l2.Phi", "exp-l1", "exp-l1.Phi".
      • "G": only the kernels "gaussian" and "gaussian.Phi" are used.

  2. A main aggregation script, 2-aggregation.R, which collects the outputs of all these scripts, merges them, and computes the MSE, the average computation time and other statistics.

  3. Several RMarkdown documents that process this information and construct the figures and tables of the paper.

    They can be rendered from R using the following commands:

    rmarkdown::render("3-main_figures.Rmd")
    rmarkdown::render("3-dashboard_paramCopulas.Rmd")
    rmarkdown::render("3-dashboard_MO.Rmd")
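
To make the columns of the simulation table more concrete, here is a minimal base-R sketch of one experiment: a Gaussian copula with Kendall's tau 0.5, "top_left" contamination, and the MSE of an estimator over a few replications. It is not the repository's code. The function names, the corner size `q`, the small sample size and the number of replications are all assumptions made for illustration, and the empirical Kendall's tau is used as a simple non-robust stand-in for the estimators compared in the article.

```r
# Illustrative sketch in base R; not the code of the simulation scripts.
set.seed(1)

# Draw n points from a Gaussian copula with Kendall's tau `tau`,
# using the relation rho = sin(pi * tau / 2).
simulate_gaussian_copula <- function(n, tau) {
  rho <- sin(pi * tau / 2)
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  cbind(u1 = pnorm(z1), u2 = pnorm(z2))
}

# Replace a proportion `prop` of the rows by outliers drawn uniformly in the
# upper-left corner [0, q] x [1 - q, 1]. The corner size q is an assumption.
contaminate_top_left <- function(u, prop, q = 0.01) {
  n_out <- round(prop * nrow(u))
  if (n_out > 0) {
    idx <- seq_len(n_out)
    u[idx, 1] <- runif(n_out, min = 0, max = q)
    u[idx, 2] <- runif(n_out, min = 1 - q, max = 1)
  }
  u
}

# Stand-in estimator (assumption): the empirical Kendall's tau.
estimate_tau <- function(u) cor(u[, 1], u[, 2], method = "kendall")

# Mean squared error of the estimator over n_rep independent replications.
mse_of_tau <- function(n, tau, prop, n_rep = 20) {
  estimates <- replicate(n_rep, {
    u <- contaminate_top_left(simulate_gaussian_copula(n, tau), prop)
    estimate_tau(u)
  })
  mean((estimates - tau)^2)
}

mse_clean <- mse_of_tau(n = 200, tau = 0.5, prop = 0)
mse_contaminated <- mse_of_tau(n = 200, tau = 0.5, prop = 0.05)
print(c(clean = mse_clean, contaminated = mse_contaminated))
```

In this sketch the contaminated MSE should come out clearly larger than the clean one, because outliers in the upper-left corner are discordant with almost every regular observation and pull the empirical Kendall's tau downwards.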
    

Specific workflow for confidence intervals

The second workflow consists of the following files:

  • confint_simulations.R: contains the code to run the simulations for the MMD-based confidence intervals and store the results in a file.

  • confint_figures.Rmd: contains the code to read the simulations made by confint_simulations.R, and to make the two (sub-)figures based on this data.
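
The two steps above have to be run in order, since the RMarkdown document reads the results stored by the simulation script. A minimal sketch of this, assuming both files sit in the directory passed as `dir`, could look as follows. The helper `run_confint_workflow` is hypothetical and not part of the repository.

```r
# Hypothetical helper (not part of the repository): runs the two steps of the
# confidence-interval workflow in order, after checking that both files exist.
run_confint_workflow <- function(dir = ".") {
  sim_script <- file.path(dir, "confint_simulations.R")
  fig_report <- file.path(dir, "confint_figures.Rmd")
  files <- c(sim_script, fig_report)
  missing_files <- files[!file.exists(files)]
  if (length(missing_files) > 0) {
    stop("Missing file(s): ", paste(missing_files, collapse = ", "))
  }
  # Step 1: run the simulations and store the results (this can take a long time).
  source(sim_script)
  # Step 2: read the stored results and build the figures.
  rmarkdown::render(fig_report)
}
```

Calling `run_confint_workflow()` from the root of the repository would then run the simulations before rendering the figures.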

Requirements

This study was done using R versions 4.1.0 to 4.1.2 with the following packages:

| Package | Version |
| --- | --- |
| MMDCopula | 0.2.0 |
| VineCopula | 2.4.3 |
| pbapply | 1.5.0 |
| here | 1.0.1 |
| tidyverse | 1.3.1 |
| ggplot2 | 3.3.5 |
| purrr | 0.3.4 |
| dplyr | 1.0.7 |
| tidyr | 1.1.4 |
| plotly | 4.10.0 |
| rmarkdown | 2.11 |
| flexdashboard | 0.5.2 |

They can all be installed using the following command. (Note that installing the tidyverse package also installs ggplot2, purrr, dplyr and tidyr as dependencies.)

install.packages(c("MMDCopula", "VineCopula", "pbapply", "here", "tidyverse",
                   "plotly", "rmarkdown", "flexdashboard"))
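
Since `install.packages` installs the current CRAN releases rather than the versions listed above, it can be useful to compare what is installed with the table. The following is a small sketch of such a check. The names `required_versions` and `check_versions` are introduced here for illustration only.

```r
# Versions listed in the requirements table of this README.
required_versions <- c(
  MMDCopula = "0.2.0", VineCopula = "2.4.3", pbapply = "1.5.0", here = "1.0.1",
  tidyverse = "1.3.1", ggplot2 = "3.3.5", purrr = "0.3.4", dplyr = "1.0.7",
  tidyr = "1.1.4", plotly = "4.10.0", rmarkdown = "2.11", flexdashboard = "0.5.2"
)

# Compare installed package versions with the required ones.
# Packages that are not installed get NA in the `installed` column.
check_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1))
  data.frame(
    package = names(required),
    required = unname(required),
    installed = unname(installed),
    matches = !is.na(installed) & unname(installed) == unname(required),
    row.names = NULL
  )
}

version_report <- check_versions(required_versions)
print(version_report)
```

A row with `matches = FALSE` does not necessarily prevent the scripts from running; it only signals a difference from the environment in which the study was done.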

Running these simulations took around two months of computation on a Windows 10 laptop with an Intel Core i7-3630 2.40GHz processor, running several R sessions in parallel.