VAEAC for Abalone dataset
aliamini-uq opened this issue · 3 comments
Dear `shapr` team,
First of all, congrats on your interesting paper: "Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features". In my opinion, the introduction of the VAEAC approach seems promising.
However, to better understand the paper and to give end-users reproducible results, I suggest publishing the code related to at least the Abalone dataset. In particular, I am a little confused about how to compare the different approaches. For example, when I compare the performance of VAEAC with the `independence` and `empirical` approaches, should I first transform the dataset using one-hot encoding for the `independence` and `empirical` approaches? On the other hand, I should not change the dataset for VAEAC, since it has an internal mechanism for handling categorical features. Am I right?
In addition, other sections of this paper are very interesting but need more elaboration, such as the introduction of "EC3 = EPEv" and Fig. 10. Thus, it would be great if you could share the code of this outstanding work.
P.S: This question was originally asked in LHBO/ShapleyValuesVAEAC#1
Kind regards,
A
Dear A,
First and foremost, we thank you very much for the compliments.
I will respond to both issues here since you raised this question here and at LHBO/ShapleyValuesVAEAC#1.
The Abalone dataset is available at https://github.com/LHBO/ShapleyValuesVAEAC/tree/main/data and is used in the corresponding vignette. Change `Rings ~ Diameter + ShuckedWeight + Sex` to `Rings ~ .` if you want to include all the features, and change the approaches to whichever other approaches you want to compare. The `plot_MSEv_eval_crit()` and `plot_SV_several_approaches()` functions in plot.R will help you evaluate the results.
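To make the formula change concrete, here is a minimal sketch (not the exact vignette code; the data path and the `lm` model choice are my assumptions, so adapt them to the vignette's setup):

```r
# Minimal sketch, not the exact vignette code: the file path and the lm model
# are assumptions; adapt them to the actual vignette.
abalone <- read.csv("data/abalone.csv")   # assumed location of the Abalone data
abalone$Sex <- as.factor(abalone$Sex)

# Original restricted formula from the vignette:
# model <- lm(Rings ~ Diameter + ShuckedWeight + Sex, data = abalone)
# Use all features instead:
model <- lm(Rings ~ ., data = abalone)
```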
In the developer version of `shapr` here at this GitHub repository, both the `independence` and `vaeac` methods support categorical data. The former handles the factor levels directly, while `vaeac` one-hot encodes the categorical features internally. Thus, the user does not have to pre-process the data before sending them to `explain()`. Note that `explain()` here is different from `explain()` in https://github.com/LHBO/ShapleyValuesVAEAC due to breaking changes in `shapr`.
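As a hedged sketch (argument names follow the developer version at the time of writing and may differ across `shapr` releases; `x_train`, `x_explain`, and `y_train` are assumed splits of the data), calling `explain()` directly on data containing a factor column looks like this:

```r
library(shapr)

# Sketch under the assumptions above: `model` is fitted on data where Sex is
# a factor; no one-hot encoding is done by the user.
explanation_vaeac <- explain(
  model = model,
  x_explain = x_explain,           # assumed: observations to explain
  x_train = x_train,               # assumed: training observations
  approach = "vaeac",              # one-hot encoding happens internally
  prediction_zero = mean(y_train)  # assumed baseline prediction
)

# The independence approach handles the factor levels directly:
explanation_indep <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "independence",
  prediction_zero = mean(y_train)
)
```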
The `empirical` method does NOT support categorical data due to the distance measure used (a scaled version of the Mahalanobis distance); see "Explaining individual predictions when features are dependent: More accurate approximations to Shapley values". If you want to use the `empirical` approach on datasets with categorical data, then you either have to change the distance measure (i.e., write your own version of `empirical` with another measure) or pre-process the data using encodings; see our comments at the end of Section 3.2 in "A Comparative Study of Methods for Estimating Conditional Shapley Values and When to Use Them". Note that for the latter, you would obtain Shapley value explanations for the one-hot dummy features, not the original categorical features.
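If you go the encoding route, a minimal sketch using base R's `model.matrix()` (the variable names are assumptions) could look like:

```r
# Sketch: one-hot encode the categorical features before using the empirical
# approach. The model must then also be fitted on the encoded data, and the
# resulting Shapley values refer to the dummy features, not the original ones.
x_train_ohe   <- as.data.frame(model.matrix(~ . - 1, data = x_train))
x_explain_ohe <- as.data.frame(model.matrix(~ . - 1, data = x_explain))
```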
The developer version of `shapr` computes the EC3 automatically. It is outputted by `explain()` and is called `MSEv`. We are sorry about any possible misunderstanding related to the names, as this `MSEv` is NOT the same as the EC2, which is also called MSEv in the paper. Thus, we stress that EC3 in the paper is the same as the `MSEv` in the `shapr` package. The criterion was introduced in "Understanding global feature contributions with additive importance measures" and "Shapley explainability on the data manifold", and we refer to them for more in-depth explanations; also see the documentation and vignette. Use `plot_MSEv_eval_crit()` if you want to compare the criterion for several methods.
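For example, a sketch assuming the explanation objects from above (the exact structure of the `MSEv` output may differ between versions):

```r
# The MSEv criterion (EC3 in the paper) is part of the output of explain():
explanation_vaeac$MSEv

# Compare the criterion across methods in one plot:
plot_MSEv_eval_crit(list(
  independence = explanation_indep,
  vaeac = explanation_vaeac
))
```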
In Figure 10, we wanted to illustrate the inferred distributions from which the `vaeac` approach generates the MC samples. That is, for each observation and coalition, the observation is sent through the masked encoder, which produces a latent distribution. From this latent distribution we generate K latent representations, and these are sent through the decoder to produce K different inferred multivariate Gaussian distributions; we then sample one observation from each, and these samples constitute the MC samples. To create a similar figure using the new version of `shapr`, one could go into the code and return the parameters (means and standard deviations) in the `distr` object on line 1470 in approach_vaeac_torch_modules.R for the continuous features, and then plot them with ggplot2.
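As a rough sketch of the plotting step (the means and standard deviations below are placeholder values standing in for the K parameter pairs one would extract from the `distr` object, not output of the method):

```r
library(ggplot2)

# Placeholder (mean, sd) pairs for one continuous feature; replace with the
# values extracted from the `distr` object.
params <- data.frame(sample = factor(1:4),
                     mean   = c(9.2, 9.8, 10.1, 9.5),
                     sd     = c(0.8, 1.1, 0.9, 1.0))

# Evaluate each inferred Gaussian density on a common grid.
grid <- seq(min(params$mean - 3 * params$sd),
            max(params$mean + 3 * params$sd), length.out = 200)
dens <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  data.frame(sample  = params$sample[i],
             x       = grid,
             density = dnorm(grid, params$mean[i], params$sd[i]))
}))

ggplot(dens, aes(x = x, y = density, color = sample)) +
  geom_line() +
  labs(x = "Feature value", y = "Inferred density", color = "Latent sample")
```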
Best regards,
Lars
Dear @LHBO,
Many thanks for this quick and informative feedback. Problem solved!
P.S: Without a doubt, the main shapr paper and the follow-up studies (A Comparative Study of Methods for Estimating Conditional Shapley Values and When to Use Them & Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features) are must-read papers for the XAI community. Great job @martinju and @LHBO!
Kind regards,
A