computorg/published-202312-cleynen-local

Report of reviewer 1


Associate Editor: Nelle Varoquaux
Reviewer: Julyan Arbel

Paper submitted Aug 10, 2022
Reviews received Oct 11, 2022
Decision: major revision Nov 25, 2022
Paper revised Sep 18, 2023
Reviews received Oct 31, 2023
Paper accepted Nov 20, 2023

First round

The paper focuses on tree-based classification methods that aim at predicting a specific instance in the classification setting. This is motivated by an approximate Bayesian computation (ABC) application, where prediction is to be made from one observation only. Such classification methods are called local. The paper proposes a review of such local classification approaches. It also introduces two novel methods of this kind, which did not show conclusive results. I think that the authors do a nice job of reviewing the literature on local tree methods for classification, as well as of proposing an extensive implementation of them. I do not have major critical comments about the paper. I rather list some suggestions that could hopefully help with the overall paper presentation.

Thank you for your time, your comments, and your constructive suggestions. We have addressed your comments in the revised version of the manuscript and give details below.

Major comments

  1. I find the abstract to be slightly misleading as it does not really reflect the content of the paper: a good part of it is devoted to the ABC motivating example, although the article does not implement it in the end. Two options would be to either downplay the description of this ABC application in the abstract (which I think should focus on the proposed methods), or to keep the abstract this way but add an illustrative application to ABC.

Thank you for your comment. We have reformulated the abstract to put less focus on ABC and more on random forests.

  2. The paper makes an honest statement about its own proposed methods, which turned out to be unsatisfactory due to high computational cost and limited performance. I think this is totally acceptable, and the fact that the journal considers such negative results potentially useful to the readership is a good thing. In this direction, I wonder if the authors could provide some more feedback/explanation about what goes wrong with their two approaches. For instance, it is written at the beginning of Section 8 that those approaches “were implemented and compared on a lower dimensional simulation study (same Gaussian examples with only 500 test data and 5 replications) but were dropped of the final comparison due to high computational cost despite poor results.” I think that it would be unfortunate not to include at least some toy illustration of the proposed methods. Maybe the setup that was dropped (same Gaussian examples with only 500 test data and 5 replications) could be considered in the Appendix?

We have re-run the small example and included the results at the end of Section 8.2. The results of our methods are not particularly worse than those of the other methods we compare against, but they clearly do not improve on them, and come at a huge computational cost (about 750k times the runtime of a classic random forest). To be honest, it is hard to understand why local methods fail to provide better results than global ones. Our best interpretation is that methods based on classic trees (bagged CARTs, random forests) are already quite local, as at each split only the data in the mother node is considered. Hence, provided the initial cuts are smart enough, the successive cuts will already have a local flavor. Bagging a large number of trees will attenuate the results of poor trees (those with poor initial cuts), hence the overall good performance.

  3. Following the previous comment: I wonder if some more specific title wouldn’t be in order. Indeed, on top of providing a review of local tree methods for classification, the paper also proposes approaches which did not turn out to be conclusive. Would something like “Local tree methods for classification: a review and some dead ends” work?

You are absolutely right, we changed our title as suggested.

  4. Sections 4.2 and 4.3: Regarding the kernel bandwidths, I wonder if the authors thought about using larger values than the maximum absolute value considered when alpha is set to 1? As it is presented, it looks like this value is a kind of maximum possible value to be used, but nothing actually precludes using larger values. Could it lead to better results? I think a discussion about this limiting value would be useful.

Indeed, there is no reason alpha = 1 should be the maximum value. However, as alpha increases the kernel becomes flatter, giving more weight to non-neighboring data and getting closer to a classic random forest (which can be interpreted as a kernel-based approach with alpha going to infinity), hence reducing the local effect we aim for. We have not tried larger values, especially considering the computational burden of the methods, but we would expect the performance to be similar, since on our small example Multi-K, Uni-K, and random forests have similar performance.
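To make the discussion concrete, here is a minimal sketch of the kind of weighting involved, assuming a univariate Gaussian kernel whose bandwidth for covariate $j$ is set to the $\alpha$-quantile of the absolute deviations from the target point $x^*$ (the exact conventions used in the paper may differ):

$$
h_j \;=\; \mathcal{Q}_\alpha\bigl(|x_{1j} - x^*_j|, \dots, |x_{nj} - x^*_j|\bigr),
\qquad
w_{ij} \;=\; \exp\!\left(-\frac{(x_{ij} - x^*_j)^2}{2\,h_j^2}\right).
$$

As the bandwidth $h_j$ grows, all weights $w_{ij}$ tend to 1 and the weighted split criterion reverts to the unweighted, global one, which is the sense in which a classic random forest is recovered in the limit.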

Minor comments

  1. The Gini index and entropy are alluded to in Section 3 but not defined there, and then are used in later sections. It could be useful to recall their expressions already in Section 3.

We agree and have added their definitions after they are first mentioned in Section 3.
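For reference, the standard definitions, written here for a node $t$ with class proportions $p_1, \dots, p_K$ (the manuscript's exact notation may differ), are

$$
\mathrm{Gini}(t) \;=\; \sum_{k=1}^{K} p_k (1 - p_k) \;=\; 1 - \sum_{k=1}^{K} p_k^2,
\qquad
\mathrm{Entropy}(t) \;=\; -\sum_{k=1}^{K} p_k \log p_k .
$$

Both impurity measures vanish for a pure node and are maximal when all classes are equally represented.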

  2. The red & green colors used in Figure 1 are not color-blind friendly. I’m not sure about the other figures. Maybe consider changing them to a more suitable palette?

Thank you for your suggestion. We have now modified the colors in all Figures using the safe color-blind palette in R.
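For readers who want to reproduce the change, here is a minimal sketch in base R. The thread does not name the exact palette used, so the Okabe-Ito palette shipped with R (>= 4.0.0) is shown as one color-blind-safe option:

```r
# Color-blind-safe palette shipped with base R (grDevices, R >= 4.0.0).
cb_cols <- grDevices::palette.colors(n = 8, palette = "Okabe-Ito")

# Example: color a scatter plot by class using the safe palette.
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = cb_cols[as.integer(iris$Species)], pch = 19,
     xlab = "Sepal length", ylab = "Sepal width")
```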

  3. Section 4.2, Unidimensional kernel approach: I would suggest adding “(per covariate)” to this section title in order to specify that unidimensional refers to treating covariates one at a time. Otherwise, there could be a misunderstanding that this section deals with unidimensional data.

Thank you for your suggestion. We have now modified the section title.

  4. The quantile notation $\mathcal{Q}_\alpha(\dots)$ is quite straightforward, but it would still need to be defined somewhere, I guess.

We have now added the mathematical definition of the quantile in the manuscript (a standard form is recalled below).
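For reference, one standard definition of the empirical quantile of order $\alpha \in (0, 1]$ of values $v_1, \dots, v_n$ (the convention adopted in the manuscript may differ slightly) is

$$
\mathcal{Q}_\alpha(v_1, \dots, v_n) \;=\; \inf\Bigl\{ v \in \mathbb{R} \;:\; \tfrac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{v_i \le v\} \;\ge\; \alpha \Bigr\},
$$

so that in particular $\mathcal{Q}_1(v_1, \dots, v_n) = \max_i v_i$, the maximum of the values considered (the case discussed in the next comment).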

  5. The quantile order $\alpha$ is often set to 1. Maybe just state that this amounts to choosing the maximum of the absolute values considered?

Yes, this has been added.

  6. “We observed very few differences”: specify. Maybe something like “We observed very few differences when using a fixed or a varying bandwidth…”

We reformulated that sentence.

  7. “The first term $\tilde{I}(t)$ is important and cannot be omitted contrary to the eager version, because it depends on the covariate index.” This is not so clear at first sight since the notation does not depend on $j$. Maybe make both notations depend on $j$?

You are right; we modified the notations to make this statement clearer.

  8. I would ask for some clarification about the bandwidths that are used throughout the paper. The one in Section 4.2 is associated with a univariate Gaussian kernel. I haven’t seen it specified, but I guess it refers to the Gaussian standard deviation $\sigma$? Then moving to Section 4.3, I wonder about some possible confusion on how the “scaling matrix” $V$ is used in the kernel. Shouldn’t it be its inverse? Assuming the inverse is right, and in order to be coherent with Section 4.2, shouldn’t the quantiles be squared in the definition of $V$ (since then it would be a covariance matrix, and not a matrix of standard deviations)? I’m afraid that otherwise, the choices of quantile values made in both sections do not coincide.

Thank you for your careful reading. Indeed, the inverse and square of the matrix $V$ were missing in the text. We have corrected that, and confirm that the quantile values used in the univariate and multivariate kernels coincide.
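A sketch of the corrected form being discussed, assuming a multivariate Gaussian-type kernel centered at the target point $x^*$ with a diagonal scaling matrix built from the squared per-covariate bandwidths (this is our reading of the exchange, not a verbatim excerpt from the manuscript):

$$
K(x_i) \;\propto\; \exp\!\Bigl(-\tfrac{1}{2}\,(x_i - x^*)^\top V^{-1} (x_i - x^*)\Bigr),
\qquad
V \;=\; \operatorname{diag}\bigl(h_1^2, \dots, h_p^2\bigr),
$$

with each $h_j$ chosen per covariate as in Section 4.2, so that the univariate and multivariate quantile choices coincide.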

  9. Section 5.1 ends with “We tried various values of Nmin in our experiments.” Could you add a concluding sentence about the effect of the Nmin value?

We added a short comment at the end of the paragraph, and we discuss the effect of Nmin in the simulation sections.

  10. Section 5.2: the NN acronym should be defined right after “nearest neighbors” is first used. I have to confess I first thought about “neural networks” when reading NN.

We apologize for the confusion and have now defined NN in the text.

  11. Numerical experiments: could the four tables be merged? The entries are always the same; it should just be a matter of adding new columns. Results would be more easily compared.

We have merged some of the tables, but have limited this to results from the same example, as the HTML rendering of the tables did not allow us to clearly distinguish columns from different experiments. We hope this new version is a bit more readable.

  12. Numerical experiments: “LDA axes”: write in plain words? LDA may have multiple meanings.

We have now defined the LDA axes as the linear discriminant analysis axes.

  13. Figure 4 caption: “blue” is used twice while “green” is missing. Also, the “black dashed lines” are not very visible. Try white instead?

With your suggestion of using the safe color-blind palette, the black dashed lines are now visible. We have also corrected the error in the figure caption.

  14. “In this example again, bagging CARTs outperforms a classic random forest, which itself outperforms all local approaches.” It is not so clear-cut that “all local approaches” are outperformed.

You are right, and we have toned down this statement in the manuscript. However, no local method reaches the performance of bagging CARTs, and considering the computational cost of most of them, it is not clear they are worth using in practice.

Possible typos

  1. Introduction: “To this effec” <- “To this effect”
  2. Introduction: “we present/introduce”: choose only one verb?
  3. It is often written “giving more weightS to”; I think the singular “weight” would be ok.
  4. “Moreover, per tree a multidimensional kernel is used.” to be changed to “Moreover, a multidimensional kernel per tree is used.” or something like that.
  5. Algorithm 1: “ends in the same leaf” <- “end in the same leaf”
  6. After Algorithm 1: “The higher the $N_{min}$” <- “The higher $N_{min}$”
  7. Section 6: Replace (Amaratunga, Cabrera, and Lee 2008) by Amaratunga, Cabrera, and Lee (2008) and (Maudes et al. 2012) by Maudes et al. (2012)
  8. Breiman is the only author whose first name (Leo) appears in the text, as well as in the list of References. First name to be removed?
  9. Section 7: Add some space between citations “Robnik-Šikonja (2004);Tsymbal, Pechenizkiy, and Cunningham (2006)”
  10. Section 8: “The classes have equal prior probabilities”: this may sound awkward to Bayesians… maybe just say “The classes have equal probabilities (or weights)”
  11. Section 8: “The two first” <- “The first two” (other instances too)
  12. “Note that during the preparation of the manuscript we detect […] R package ranger and have to redo…” <- “Note that during the preparation of the manuscript we detectED […] R package ranger and HAD to redo…”
  13. “R package ranger”: use same font as with other packages?

Thank you for your careful reading of our manuscript. All the typos have been corrected.

Some random comments and thoughts about differences between pdf and html rendering

  1. Both Algorithms 1 & 2 appear twice in the pdf rendering of the manuscript (but only once in the html version).
  2. Some code snippets appear in the pdf but not in html.
  3. Some words in the text have hyperlinks associated (which is a cool feature since it adds some possible further reading). While the hyperlinks are visible in html (they appear in blue font), this is not the case for the pdf version. Maybe just add something like “\hypersetup{colorlinks,citecolor=blue,linkcolor=red,urlcolor=blue}”?

Thank you for your comments, we hope we have addressed all these issues in the revised manuscript.

Second round

Thank you for your comprehensive responses to my comments and those of the other reviewer. I already had a positive view of the paper after the first round, and am now in favor of accepting it.