neulab/ExplainaBoard

Is the current applicable condition of t-test correct?

Closed this issue · 22 comments

The Metric.calc_confidence_interval method considers the t-test applicable only when a given evaluation metric score is a simple average of statistics. I'm wondering whether any other condition needs to be checked. I would appreciate a reference on the applicability conditions of the t-test.

@neubig @odashi I haven't found reliable statistics resources about the condition. It would be great if someone could point out such resources, if any exist.

@neubig @tetsuok

I think we need to break down the conditions under which it is applicable.

  1. First, the t-test is applicable if and only if the target variables are sampled from a normal distribution. I think non-aggregated statistics satisfy this condition very rarely.
  2. According to the central limit theorem, the sum (or mean) of samples tends toward a normal distribution regardless of the original distribution if the sample size is large enough. So if the target variables are calculated by simple summation or averaging, the t-test may be applicable (a small simulation sketch follows below).
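To make condition 2 concrete, here is a minimal simulation sketch (illustrative only, not ExplainaBoard code): it draws per-example statistics from a clearly non-normal distribution and shows that means of repeated samples are much closer to normal than the raw values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Raw per-example statistics from a clearly non-normal (exponential) distribution.
raw = rng.exponential(scale=1.0, size=10000)

# Means of many independent samples of size n: by the CLT these tend toward normality.
n = 50
sample_means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)

print("skewness of raw values:  ", stats.skew(raw))           # roughly 2 for an exponential
print("skewness of sample means:", stats.skew(sample_means))  # close to 0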

I believe this is correct. See 3.2.1 here: https://aclanthology.org/P18-1128.pdf

@neubig It looks like the paper says the same thing as my comment.

In the current implementation, the sufficient statistics of the t distribution are obtained by aggregating the original values of the given sample.

if num_stats != 1:
    raise ValueError(
        "t-test can be applied for only 1 stat, "
        f"but the MetricStats has {num_stats} stats."
    )
my_mean = np.mean(stats_data)
my_std = np.std(stats_data)
if my_std == 0.0:
    return (float(my_mean), float(my_mean))
return stats_t.interval(
    alpha=confidence_alpha,
    df=stats_data.shape[-2] - 1,
    loc=my_mean,
    scale=my_std,
)

This means that the code does not satisfy condition 2 of my comment, and this code looks applicable only when the original values themselves are normally distributed.

Additional notes:

We may be able to assume that the variable my_mean is sampled from a normal distribution $\mathcal{N}_m$ under the is_simple_average condition. However, we don't know the mean and the variance of $\mathcal{N}_m$, which are different from the ones calculated from the data.

Good question. Here is a reference that I think answers the question for large n: https://stats.stackexchange.com/a/44280

For small n (e.g. less than 30), the data being non-Gaussian could be an issue. Here is a nice simulation that examines how inaccurate the confidence intervals are for variously distributed data: https://stats.stackexchange.com/questions/242037/robustness-of-the-student-t-test-to-non-gaussian-data

To give some background: we used to calculate confidence intervals only using the bootstrap, but the bootstrap is slow and was actually the major speed bottleneck in ExplainaBoard analyses. So this is a faster (but possibly less precise) solution.

One option would be to use the t-test for larger values of n, and bootstrap for smaller values of n. But then how to choose this threshold would be another tricky question (also dependent on the underlying data distribution).
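A minimal sketch of that idea (assuming a threshold of 30 and a percentile bootstrap; the function name and parameters are illustrative, not the actual ExplainaBoard API):

import numpy as np
from scipy.stats import t as stats_t

def confidence_interval(values, confidence_alpha=0.95, threshold=30, n_boot=1000):
    """Hypothetical helper: t-interval for large samples, bootstrap for small ones."""
    n = len(values)
    mean = float(np.mean(values))
    if n > threshold:
        # Parametric interval around the mean, using the standard error of the mean.
        sem = float(np.std(values, ddof=1) / np.sqrt(n))
        if sem == 0.0:
            return (mean, mean)
        lower, upper = stats_t.interval(confidence_alpha, df=n - 1, loc=mean, scale=sem)
        return (float(lower), float(upper))
    # Percentile bootstrap: resample, recompute the mean, and take the central interval.
    rng = np.random.default_rng(0)
    boot_means = rng.choice(values, size=(n_boot, n), replace=True).mean(axis=1)
    lo = (1.0 - confidence_alpha) / 2.0
    return (float(np.quantile(boot_means, lo)), float(np.quantile(boot_means, 1.0 - lo)))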

  • The CLT cannot be applied to the original distribution.
  • The CLT can be applied to the distribution of sums/means (with a large enough sample size) generated from the original data.

The problem here may be that the current code actually estimates the former distribution, implicitly assuming that the sample is normally distributed. If we want to apply the CLT assumption, we need to prepare datasets of sums somehow, not the original data.

btw, I drew some examples of the CLT in shorter code:
[attached image: simulated CLT examples]

(Independent of the figure above.) In general, I think 30 would be a fair threshold for switching between a parametric test and bootstrapping.

Hi!

we need to prepare datasets of sums somehow, not the original data.

That's basically what the bootstrap test is doing right?

AFAIK, Student's t-test avoids the need for explicit bootstrapping by assuming that the sample mean follows Student's t-distribution (which converges to the normal at large n). This is exactly true if the data is normally distributed, not true but close enough for non-normal data with large n (and finite variance), and not true and potentially very inaccurate for non-normal data with small n. Here is another reference that explains this in a bit more detail: https://stats.stackexchange.com/questions/9573/t-test-for-non-normal-when-n50
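For intuition on the "converges to the normal at large n" part, a quick sketch comparing two-sided 95% critical values of the t distribution with the normal one:

from scipy.stats import norm, t

# Two-sided 95% critical values: the t value approaches the normal 1.96 as df grows.
print("normal:", round(norm.ppf(0.975), 3))              # ~1.960
for df in (5, 10, 30, 100):
    print(f"t (df={df}):", round(t.ppf(0.975, df), 3))   # ~2.571, 2.228, 2.042, 1.984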

Hi all, I went back over the theory around this.

That's basically what the bootstrap test is doing right?

Yes

t-test avoids the need for explicit bootstrapping by assuming that the sample mean follows Student's t-distribution

Yes, and the topic we need to discuss is whether we are calculating the correct statistics from the samples.

First, I noticed that I had a misunderstanding about the following statistics:

my_mean = np.mean(stats_data)
my_std = np.std(stats_data)

These calculate the correct quantities for one-sample t-tests. I originally thought that the means/stds should be calculated from a collection of sample means, but according to a textbook, it is sufficient to calculate them from the sample itself. Sorry for the confusion!

As we discussed above, the t-test is only applicable when we can assume that the sample mean is drawn from a normal distribution (not a t distribution). This requires that the sample size is large enough, according to the CLT. Therefore, the condition for the test in our code should be updated to at least:

is_simple_average and sample_size > threshold
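In code, the guard might look roughly like this (a sketch only; threshold and the bootstrap fallback are hypothetical names, not the actual implementation):

# Sketch of the proposed applicability check (hypothetical names).
if not (self.is_simple_average() and stats_data.shape[-2] > threshold):
    # Fall back to bootstrapping (or refuse the t-based interval) when the metric
    # is not a simple average or the sample is too small.
    return self._bootstrap_confidence_interval(stats_data, confidence_alpha)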

I also noticed that there are other issues with the current code. It calculates a confidence interval of one sample, meaning that it is attempting a one-sample test, and the test requires equality of variances between two samples. I think this cannot easily be assumed in most cases, and we need to revise the underlying algorithm.

One possible way is to apply Welch's test, which is robust enough against inequality of variances.

Another possibility is to apply the paired t-test, as discussed in the paper Graham mentioned above.

For either Welch's test or the paired test, we need two samples at the same time, and a one-sample confidence interval is not applicable.
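Both variants are readily available in SciPy; here is a minimal sketch with toy per-example scores for two systems (the numbers are made up, for illustration only):

import numpy as np
from scipy import stats

# Toy per-example scores for a baseline and a proposed system on the same test set.
baseline = np.array([0.71, 0.65, 0.80, 0.55, 0.62, 0.78, 0.69, 0.74])
proposed = np.array([0.75, 0.66, 0.83, 0.60, 0.61, 0.82, 0.73, 0.79])

# Welch's t-test: a two-sample test that does not assume equal variances.
welch = stats.ttest_ind(proposed, baseline, equal_var=False)

# Paired t-test: uses the per-example pairing, i.e. tests whether the mean difference is zero.
paired = stats.ttest_rel(proposed, baseline)

print("Welch:  t =", welch.statistic, "p =", welch.pvalue)
print("Paired: t =", paired.statistic, "p =", paired.pvalue)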

Thanks!

I agree with this:

is_simple_average and sample_size > threshold

I'm not sure I totally understood this though:

It calculates a confidence interval of one sample, meaning that it is attempting a one-sample test, and the test requires equality of variances between two samples.

AFAIK, the t-test can also calculate the confidence interval of the mean of a single sample, in addition to calculating the difference between means of two samples. For instance, see the first two sections of this article: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests

@neubig The one-sample t-test is applicable only when we can assume that the two samples have the same variance (because it is sensitive only to the variance of one sample). This is in general not guaranteed in our pipeline.

Sorry, I said something wrong above: I was talking about the conditions for two-sample t-tests, and there are other conditions for one-sample testing.

@neubig Hi, I think my previous comment was close to the point on which we need to reach a consensus.

A one-sample t-test is performed under the assumption that the population mean of the null hypothesis $\mu_0$ is strictly a constant. If the hypothesis follows this condition, e.g., "the accuracy of the proposed system is distributed around 70%", then I think a one-sample t-test is applicable.

If we need to compare two systems, we basically need to assume that these systems (baseline and proposed) are symmetric (both are distributed and have the same set of distributional properties). In this case, confidence intervals calculated from one sample are not suitable for judging the hypothesis, and we need other tests that compare two systems explicitly: pairwise testing (including difference testing) or two-sample testing. In difference testing, we can also calculate a confidence interval over the distribution of differences, which is not the same as a one-sample confidence interval.
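As an illustration of the last point, a confidence interval over the differences can be computed like a one-sample interval, but on the per-example score differences (a sketch with made-up numbers):

import numpy as np
from scipy.stats import t as stats_t

# Per-example score differences between a proposed system and a baseline (toy numbers).
diffs = np.array([0.04, 0.01, 0.03, 0.05, -0.01, 0.04, 0.04, 0.05])

n = len(diffs)
mean_diff = float(np.mean(diffs))
sem = np.std(diffs, ddof=1) / np.sqrt(n)

# 95% t-based interval for the mean difference; it excludes 0 exactly when the
# paired t-test rejects at the 5% level.
lower, upper = stats_t.interval(0.95, df=n - 1, loc=mean_diff, scale=sem)
print(f"mean difference = {mean_diff:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")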

OK, great, so I think we agree then? Just rephrasing to make sure that we're on the same page:

For calculating confidence intervals of a single sample, what we're doing now is probably OK, but we should add a check that the sample size is larger than a threshold (e.g. 30).

For comparing two systems, we need to implement something else (issue #325). Note that pairwise bootstrap tests are actually already implemented in explainaboard_web, so the code could probably be moved into ExplainaBoard: neulab/explainaboard_web#259

@neubig We finally need to break down this issue into actual tasks, I guess.
@tetsuok Could you check the whole thread?

@odashi sure.

@odashi I think the required code change and the direction look good to me.

@tetsuok I think the only necessary task is modifying the condition:

is_simple_average and sample_size > threshold

If you have time, could you make this change?

@odashi sure.

Closing as the required change was merged as c7a9fad.