Contact: graham.yvette@gmail.com
The following is a description of how to carry out significance tests for an increase in Pearson correlation of one metric (or QE system) over that of a baseline metric (or baseline QE system) and is described further in the following papers:
Yvette Graham & Timothy Baldwin. "Testing for Significance of Increased Correlation
with Human Judgment", EMNLP 2014.
Yvette Graham, Timothy Baldwin, Nitika Mathur. "Accurate Evaluation of
Segment-level Machine Translation Metrics", NAACL 2015.
Yvette Graham. "Improving Evaluation of Machine Translation Quality Estimation", ACL
2015.
-
"R Statistical Software". To install R on the command line:
> sudo apt-get install r-base
-
R's "psych" package. To install R's "psych" package, open your R console, by typing:
> R
If your institution uses a proxy server, you need to tell R about it BEFORE (important) installing any package, here's what to do: Type the following commmand into R, remembering to provide your actual credentials and proxy server details:
> Sys.setenv(http_proxy="http://myusername:mypasssord@myproxyserver.com:8080/")
Install the "psych" package
> install.packages("psych")
You'll be given an option of a CRAN site, when you have one selected, you might need to answer "y" to some questions. When "psych" is finished installing, type the following to quit R:
> quit("no")
To run a one-sided application of Williams Test to test the significance of the increase in correlation between r12 and r13 below, with sample size n, where
-
r12: Pearson correlation between human scores and metric A,
-
r13: Pearson correlation between human scores and metric B,
-
r23: Pearson correlation between metric A and metric B.
R --no-save --args \<r12\> \<r13\> \<r23\> \<n\> < williams.R
This returns the p-value for the test that r12 is greater than r13.
For example, Williams test with the following:
R --no-save --args 0.65 0.55 0.3 50 < williams.R
Will produce the following output:
Williams Test for Increase in Correlation
r12 correlation( human, metric_a ) : 0.65
r13 correlation( human, metric_b ) : 0.55
r23 correlation( metric_a, metric_b) : 0.3
Sample size: 50
P-value: 0.209243482964009
Since p > 0.05, the increase in correlation is not significant.