PT65 Dataset
This repository contains the word-pair lists used in the PROPOR 2014 paper "Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia". It holds three lists: RG65, the English version created by Rubenstein and Goodenough [1]; JI65, the French version created by Joubarne and Inkpen [2]; and PT65, the Portuguese version created by Granada et al. [3].
In this dataset, all word pairs were manually judged by 50 undergraduate and graduate students, who were asked to rate each pair according to its semantic relatedness. Scores range from 0 to 4, and the results were averaged over all 50 subjects. The average agreement among subjects, measured with Pearson correlation, was r = .71 with a standard deviation σ = .13. Below we present all 65 pairs of words, each followed by its score.
word 1 | word 2 | score |
---|---|---|
cordão | sorriso | 0.26 |
galo | viagem | 0.14 |
almoço | barbante | 0.22 |
fruta | forno | 0.92 |
autógrafo | costa | 0.48 |
automóvel | bruxo | 0.28 |
monte | fogão | 0.26 |
risada | instrumento | 0.70 |
manicômio | fruta | 0.24 |
manicômio | monge | 0.64 |
cemitério | hospício | 1.14 |
cálice | mágico | 1.26 |
menino | galo | 0.92 |
almofada | bijuteria | 0.68 |
monge | escravo | 0.90 |
manicômio | cemitério | 1.16 |
litoral | floresta | 1.52 |
risada | rapaz | 1.06 |
costa | bosque | 1.14 |
monge | oráculo | 1.76 |
menino | sensato | 1.08 |
automóvel | almofada | 0.78 |
monte | costa | 1.22 |
rapaz | bruxo | 1.56 |
floresta | cemitério | 0.88 |
comida | galo | 1.38 |
cemitério | bosque | 1.00 |
costa | viagem | 1.60 |
pássaro | bosque | 1.98 |
litoral | colina | 1.44 |
forno | instrumento | 1.28 |
grua | galo | 0.04 |
colina | bosque | 1.64 |
carro | jornada | 1.56 |
cemitério | monte | 0.88 |
cálice | bijuteria | 0.66 |
mágico | oráculo | 2.08 |
grua | instrumento | 2.00 |
irmão | rapaz | 2.42 |
sensato | bruxo | 0.84 |
oráculo | sensato | 1.18 |
pássaro | grua | 0.24 |
pássaro | galo | 2.50 |
comida | fruta | 3.32 |
irmão | monge | 1.84 |
manicômio | hospício | 3.76 |
forno | fogão | 3.68 |
mágico | bruxo | 3.40 |
colina | monte | 3.56 |
cordão | barbante | 3.84 |
cálice | taça | 3.78 |
risada | sorriso | 3.64 |
servo | escravo | 3.58 |
jornada | viagem | 3.64 |
autógrafo | assinatura | 3.64 |
litoral | costa | 3.74 |
floresta | bosque | 3.76 |
instrumento | ferramenta | 3.64 |
galo | galo | 4.00 |
menino | rapaz | 3.58 |
almofada | travesseiro | 3.38 |
cemitério | cemitério | 4.00 |
automóvel | carro | 3.92 |
meio-dia | almoço | 3.22 |
jóia | bijuteria | 3.34 |
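One common use of a list like this is to check how well an automatic relatedness measure tracks the human scores, typically by correlating the two. The sketch below is only an illustration, not the procedure from the paper: it parses a few rows copied from the table above and computes a Pearson correlation against a placeholder measure. The names `parse_pairs`, `pearson` and `toy_relatedness` are hypothetical, and the character-overlap measure is a stand-in for whatever measure you actually want to evaluate.

```python
import math

# A few rows copied from the table above, in the same
# "word 1 | word 2 | score |" layout. Paste the full table to use all 65 pairs.
SAMPLE_ROWS = """\
word 1 | word 2 | score |
---|---|---|
cordão | sorriso | 0.26 |
galo | viagem | 0.14 |
pássaro | galo | 2.50 |
comida | fruta | 3.32 |
cordão | barbante | 3.84 |
"""


def parse_pairs(text):
    """Parse pipe-delimited rows into (word1, word2, score) tuples."""
    pairs = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        cells = [cell for cell in cells if cell]
        if len(cells) != 3:
            continue
        try:
            score = float(cells[2])
        except ValueError:
            continue  # header or separator row, not a data row
        pairs.append((cells[0], cells[1], score))
    return pairs


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def toy_relatedness(word1, word2):
    """Hypothetical placeholder measure: Jaccard overlap of character sets."""
    a, b = set(word1), set(word2)
    return len(a & b) / len(a | b)


pairs = parse_pairs(SAMPLE_ROWS)
gold = [score for _, _, score in pairs]
predicted = [toy_relatedness(w1, w2) for w1, w2, _ in pairs]
r = pearson(gold, predicted)
print(f"Pearson r on {len(pairs)} sample pairs: {r:.2f}")
```

To evaluate a real measure, replace `toy_relatedness` with your own function and feed `parse_pairs` the complete table. A rank correlation such as Spearman's is also often reported alongside Pearson for this kind of comparison.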
How to cite
When using the PT65 dataset in academic papers, please use this BibTeX entry:
@inproceedings{GranadaEtAl2014propor,
author = {Granada, Roger and Trojahn, Cassia and Vieira, Renata},
title = {Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia},
booktitle = {International Conference on Computational Processing of the Portuguese Language},
series = {PROPOR 2014},
location = {S{\~a}o Carlos, Brazil},
pages = {170--175},
isbn = {978-3-319-09761-9},
doi = {10.1007/978-3-319-09761-9_17},
url = {https://link.springer.com/chapter/10.1007%2F978-3-319-09761-9_17},
month = {Oct},
year = {2014},
publisher = {Springer International Publishing}
}
Acknowledgment
This study was partially supported by the CAPES-COFECUB Cameleon project number 707/11.
References
[1] Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of the ACM 8(10), pp. 627–633, 1965.
[2] Joubarne, C., Inkpen, D.: Comparison of Semantic Similarity for Different Languages Using the Google N-gram Corpus and Second-Order Co-occurrence Measures. In: Proceedings of the 24th Canadian Conference on Advances in Artificial Intelligence (Canadian AI'11), pp. 216–221, 2011.
[3] Granada, R., Trojahn, C., Vieira, R.: Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia. In: International Conference on Computational Processing of the Portuguese Language (PROPOR 2014), pp. 170–175, 2014.