
PT65 Dataset: A word-pair dataset for semantic relatedness in Portuguese

PT65 Dataset

This repository contains the word-pair lists used in the PROPOR 2014 paper "Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia". There are three lists: RG65, the English version created by Rubenstein and Goodenough [1]; JI65, the French version created by Joubarne and Inkpen [2]; and PT65, the Portuguese version created by Granada et al. [3].

In the PT65 dataset, all word pairs were manually judged by 50 undergraduate and graduate students, who were asked to rate each pair according to its semantic relatedness. Scores range from 0 to 4, and the score reported for each pair is the average over all 50 subjects. The average agreement among subjects, measured with Pearson correlation, was r = .71 with a standard deviation σ = .13. Below we present all 65 word pairs, each followed by its score.

word1 word2 score
cordão sorriso 0.26
galo viagem 0.14
almoço barbante 0.22
fruta forno 0.92
autógrafo costa 0.48
automóvel bruxo 0.28
monte fogão 0.26
risada instrumento 0.70
manicômio fruta 0.24
manicômio monge 0.64
cemitério hospício 1.14
cálice mágico 1.26
menino galo 0.92
almofada bijuteria 0.68
monge escravo 0.90
manicômio cemitério 1.16
litoral floresta 1.52
risada rapaz 1.06
costa bosque 1.14
monge oráculo 1.76
menino sensato 1.08
automóvel almofada 0.78
monte costa 1.22
rapaz bruxo 1.56
floresta cemitério 0.88
comida galo 1.38
cemitério bosque 1.00
costa viagem 1.60
pássaro bosque 1.98
litoral colina 1.44
forno instrumento 1.28
grua galo 0.04
colina bosque 1.64
carro jornada 1.56
cemitério monte 0.88
cálice bijuteria 0.66
mágico oráculo 2.08
grua instrumento 2.00
irmão rapaz 2.42
sensato bruxo 0.84
oráculo sensato 1.18
pássaro grua 0.24
pássaro galo 2.50
comida fruta 3.32
irmão monge 1.84
manicômio hospício 3.76
forno fogão 3.68
mágico bruxo 3.40
colina monte 3.56
cordão barbante 3.84
cálice taça 3.78
risada sorriso 3.64
servo escravo 3.58
jornada viagem 3.64
autógrafo assinatura 3.64
litoral costa 3.74
floresta bosque 3.76
instrumento ferramenta 3.64
galo galo 4.00
menino rapaz 3.58
almofada travesseiro 3.38
cemitério cemitério 4.00
automóvel carro 3.92
meio-dia almoço 3.22
jóia bijuteria 3.34
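
The list above can be read programmatically and compared against the output of a relatedness measure. The following is a minimal sketch, not the evaluation code from the paper: it assumes the pairs are available as whitespace-separated text in the layout shown above, uses only a five-pair excerpt of the list, and the `system` scores are made-up placeholder values standing in for whatever measure is being evaluated. The names `parse_pairs`, `pearson`, `SAMPLE` and `system` are hypothetical.

```python
from math import sqrt

# Five rows copied from the list above; a real run would use all 65 pairs.
SAMPLE = """\
cordão sorriso 0.26
fruta forno 0.92
pássaro galo 2.50
comida fruta 3.32
automóvel carro 3.92
"""


def parse_pairs(text):
    """Parse 'word1 word2 score' lines into (word1, word2, score) tuples."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        try:
            score = float(parts[2])
        except ValueError:
            continue  # header row or non-numeric score
        pairs.append((parts[0], parts[1], score))
    return pairs


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


pairs = parse_pairs(SAMPLE)
gold = [score for _, _, score in pairs]

# Placeholder scores (assumption): replace with the output of the measure
# under evaluation, one value per pair, in the same order as `pairs`.
system = [0.05, 0.30, 0.55, 0.80, 0.95]

r = pearson(gold, system)
print(f"Pearson r over {len(pairs)} pairs: {r:.2f}")
```

The gold scores and the scores under evaluation do not need to share a scale, since Pearson correlation is invariant to linear rescaling of either series.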

How to cite

When using the PT65 dataset in academic papers, please use this BibTeX entry:

@inproceedings{Granada2014,
  author    = {Granada, Roger and Trojahn, Cassia and Vieira, Renata},
  title     = {Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia},
  booktitle = {International Conference on Computational Processing of the Portuguese Language},
  series    = {PROPOR 2014},
  location  = {S{\~a}o Carlos, Brazil},
  pages     = {170--175},
  isbn      = {978-3-319-09761-9},
  doi       = {10.1007/978-3-319-09761-9_17},
  url       = {https://link.springer.com/chapter/10.1007%2F978-3-319-09761-9_17},
  month     = {Oct},
  year      = {2014},
  publisher = {Springer International Publishing}
}


This study was partially supported by the CAPES-COFECUB Cameleon project number 707/11.


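[1] Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of the ACM 8(10), pp. 627–633, 1965.
[2] Joubarne, C., Inkpen, D.: Comparison of Semantic Similarity for Different Languages Using the Google N-gram Corpus and Second-Order Co-occurrence Measures. In: Proceedings of the 24th Canadian Conference on Advances in Artificial Intelligence (Canadian AI'11), pp. 216–221, 2011.
[3] Granada, R., Trojahn, C., Vieira, R.: Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia. In: International Conference on Computational Processing of the Portuguese Language (PROPOR 2014), pp. 170–175, 2014.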