CrowdWS-102 is a crowdsourced collection of similarity and relatedness scores for word pairs. Word similarity datasets such as WordSimilarity-353 (WS-353) are commonly used for evaluating NLP techniques, such as word-embedding methods. Existing word similarity datasets have typically been created by averaging the scores reported by several annotators into a gold standard.
The aim of this dataset is to collect word similarity scores from a large number of annotators in order to test the following three hypotheses (a code sketch after the list illustrates how H2 and H3 can be probed).
- (H1) The framing effect influences similarity ratings by human assessors.
- (H2) The distribution of similarity ratings does not follow a Gaussian distribution.
- (H3) Semantic relatedness is not symmetric: the same word pair (e.g., tiger and cat) can receive different similarity ratings when the word order is reversed.
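As a rough sketch of how H2 and H3 could be probed on a single pair's ratings, assuming scipy is available: the helper names are ours, and the specific tests (D'Agostino-Pearson for normality, Mann-Whitney U for symmetry) are illustrative choices, not necessarily those used in the paper.

from scipy import stats

def looks_gaussian(scores, alpha=0.05):
    # H2 check: D'Agostino-Pearson normality test on one pair's ratings.
    # Returns False when the ratings deviate significantly from a Gaussian.
    _, p = stats.normaltest(scores)
    return p >= alpha

def looks_symmetric(sim_scores, inv_sim_scores, alpha=0.05):
    # H3 check: compare the "sim" and "inv-sim" ratings of the same pair.
    # Returns False when word order appears to affect the ratings.
    _, p = stats.mannwhitneyu(sim_scores, inv_sim_scores)
    return p >= alpha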
The 102 word pairs in the dataset were randomly chosen from WS-353. We then published crowdsourcing tasks on Amazon Mechanical Turk to collect word similarity annotations on a 0-10 scale from 50 distinct annotators for each word pair, using the same instructions as WS-353.
Here is a screenshot of an example task.
To evaluate the symmetry of word similarity ratings, we tested four conditions (sim, inv-sim, dis, inv-dis) on each of the 102 selected pairs. For instance, the questions used to elicit the similarity score between words X and Y were as follows (see the sketch after the list):
- Similarity (sim): "How is X similar to Y?"
- Inverted similarity (inv-sim): "How is Y similar to X?"
- Dissimilarity (dis): "How is X dissimilar to Y?"
- Inverted dissimilarity (inv-dis): "How is Y dissimilar to X?"
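For concreteness, here is a minimal illustration of how a word pair expands into the four questions, keyed by the qtype labels used in the data file (the function is ours and purely illustrative):

def questions(x, y):
    # Expand a word pair into the four task questions, keyed by qtype.
    return {
        "sim":     f"How is {x} similar to {y}?",
        "inv-sim": f"How is {y} similar to {x}?",
        "dis":     f"How is {x} dissimilar to {y}?",
        "inv-dis": f"How is {y} dissimilar to {x}?",
    }

print(questions("tiger", "jaguar")["inv-sim"])  # How is jaguar similar to tiger?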
The file crowdws-102.csv contains the entire dataset. Its schema is described below.
- pair (str): Word pair
- word_x (str): Word X
- word_y (str): Word Y
- scores (list): List of scores (0-10). For qtype="sim" and qtype="inv-sim", 10 means closely related and 0 means very unrelated; for qtype="dis" and qtype="inv-dis", the scale is reversed (0 means closely related and 10 means very unrelated). Note that the scores therefore have opposite meanings for similarity and dissimilarity annotations.
- qtype (str): Question type {"sim", "dis", "inv-sim", "inv-dis"}
Here is example Python code that shows the first five rows of the dataset.
>>> import pandas as pd
>>> df = pd.read_csv("crowdws-102.csv")
>>> df.head()
pair word_x word_y \
0 word-similarity word similarity
1 tiger-jaguar tiger jaguar
2 territory-kilometer territory kilometer
3 stock-phone stock phone
4 start-year start year
scores qtype
0 [1.0, 1.0, 5.0, 5.0, 6.8, 0.0, 2.0, 4.0, 4.0, ... sim
1 [9.0, 10.0, 1.0, 7.0, 7.5, 8.0, 9.0, 7.0, 3.0,... sim
2 [3.0, 6.0, 4.0, 5.0, 5.9, 5.0, 5.0, 0.0, 6.0, ... sim
3 [0.0, 0.0, 3.0, 4.0, 7.9, 0.0, 3.0, 0.0, 4.0, ... sim
4 [2.0, 3.0, 8.0, 4.0, 4.8, 2.0, 2.0, 0.0, 4.0, ... sim
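Note that pandas reads the scores column back as plain strings. Below is a minimal sketch, assuming the column is serialized as Python-style list literals (as in the output above), that parses the lists and puts all four question types on a common 0-10 similarity scale by flipping the reversed "dis"/"inv-dis" ratings; the column names sim_scores and mean_sim are ours.

>>> import ast
>>> df["scores"] = df["scores"].apply(ast.literal_eval)  # "[9.0, ...]" -> list of floats
>>> def as_similarity(row):
...     # Flip dissimilarity ratings (10 - s) so that higher always means
...     # "more related", per the reversed scale described in the schema.
...     if row["qtype"] in ("dis", "inv-dis"):
...         return [10.0 - s for s in row["scores"]]
...     return row["scores"]
...
>>> df["sim_scores"] = df.apply(as_similarity, axis=1)
>>> df["mean_sim"] = df["sim_scores"].apply(lambda s: sum(s) / len(s))
>>> df[["pair", "qtype", "mean_sim"]].head()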
Please cite the following publication if you use the dataset in your work.
Malay Bhattacharyya, Yoshihiko Suhara, Md Mustafizur Rahman, and Markus Krause, "Possible Confounds in Word-based Semantic Similarity Test Data," in Proc. ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17), pp. 147-150, 2017. Available: http://dx.doi.org/10.1145/3022198.3026357
A blog article summarizing our project at CrowdCamp 2016 is also available.
CrowdWS-102 is made available under the Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0/.
This work was sponsored by the CrowdCamp workshop held at AAAI HCOMP 2016.