Semantic similarity in Singapore English

This repository contains the scripts and datasets used in the study of the same name.

data/: contains the SimLex-999 dataset and the cleaned data collected from human participants.
embeddings/: contains word embeddings trained on a subset of the Corpus of Contemporary American English (COCA) and on a Singapore English (SgE) corpus collated by Lin et al. (2022).
scripts/: contains Python scripts used in data cleaning, data analysis, and the training of word embeddings.
- clean/: scripts used in data cleaning.
- compare/: scripts used in data analysis.
- train/: scripts used in the training of word embeddings.
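
For orientation, the kind of comparison that embeddings and similarity ratings are typically used for can be sketched as below. This is a minimal illustration and not the code in scripts/compare/: the function names, the toy vectors, and the toy ratings are all invented for the example, and the use of cosine similarity with Spearman rank correlation is an assumption based on common practice for SimLex-999-style evaluations.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy 2-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "old": [1.0, 0.0],
    "new": [0.0, 1.0],
    "smart": [1.0, 1.0],
    "intelligent": [1.0, 0.9],
    "hard": [1.0, 0.2],
    "difficult": [0.5, 1.0],
}

# Toy (word1, word2, rating) triples; the ratings are made up,
# not taken from SimLex-999 or from the participant data.
toy_ratings = [
    ("old", "new", 1.0),
    ("smart", "intelligent", 9.0),
    ("hard", "difficult", 8.0),
]

model_scores = [
    cosine_similarity(toy_embeddings[w1], toy_embeddings[w2])
    for w1, w2, _ in toy_ratings
]
human_scores = [rating for _, _, rating in toy_ratings]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation on toy data: {rho:.3f}")
```

Rank correlation is used in this sketch because it compares only the ordering of word pairs, so the embedding similarities and the human ratings do not need to share a scale.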

arsatis/cog-sge-semantics