/cog-sge-semantics


Semantic similarity in Singapore English

This repository contains the scripts and datasets used in the study of the same name.

  • data/: contains the SimLex-999 dataset and the cleaned data collected from human participants.
  • embeddings/: contains word embeddings trained on a subset of the Corpus of Contemporary American English (COCA) and on a Singapore English (SgE) corpus collated by Lin et al. (2022).
  • scripts/: contains Python scripts used in data cleaning, data analysis, and the training of word embeddings.
    • clean/: scripts used in data cleaning.
    • compare/: scripts used in data analysis.
    • train/: scripts used in the training of word embeddings.
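
As a rough illustration of the kind of analysis these folders support, the sketch below scores word pairs by the cosine similarity of their embeddings and then measures how well those scores agree with human ratings using Spearman's rank correlation. It is not taken from the scripts in this repository: the function names, the toy vectors, and the toy ratings are all hypothetical, and the actual file formats and analysis steps may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical 2-dimensional vectors, made up for illustration only.
toy_vectors = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "car": [0.0, 1.0],
    "bus": [0.4, 0.9],
}

# Hypothetical (word1, word2, rating) triples, not real SimLex-999 values.
toy_ratings = [
    ("happy", "glad", 9.0),
    ("car", "bus", 6.0),
    ("happy", "car", 1.0),
]

model_scores = [cosine_similarity(toy_vectors[a], toy_vectors[b])
                for a, b, _ in toy_ratings]
human_scores = [rating for _, _, rating in toy_ratings]

print(spearman(model_scores, human_scores))
```

With real data, the toy dictionary would be replaced by vectors loaded from embeddings/ and the toy triples by ratings loaded from data/. A rank-based measure is used here because cosine similarities and human ratings sit on different scales.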