Project: Bigram modeling of Welsh consonant mutation

R script and materials for modeling

This project was designed as a collaborated final project for a computational linguistics course at Northwestern University. The materials included were those I was responsible for as part of the collaboration. The modeling done by my collaborator is not included.

The goal of this project was to investigate the incidence of a phenomena called consonant mutation in Welsh. In this process, the initial consonants of words change (are mutated) depending on the characteristics of the word that precedes it. For example, the word 'caneuon' (meaning: songs) is pronounced as 'ganeuon' (c sound changes to g sound) when preceded by the word 'am' (meaning: about).

This project considered differences in the incidence of consonant mutation among the three classes of mutation in Welsh: soft, aspirate, and nasal (named according to the type of sound change they condition). Additionally, it considered whether the incidence of mutation differed across written and spoken corpora of Welsh. The current repository shows the modeling and results for the written corpus only (modeling of spoken corpus was done by collaborator). Finally, we considered whether the incidence of consonant mutation is sensitive to how likely any given word is to occur in the language (i.e., its frequency).

The results indicated that:

The incidence of consonant mutation differed across classes: soft > aspirate > nasal
The incidence of consonant mutation did not differ according to frequency of the mutated word in the language
Consonant mutation occurred a larger proportion of the time in written vs. spoken Welsh (results not shown)

Materials

This repository contains:

aspirate_environments.txt: a file containing the words that trigger aspirate mutation
CEG_modeling.py: Python script, which builds bigram models
CEG.txt: the target corpus of written Welsh
hiprobs.txt: a file containing the high frequency target words
lowprobs.txt: a file containing the low frequency target words
nasal_environments.txt: a file containing the words that trigger nasal mutation
proportionMutation_byClass.csv: output dataframe of incidence of mutation by class
proportionMutation_byFrequency.csv: output dataframe of incidence of mutation by target frequency
soft_environments.txt: a file containing the words that trigger soft mutation
welshConsonantMutation.pdf: project writeup, including theoretical motivation, methods, results, and interpretation
wordlist.txt: a file containing the target words selected for study

eng543/consonantMutation_bigramModeling

R script and materials for modeling

Materials