/phoneme-frequencies

Primary language: Makefile. License: MIT.

Summary

Estimates of the relative frequencies of English phonemes, both overall and immediately following /w/.

Methodology

Reproducing the work of Doug Blumeyer, I correlated the CMU Pronouncing Dictionary ("CMUdict") and Adam Kilgarriff's unlemmatized frequency list for the British National Corpus to find phoneme frequencies generally. I extended this technique to estimate post-/w/ phoneme frequencies as well.
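The calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in this repository: the function names, the toy pronunciations, and the toy word counts are all hypothetical, and collapsing CMUdict's stress digits (so that `AH0` and `AH1` count as one phoneme) is an assumption the text does not confirm. Each phoneme occurrence is weighted by the corpus frequency of the word it appears in, and words missing from the dictionary are skipped.

```python
from collections import Counter


def strip_stress(phoneme):
    # CMUdict vowels carry a stress digit (0, 1 or 2), e.g. "AH0".
    # Dropping it is an assumption of this sketch.
    return phoneme.rstrip("012")


def phoneme_counts(pronunciations, word_counts):
    """Weight every phoneme occurrence by the frequency of its word."""
    totals = Counter()
    for word, count in word_counts.items():
        phones = pronunciations.get(word)
        if phones is None:
            continue  # word not in the dictionary: skipped
        for phone in phones:
            totals[strip_stress(phone)] += count
    return totals


def following_counts(pronunciations, word_counts, target="W"):
    """Count phonemes that come immediately after `target` within a word."""
    totals = Counter()
    for word, count in word_counts.items():
        phones = [strip_stress(p) for p in pronunciations.get(word, [])]
        for previous, following in zip(phones, phones[1:]):
            if previous == target:
                totals[following] += count
    return totals


def relative(counts):
    """Turn raw weighted counts into relative frequencies."""
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


# Toy inputs, invented for illustration only.
pronunciations = {
    "we": ["W", "IY1"],
    "was": ["W", "AA1", "Z"],
    "the": ["DH", "AH0"],
}
word_counts = {"we": 3, "was": 1, "the": 6, "zzz": 5}

overall = relative(phoneme_counts(pronunciations, word_counts))
after_w = relative(following_counts(pronunciations, word_counts))
print(after_w)
```

With these toy inputs, /w/ is followed by `IY` three times out of four and by `AA` once, so `after_w` is `{"IY": 0.75, "AA": 0.25}`. This sketch only looks at adjacent phonemes inside a word, so a /w/ at the end of one word and the first phoneme of the next word are not paired.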

Limitations

As Blumeyer notes, the source datasets have some limitations. CMUdict conflates "schwa with the near-open central vowel" and has "several noticeable errors." Kilgarriff's frequency list has formatting issues that make it hard to work with words containing accents or apostrophes, including common contractions. At this time, I have ignored this issue entirely.

Blumeyer did manual error checking on several hundred of the most common words. I have not done this.

CMUdict gives multiple pronunciations for some words. For these words, I used only the first pronunciation given. It is not clear to me whether the alternatives are listed in a meaningful order or arbitrarily.
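As a rough illustration of the "first pronunciation only" rule, the sketch below parses dictionary lines and keeps the first entry seen for each word. It assumes a plain-text layout in which each line is a word followed by its phonemes, comment lines start with `;;;`, and alternate pronunciations are marked with a parenthesized number after the word (for example `READ(1)`). The function name and the sample lines are hypothetical, written in that assumed layout rather than quoted from the dictionary.

```python
import re

# Matches the "(1)", "(2)", ... suffix assumed to mark alternate pronunciations.
ALT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_first_pronunciations(lines):
    """Map each lowercased word to the first pronunciation listed for it."""
    first = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        base = ALT_SUFFIX.sub("", word).lower()
        # setdefault keeps the earliest entry and ignores later alternates.
        first.setdefault(base, phones)
    return first


# Illustrative lines in the assumed layout.
sample = [
    ";;; comment line",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
    "WITH  W IH1 DH",
]

print(parse_first_pronunciations(sample))
```

Because the rule depends only on file order, whichever pronunciation appears first wins, whether or not it is the most common one in running text.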

Other notes

While the Kilgarriff list is for the British National Corpus, a quick inspection suggests that it uses American pronunciations over British ones.

References