Base code for the Tool for Automatic Measurement of Morphological Information (TAMMI)

TAMMI

Base code for the Tool for Automatic Measurement of Morphological Information (TAMMI). This does not include code for the graphical user interface.

The user-friendly version of the tool is available at linguisticanalysistools.org.

TAMMI 2.0 was specifically designed to annotate and count morphological features in texts. In developing TAMMI 2.0, we implemented automatic calculations for the MorphoLex dataframe provided by Sánchez-Gutiérrez et al. (2017). We also included an automatic calculation of the morphological complexity index (MCI) based on inflections, as detailed by Brezina and Pallotti (2019). In addition, we calculated an MCI for derivational morphemes and developed new morphological complexity indices based on morphological variety and type-token ratios for both inflectional and derivational morphemes. Lastly, we added a number of basic morpheme counts. The indices reported by TAMMI 2.0 are discussed below.

Basic morpheme counts. TAMMI 2.0 includes basic morpheme counts for the number of tokens with inflections and derivational morphemes. The inflections are counted using spaCy (Honnibal & Montani, 2017) by assessing the differences between each token and its lemma. TAMMI 2.0 also computes the number of words that have prefixes and affixes as well as the number of compound words. In addition, TAMMI 2.0 calculates the total number of prefixes, roots, suffixes, and affixes per text as well as a combination of the number of roots and affixes and a combination of the number of roots, affixes, and inflections. TAMMI 2.0 also computes normed indices by taking the count for each variable and dividing it by 1) the number of content words (i.e., verbs, nouns, adjectives, adverbs) in the text, and by 2) the number of content words with the relevant morpheme. For example, when calculating indices for the number of prefixes, TAMMI 2.0 will count the number of prefixes in a text and provide two normed scores. The first score is the number of prefixes in the text divided by the number of content words in the text. The second score is the number of prefixes divided by the number of words with prefixes. Indices normed by morpheme type may not perform well on texts with simple morpheme use because each word may contain only a single morpheme of that type (e.g., if every prefixed word in a text contains exactly one prefix, the prefix-normed score is 1). Thus, users should only use these morpheme-normed counts when examining longer texts that are representative of more advanced language use.
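The two normed scores described above can be sketched as follows. This is a minimal illustration, not TAMMI's actual code: the function name `normed_scores`, the input format (one prefix count per content word), and the choice to return 0.0 when a denominator is zero are all assumptions made for the example.

```python
def normed_scores(morpheme_counts, n_content_words):
    """Return (count / content words, count / words containing the morpheme).

    morpheme_counts: one integer per content word, e.g. its number of prefixes.
    Returning 0.0 for an empty denominator is an assumption of this sketch.
    """
    total = sum(morpheme_counts)
    words_with = sum(1 for c in morpheme_counts if c > 0)
    by_content = total / n_content_words if n_content_words else 0.0
    by_morpheme = total / words_with if words_with else 0.0
    return by_content, by_morpheme


# Hypothetical text: 8 content words, 2 of which carry exactly one prefix.
prefix_counts = [1, 0, 0, 1, 0, 0, 0, 0]
by_content, by_prefix = normed_scores(prefix_counts, len(prefix_counts))
print(by_content)  # 0.25
print(by_prefix)   # 1.0
```

The second value shows the caveat noted above: when every prefixed word has exactly one prefix, the prefix-normed score is 1 regardless of how many prefixed words the text contains.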

Morphological variety. The inflection morphological variety feature in TAMMI 2.0 is based on a within-subset variety score in which content words from each text are broken into windows of 10 words (plus a window of 1-to-9 for any remaining content words at the end of the text). Inflectional morpheme types (e.g., -s and -ed) for each content word in the window, and null tokens for words without inflections in the window, are counted for each 10-word window and then divided by the total number of windows. A similar approach is used to assess derivational morpheme variety. However, since a content word could have multiple derivational morphemes, the windows of 10 words and/or null counts could have multiple derivational morphemes per word. Thus, a window of ten derivational morphemes and/or null counts may reflect 10 content words or fewer.

Morphological complexity. TAMMI 2.0 calculates an index for inflectional morphemes based on the MCI reported in Brezina and Pallotti (2019) by using the morphological variety counts above. For inflections, the within-subset variety score is added to the between-subset diversity score (i.e., the average number of unique morphemes when comparing subsets; for example, I loved him and she loves him each have one unique morpheme, -ed and -s). This score is then divided by the number of subsets minus 1. The same approach is followed to produce an MCI for the derivational morphemes, which was not reported by Brezina and Pallotti (2019).
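The windowing and the two building blocks of the MCI can be illustrated with a short sketch. It assumes that inflections have already been extracted as one label per content word (with `None` standing for the null token) and that "counted" means counting distinct types per window; the function names are hypothetical. The final combination of the two scores into an MCI is not shown here.

```python
def windows(items, size=10):
    """Split a list into consecutive windows; the last one may be shorter."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def within_subset_variety(inflections, size=10):
    """Average number of distinct inflection types per window.

    inflections: one entry per content word, e.g. "-s", "-ed", or None for
    an uninflected word. None is counted as its own (null) type, which is
    an assumption of this sketch.
    """
    subsets = windows(inflections, size)
    if not subsets:
        return 0.0
    return sum(len(set(w)) for w in subsets) / len(subsets)


def unique_to_each(subset_a, subset_b):
    """Return the types found only in the first subset and only in the second."""
    a, b = set(subset_a), set(subset_b)
    return a - b, b - a


# Hypothetical text with 12 content words: one full window plus a window of 2.
inflections = ["-s", None, "-ed", None, "-s", None, None, "-ing", None, "-ed",
               None, "-s"]
print(within_subset_variety(inflections))  # 3.0 (4 types, then 2 types)

# "I loved him" vs. "she loves him": each subset has one unique morpheme.
print(unique_to_each(["-ed"], ["-s"]))
```

For derivational morphemes, the same functions could be applied to a flat list of morphemes and null tokens rather than to one entry per word, so a window of ten entries may cover fewer than ten content words.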

Morpheme type-token counts. TAMMI 2.0 includes indices of type-token ratios for both inflectional and derivational morphemes. For inflectional morphemes, we take the number of unique inflectional morphemes in each 10-content-word window, divide it by the length of the window (the last window may be shorter than 10 words), and average the scores across the text. For derivational morphemes, we calculate a similar metric, but we use a 10-morpheme window because some content words have more than one derivational morpheme.
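A windowed type-token ratio of this kind might look like the sketch below. The function name `windowed_ttr` is hypothetical, and the handling of uninflected words is an assumption: `None` entries count toward the window length but not toward the set of morpheme types.

```python
def windowed_ttr(morphemes, size=10):
    """Mean of (unique morpheme types / window length) over consecutive windows.

    morphemes: one entry per content word (inflections) or a flat list of
    morphemes (derivations). None marks a word without the morpheme; it adds
    to the window length but is not counted as a type (an assumption).
    """
    ratios = []
    for start in range(0, len(morphemes), size):
        window = morphemes[start:start + size]
        types = {m for m in window if m is not None}
        ratios.append(len(types) / len(window))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical text with 14 content words: a full window and a window of 4.
inflections = ["-s", "-ed", None, "-ing", "-er", None, "-est", "-s", None, "-ed",
               None, "-s", "-s", None]
print(windowed_ttr(inflections))  # 0.375 (5/10 and 1/4, averaged)
```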

MorphoLex Variables. TAMMI 2.0 depends on MorphoLex to calculate variables related to frequency/length, family size counts and frequency, and hapax counts. TAMMI 2.0 matches tokens reported in spaCy to the MorphoLex dictionary. As with the basic counts, TAMMI 2.0 computes mean scores for MorphoLex variables within a text by taking the count for each variable and dividing it by the number of content words (i.e., verbs, nouns, adjectives, adverbs) in the text to provide a normed score. TAMMI 2.0 also produces MorphoLex indices normalized by words with the specific morpheme of interest. The MorphoLex variables calculated are discussed below.

Morpheme frequency/length counts. For roots, prefixes, suffixes, and all affixes, TAMMI 2.0 extracts frequency counts for morphemes from MorphoLex. The frequency count comes from the HAL counts found in the ELP (Balota et al., 2007). TAMMI 2.0 computes a raw frequency count and a logged frequency count. TAMMI 2.0 also calculates the average length of the roots, prefixes, suffixes, and all affixes.
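As a rough illustration of these three summaries, the sketch below looks up roots in a small dictionary and averages raw frequency, logged frequency, and length. Everything here is assumed for the example: the `ROOT_FREQ` table and its numbers are made up rather than taken from MorphoLex, the log base (10) is a guess, and averaging over matched roots only is one of several possible normings.

```python
import math

# Hypothetical stand-in for a few MorphoLex rows: root -> HAL frequency.
# The numbers are invented for illustration.
ROOT_FREQ = {"word": 1000, "appear": 100, "pleas": 10}


def root_summary(roots):
    """Mean raw frequency, mean log10 frequency, and mean length of matched roots.

    Roots missing from the lookup table are skipped. Returns None when
    nothing matches. Frequencies are assumed to be positive.
    """
    matched = [r for r in roots if r in ROOT_FREQ]
    if not matched:
        return None
    n = len(matched)
    raw = sum(ROOT_FREQ[r] for r in matched) / n
    logged = sum(math.log10(ROOT_FREQ[r]) for r in matched) / n
    length = sum(len(r) for r in matched) / n
    return raw, logged, length


# "xyzzy" is not in the table, so only three roots contribute.
summary = root_summary(["word", "appear", "pleas", "xyzzy"])
print(summary)
```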

Morpheme family size counts. For roots, prefixes, suffixes, and all affixes, TAMMI 2.0 derives family size counts from MorphoLex. Family size for a morpheme is calculated by counting the number of word types to which the morpheme can attach itself. As an example, in the pool of words attendance, pleasance, pleasure, appearance, the suffix -ance has a family size of 3, but the root pleas has a family size of 2 (example taken from Sánchez-Gutiérrez et al., 2017). For roots, family size is calculated by the number of words a root can produce (e.g., the count for the number of words that have theo as a root).
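The example pool above can be worked through in a few lines. TAMMI 2.0 reads family sizes from MorphoLex rather than computing them; this sketch only shows how the counts in the example arise. The root/suffix segmentations in `POOL` are assumed for illustration.

```python
from collections import defaultdict

# Hypothetical segmentations of the example pool: word -> (root, suffix).
POOL = {
    "attendance": ("attend", "ance"),
    "pleasance": ("pleas", "ance"),
    "pleasure": ("pleas", "ure"),
    "appearance": ("appear", "ance"),
}


def family_sizes(pool):
    """Count the distinct word types each root and each suffix occurs in."""
    root_family = defaultdict(set)
    suffix_family = defaultdict(set)
    for word, (root, suffix) in pool.items():
        root_family[root].add(word)
        suffix_family[suffix].add(word)
    roots = {m: len(ws) for m, ws in root_family.items()}
    suffixes = {m: len(ws) for m, ws in suffix_family.items()}
    return roots, suffixes


roots, suffixes = family_sizes(POOL)
print(suffixes["ance"])  # 3
print(roots["pleas"])    # 2
```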

Morpheme family size frequency. TAMMI 2.0 also reports the percentage of other words in the family that are more frequent (PFMF) from MorphoLex. This feature counts the percentage of morphemes per word that are more frequent by dividing the number of more frequent words in a family by the total number of family members. For instance, word, wordiness, and wordlessly all have the same root (i.e., word), but the word word is the most frequent type in the family (PFMF = 0%) whereas wordlessly has 10 terms that are more frequent (PFMF = 45%) and wordiness has 15 types that are more frequent (PFMF = 70%; example taken from Sánchez-Gutiérrez et al., 2017). Thus, a lower value indicates a word that is more frequent in the family, and higher PFMF values can contribute to greater morphological complexity.
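The PFMF calculation can be sketched as below. TAMMI 2.0 takes PFMF values from MorphoLex, so this is only an illustration of the ratio. The frequencies and the four-member family are made up, and the denominator follows the wording above literally (all family members, including the word itself); MorphoLex may define it differently.

```python
def pfmf(word, family_freqs):
    """Percentage of family members that are more frequent than `word`.

    family_freqs: word -> frequency for every member of the family,
    including `word` itself. Using the full family as the denominator
    is an assumption of this sketch.
    """
    target = family_freqs[word]
    more_frequent = sum(1 for f in family_freqs.values() if f > target)
    return 100 * more_frequent / len(family_freqs)


# Made-up frequencies for a hypothetical four-member family.
family = {"word": 900, "wordy": 40, "wordless": 12, "wordiness": 5}
print(pfmf("word", family))       # 0.0
print(pfmf("wordiness", family))  # 75.0
```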

Hapax counts. Hapaxes are defined in MorphoLex as words that only appear once in a corpus. Affixes that attach to a greater number of hapaxes are more productive and can be used to create new words. TAMMI 2.0 derives two types of hapax counts from MorphoLex: the number of prefixes/suffixes/affixes that are attached to hapaxes and the number of hapaxes that include prefixes/suffixes/affixes.