Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

naive_dt_fix_py

Fix wrong participle endings in Dutch automatically (Python). This is a port of naive-dt-fix.

Background on 'dt' errors in Dutch

Language users make mistakes. In Dutch, one of the most common mistakes concerns the spelling of the past participle of verbs. According to the normative literature, the last letter of this participle depends on the last sound of the verb stem.

  • If the last sound of the stem is voiced, the past participle receives the d ending. For example, in gebeuren, the last sound [r] in stem gebeur is voiced. The participle is thus gebeurd.
  • If the last sound of the stem is unvoiced, the past participle receives the t ending. For example, in passen, the last sound [s] in stem pas is unvoiced. The past participle is thus gepast.

In some cases, language users are unsure what the last sound of the stem actually is. For example, in razen, the last sound [z] of the stem raaz is voiced. However, because this sound is devoiced in the first person singular raas, language users come to think that raas is the stem, ending in an unvoiced s. As a result, they write geraast as the past participle, which is wrong: the correct form is geraasd.
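The rule above can be sketched in a few lines of Python. This is only an illustration and not part of the library: the function name participle_ending is hypothetical, and the sketch assumes a regular (weak) verb whose infinitive ends in -en. It reads the sound off the consonant before -en, where z and v are still spelled as voiced, using the well-known "'t kofschip" mnemonic for unvoiced consonants.

```python
# Letters that spell unvoiced stem-final sounds ("'t kofschip" mnemonic, plus x).
UNVOICED_LETTERS = set("tkfspx")


def participle_ending(infinitive: str) -> str:
    """Return 'd' or 't' for a regular (weak) Dutch verb ending in -en.

    Hypothetical helper for illustration only; irregular verbs and
    infinitives that do not end in -en are out of scope.
    """
    if len(infinitive) < 3 or not infinitive.endswith("en"):
        raise ValueError("this sketch only handles infinitives ending in -en")
    base = infinitive[:-2]  # "razen" -> "raz", "passen" -> "pass"
    if base.endswith("ch") or base[-1] in UNVOICED_LETTERS:
        return "t"  # unvoiced last sound
    return "d"  # voiced last sound


for verb in ["gebeuren", "passen", "razen"]:
    print(verb, "->", participle_ending(verb))
```

Note that razen yields d here because the infinitive still shows the voiced z, which is exactly the information that gets lost when a writer starts from raas.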

Dutch written documents are fraught with these kinds of mistakes. This can make it difficult to conduct linguistic research, since some verbs suddenly appear multiple times in different forms. This small library aims to rectify these mistakes in order to make the aggregation of forms easier.

How the library works

The proposed solution is inherently flawed in that it relies on the relative frequencies of the participles in the dataset you supply. It can be powerful, but it can also make mistakes (which is why the library is called naive-dt-fix). So, how does it work?

  1. The correction function first creates a list of all unique participles in your dataset.
  2. Then, it goes over each unique participle individually.
    • If the participle ends in d, it constructs that same participle, but ending in t. For example, if it comes across gedansd, it creates a form gedanst in memory.
    • If the participle ends in t, it constructs that same participle, but ending in d. For example, if it comes across ontwikkelt, it creates a form ontwikkeld in memory.
    • If the alternative form in memory is more frequent in your dataset than the original participle, the original participle is replaced with the more frequent form. For example, it is likely that most language users will spell gedanst correctly, so the incorrect form gedansd will be replaced.
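The steps above can be sketched as follows. This is a minimal standalone illustration of the frequency rule, not the library's actual code: the names alternative_form and naive_replacements are hypothetical, and the input is assumed to be a plain list of participles rather than a dataframe column.

```python
from collections import Counter
from typing import Dict, List, Optional


def alternative_form(participle: str) -> Optional[str]:
    """Swap a final d for t (or t for d); return None for other endings."""
    if participle.endswith("d"):
        return participle[:-1] + "t"
    if participle.endswith("t"):
        return participle[:-1] + "d"
    return None


def naive_replacements(participles: List[str]) -> Dict[str, str]:
    """Map each form to its d/t alternative when the alternative is more frequent."""
    counts = Counter(participles)  # frequency of every unique form
    replacements = {}
    for form in counts:  # step 2: visit each unique participle once
        alt = alternative_form(form)
        # A Counter returns 0 for forms that never occur in the data.
        if alt is not None and counts[alt] > counts[form]:
            replacements[form] = alt
    return replacements


data = ["gedanst"] * 3 + ["gedansd"] + ["ontwikkeld"] * 2 + ["ontwikkelt"]
mapping = naive_replacements(data)
fixed = [mapping.get(form, form) for form in data]
print(mapping)
```

In this sketch a tie leaves both forms untouched, since neither is strictly more frequent than the other.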

There are a few problems with this approach. I tried to rectify these as well as I could given the scope of the project:

  • Some forms are ambiguous and can appear both with d and with t. For example, the past participle of planten is geplant, while the participle of plannen is gepland. Whether either of these forms is spelt correctly depends on the context of the sentence. Since this is a naive approach, we do not actually look at the context. Still, we do not want geplant to be replaced by gepland when the latter form is by chance more frequent (or vice versa). This is why the library ships with a list of ambiguous cases. These cases are ignored. Should your dataset contain an ambiguous pair that is not included in the built-in list, you can supply your own list through the ignore_list argument.
  • Some forms most language users simply cannot spell. This means that the blatantly wrong form is the most frequent, which causes the replacement function to replace all correct forms with wrong ones. This is especially frequent for verbs derived from English. To combat this issue, the library ships with a list of absolutely correct participles. The replacement function will always give precedence to forms on this list. Should your dataset contain a very frequently misspelt form that is not included in the built-in list, you can supply your own list through the correct_list argument.
  • If a correct form does not appear at all, incorrect spellings will never be fixed. I built in a makeshift solution for this issue by always giving precedence to forms in the list of absolutely correct participles, but of course I cannot include every single correct participle manually. The whole purpose of this library is to sidestep these kinds of lists by relying on frequency information. It remains important to always check your output for mistakes.
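The way the two lists interact with the frequency rule might look roughly like the sketch below. It is an assumption about the behaviour described above, not the library's implementation: the function names are hypothetical, and the example lists are made up for illustration and do not reflect the contents of the built-in lists.

```python
from collections import Counter
from typing import Dict, Iterable, List


def swap_dt(form: str) -> str:
    """Swap a final d for t, or a final t for d (caller checks the ending)."""
    return form[:-1] + ("t" if form.endswith("d") else "d")


def replacements_with_lists(
    participles: List[str],
    ignore_list: Iterable[str] = (),
    correct_list: Iterable[str] = (),
) -> Dict[str, str]:
    """Frequency rule with an ignore list and a list of known-correct forms."""
    ignore, correct = set(ignore_list), set(correct_list)
    counts = Counter(participles)
    replacements = {}
    for form in counts:
        if not form.endswith(("d", "t")) or form in ignore or form in correct:
            continue  # nothing to swap, ambiguous, or already known to be right
        alt = swap_dt(form)
        # A known-correct alternative wins even if it is rare or absent.
        if alt in correct or counts[alt] > counts[form]:
            replacements[form] = alt
    return replacements


data = (
    ["geplant"] + ["gepland"] * 5
    + ["gedeleted"] * 4 + ["gedeletet"]
    + ["verwarmt"] + ["verwarmd"] * 3
)
result = replacements_with_lists(
    data,
    ignore_list=["geplant", "gepland"],  # illustrative ambiguous pair
    correct_list=["gedeletet"],          # illustrative known-correct form
)
print(result)
```

Here geplant survives although gepland is more frequent, and the frequent misspelling gedeleted is still replaced because gedeletet is on the list of correct forms.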

How to use the library

Clone this repository into the directory of your Python project. Then, in your script, import the fix_participle_dt function:

from naive_dt_fix_py.naive_dt_fix import fix_participle_dt

You can now use the fix_participle_dt function. It has four arguments, two of which are optional:

| parameter | type | description | example |
| --- | --- | --- | --- |
| df | pd.DataFrame | the dataframe which contains the column you want to correct | / |
| column | str | the name of the column you want to correct | "participle" |
| ignore_list (optional) | list[str] | a list containing ambiguous forms which should be ignored | / |
| correct_list (optional) | list[str] | a list containing correct forms which should always take precedence over alternatives | / |

If you do not specify ignore_list or correct_list, the built-in defaults will be used.

The function returns the input data frame with the corrections applied.

df = fix_participle_dt(df, "deelwoord")

Each correction will be announced as console output:

[naive-dt-fix] Will replace 'gekuisd' with 'gekuist'
[naive-dt-fix] Will replace 'geblokkeert' with 'geblokkeerd'
[naive-dt-fix] Will replace 'verwarmt' with 'verwarmd'

Check this output for mistakes! The library is by NO means perfect!
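To make that check easier, the announced replacements can be turned into a list of pairs for review. The sketch below is not part of the library: parse_replacements is a hypothetical helper, and it assumes you have saved the console output as text in exactly the format shown above.

```python
import re
from typing import List, Tuple

# Matches lines such as: [naive-dt-fix] Will replace 'gekuisd' with 'gekuist'
LOG_PATTERN = re.compile(r"^\[naive-dt-fix\] Will replace '(.+)' with '(.+)'$")


def parse_replacements(console_output: str) -> List[Tuple[str, str]]:
    """Extract (original, replacement) pairs from saved console output."""
    pairs = []
    for line in console_output.splitlines():
        match = LOG_PATTERN.match(line.strip())
        if match:  # silently skip unrelated lines
            pairs.append((match.group(1), match.group(2)))
    return pairs


captured = """\
[naive-dt-fix] Will replace 'gekuisd' with 'gekuist'
[naive-dt-fix] Will replace 'geblokkeert' with 'geblokkeerd'
[naive-dt-fix] Will replace 'verwarmt' with 'verwarmd'
"""

pairs = parse_replacements(captured)
for old, new in pairs:
    print(f"{old} -> {new}")
```

Any pair that turns out to be wrong is a candidate for your own ignore_list or correct_list on the next run.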

Future work

  • Extend the built-in lists by collecting all verbal past participles in the SoNaR corpus and checking them for mistakes.