/deep-german

Classification of German nouns by genders using different RNN, CNN, and MLP architectures.

Primary LanguagePythonApache License 2.0Apache-2.0

DeepGerman project is an attempt to infer the gender (masculine, feminine, or neutral) of a German noun from its raw character-level representation. Several RNN, CNN, and MLP models of different architecture were trained and evaluated manually (by entering "words") and automatically (by randomly generated "words"). RNN input is one-hot representation of subsequent word characters at each time step. CNN input is stacked one-hot vectors of word characters, on which 1D-convolution is done (along characters dimension). MLP input is concatenated one-hot vectors of all characters. The output of both model types is three-class softmax for three genders. The project is implemented using TensorFlow. Done in a team with Rahul Bohare, Mesut Kuscu, and Aysim Toker.

Runnable scripts:

  • rnn_deep_german.py - configurable RNN trainer (CL arguments specified in the script).
  • cnn_deep_german.py - configurable CNN trainer (CL arguments specified in the script).
  • mlp_deep_german.py - configurable MLP trainer (CL arguments specified in the script).
  • evaluate_manual.py - manual model evaluation by entering a word and observing the inferred gender probabilities.
  • evaluate_auto.py - automatic model evaluation by generating multiple random words of variable length with fixed endings corresponding to a given gender (e.g. "-ung" for feminine) and observing the statistics of inferred gender classes for each ending.

The model achieving 96.24% classification accuracy on the test set - 2-layer LSTM with 128 hidden units in each layer, trained with dropout and batch size of 128 - is available in "results" folder (training log in "/results/logs", TF checkpoint in "/results/models"). The results of automatic evaluation of this model (obtained by running evaluate_auto.py) are shown below. 10,000 random words of varying length have been generated per gender ending. The numbers in three columns of the table show the percentage of the words classified as masculine, feminine, and neutral for each of the endings:

masculine endings
-------------------------------------
-ant       96.36     1.96      1.68      
-ast       93.41     3.84      2.75      
-er        89.72     2.35      7.93      
-ich       75.00     0.28      24.72     
-eich      89.94     0.15      9.91      
-ig        75.64     0.10      24.26     
-eig       75.27     0.25      24.48     
-or        83.17     2.13      14.70     
-us        91.32     2.92      5.76      
-ismus     100.00    0.00      0.00      

feminine endings
-------------------------------------
-anz       22.27     76.52     1.21      
-e         18.22     75.42     6.36      
-enz       4.03      93.92     2.05      
-heit      10.33     89.29     0.38      
-ie        6.68      90.36     2.96      
-schaft    2.46      97.22     0.32      
-sion      0.08      98.62     1.30      
-tät       6.87      85.96     7.17      
-ung       7.38      91.13     1.49      
-ur        35.58     59.64     4.78      

neutral endings
-------------------------------------
-chen      3.21      0.00      96.79     
-lein      2.53      0.35      97.12     
-en        18.67     0.53      80.80     
-il        13.78     0.16      86.06     
-ing       10.73     2.13      87.14     
-ma        1.56      23.44     75.00     
-ment      7.92      0.11      91.97     
-nis       12.43     21.68     65.89     
-tum       2.80      0.07      97.13     
-um        17.71     0.24      82.05