Simple perceptron tagger trained using the NLTK on the NLCOW14 corpus.

Dutch tagger

Don't use this tagger for actual research or production! Use spaCy instead (faster, more reliable). I'm leaving this up only as educational material.

This repository contains a trained part-of-speech tagger for Dutch, as well as the code used to train it. (The file cowparser.py comes from this repository.) Don't use the tagger in a production environment unless you retrain it yourself on other data. The code just shows how the NLTK tagger works. For real work I recommend TreeTagger, Frog, or spaCy.

Requirements:

  • NLTK version 3.1
  • Python 3

Key facts:

  • The tagger was trained on the NLCOW14 corpus (which in turn was tagged using TreeTagger).
  • The accuracy is about 97% on held-out data from the same corpus.
  • The small model is trained on 2 million tokens, while the larger model is trained on 10 million tokens.
  • The larger model is slightly more accurate than the smaller one, but it is over three times as large.
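
The held-out accuracy above is a token-level figure: the share of tokens whose predicted tag equals the tag in the corpus. Below is a minimal sketch of that computation, assuming gold data in the same list-of-(word, tag) format that the tagger returns. The names token_accuracy, dummy_tag, and held_out are hypothetical and not part of this repository; dummy_tag stands in for tagger.tag so the example runs without a trained model.

```python
def token_accuracy(tag_fn, gold_sentences):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sentences:
        # Strip the gold tags and let the tagger predict them again.
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


def dummy_tag(words):
    # Hypothetical stand-in for tagger.tag: labels every token as a plural noun.
    return [(word, 'nounpl') for word in words]


# Tiny made-up held-out set, using tags from the examples in this README.
held_out = [
    [('Alle', 'det__indef'), ('vogels', 'nounpl')],
    [('nesten', 'nounpl'), ('.', '$.')],
]

accuracy = token_accuracy(dummy_tag, held_out)
print(accuracy)
```

With a loaded model you could pass tagger.tag instead of dummy_tag. NLTK 3.1 taggers also expose tagger.evaluate(gold_sentences), which computes the same kind of score.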

How to use the tagger.

First run bash create_models.sh to create the models. Then load one of them and tag a sentence with the following code.

from nltk.tag.perceptron import PerceptronTagger

# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')

# Tag a tokenized sentence (a list of strings) and print the result.
print(tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split()))

Result:

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
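
Because the result is a plain list of (word, tag) tuples, it can be post-processed with ordinary Python. The following sketch groups the words of the sentence above by tag; the variable names tagged, words_by_tag, and plural_nouns are illustrative only.

```python
from collections import defaultdict

# The tagger output shown above, copied in so the example runs without a model.
tagged = [('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'),
          ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'),
          ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'),
          ('jij', 'pronpers'), ('.', '$.')]

# Map each tag to the words that received it, in sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

plural_nouns = words_by_tag['nounpl']
print(plural_nouns)  # ['vogels', 'nesten']
```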

If the text is not tokenized yet, you can use the built-in tokenizer from the NLTK. Be sure to download the NLTK punkt data first, for example with nltk.download('punkt'):

import nltk.data
from nltk.tokenize import word_tokenize

sent_tokenizer = nltk.data.load('tokenizers/punkt/dutch.pickle')
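
def tokenize(text):
    """Yield one list of tokens for each sentence in the text."""
    for sentence in sent_tokenizer.tokenize(text):
        # Pass the language so word_tokenize does not fall back to English rules.
        yield word_tokenize(sentence, language='dutch')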

sentences = tokenize('Alle vogels zijn nesten begonnen, behalve ik en jij. Waar wachten wij nu op?')

for sentence in sentences:
    print(tagger.tag(sentence))

Result:

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
[('Waar', 'pronadv'), ('wachten', 'verbprespl'), ('wij', 'pronpers'), ('nu', 'adv'), ('op', 'adv'), ('?', '$.')]