crf-pos-tagger

Conditional random fields part-of-speech tagger


Introduction

The goal of the project is to implement a part-of-speech (POS) tagger using structured learning.

This project uses the Conditional Random Fields (CRF) modeling method and the sklearn-crfsuite implementation of it.

The source code of the project can be found on GitHub.

Baseline

The basic implementation of the tagger is similar to the sklearn-crfsuite tutorial, but it uses different datasets, feature sets, and labels.

Datasets

crf-pos-tagger uses the pos_train.conll dataset for training the model and pos_test.conll for evaluating the results.

The datasets contain Twitter messages as (POS tag, token) pairs, with messages separated by empty lines. A token can be a word, mention, URL, hashtag, number, punctuation mark, special symbol, and so on. Tweets are not very good in terms of correct spelling (ppl, u, ill, etc.) and absolutely awful in terms of letter case (i LOVE U sO MuCH). It is also pretty hard to separate sentences: a sentence may end with a period or it may not. That is why crf-pos-tagger works with whole tweets instead of sentences.
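Concretely, each non-empty line is a tab-separated (tag, token) pair and an empty line ends a tweet. A hypothetical two-tweet excerpt (tags and tokens invented purely for illustration) could look like:

V	love
O	u

N	ppl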

Parser

For such a dataset, and under the tweet ~== sentence assumption, it is pretty easy to write a parser that generates a list of lists of pairs.

def parse_file(filename):
    # Parses a CoNLL-style file into a list of tweets,
    # each tweet being a list of (token, tag) pairs.
    sentences = []
    s = []
    with open(filename, 'r') as f:
        for line in f:
            if line.strip():
                # A non-empty line holds a tab-separated (tag, token) pair.
                tag, token = line.strip().split('\t')
                s.append((token, tag))
            else:
                # An empty line marks the end of a tweet.
                sentences.append(s)
                s = []
    if s:
        # Keep the last tweet if the file does not end with an empty line.
        sentences.append(s)
    return sentences
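
For example, loading the training set described above:

train_sents = parse_file('pos_train.conll')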

Algorithms

sklearn-crfsuite provides five algorithms:

– ‘lbfgs’ - Gradient descent using the L-BFGS method
– ‘l2sgd’ - Stochastic Gradient Descent with L2 regularization term
– ‘ap’ - Averaged Perceptron
– ‘pa’ - Passive Aggressive (PA)
– ‘arow’ - Adaptive Regularization Of Weight Vector (AROW)

The choice of algorithm wasn’t based on implementation details. Playing with different parameter values and feature sets showed that the following configuration gives one of the best results for the project’s needs:

import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,  # coefficient for L1 regularization
    c2=0.1,  # coefficient for L2 regularization
)
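
With this object, training is a single call; X_train and y_train are illustrative names assumed to hold per-tweet feature sequences and tag sequences built with the features described below:

crf.fit(X_train, y_train)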

Evaluation

Evaluation is straightforward: crf-pos-tagger compares the tags generated by the model with the tags written in pos_test.conll for each token, and calculates the number of matched tags divided by the total number of tokens.
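
A minimal sketch of this computation, assuming X_test and y_test hold the feature sequences and gold tag sequences parsed from pos_test.conll (names are illustrative):

y_pred = crf.predict(X_test)
matched = sum(p == t
              for pred, true in zip(y_pred, y_test)
              for p, t in zip(pred, true))
total = sum(len(tags) for tags in y_test)
print(matched / total)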

The default implementation without features yields a match rate of 0.1053.

Improvements

After implementing the basics, it is necessary to add some features to the model to improve its accuracy.

| feature set               | result |
| ------------------------- | ------ |
| without features          | 0.1053 |
| +word suffix              | 0.7636 |
| +mention                  | 0.7756 |
| +hashtag                  | 0.7859 |
| +lowercased suffix        | 0.8042 |
| +urls                     | 0.8082 |
| +word itself              | 0.8203 |
| +number                   | 0.8211 |
| +more lowercased suffixes | 0.8412 |
| +is title                 | 0.8519 |
| +is upper                 | 0.8537 |

Word suffix

The first, most obvious and intuitive idea is to use the suffix of the word to determine the part of speech. Trying different suffix lengths shows that the best score is reached with the last four characters.

'word[-4:]': word[-4:],

Mention and hashtag

These are obvious features for a dataset based on tweets; they help the model find Twitter-specific parts of speech. 66 and 26 lines out of 2361 contain @ and # respectively.

'mention': word.startswith('@') and len(word) > 1,
'hashtag': word.startswith('#') and len(word) > 1,

Lowercased suffix

This improvement also uses knowledge of the dataset’s nature. Twitter users don’t care about chAracTeR case. That is why a lowercased suffix can increase accuracy.
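
The text does not show this feature’s code; by analogy with the four-character suffix above, a plausible version would be:

'word.lower()[-4:]': word.lower()[-4:],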

Urls

Another Twitter-specific part of speech is a URL.

import re

def is_url(s):
    # Regex adapted from https://gist.github.com/gruber/249502#gistcomment-6465
    return bool(re.match(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', s))

'url': is_url(word),

Word itself

Why not? It seems it may help in some cases, and it actually does help a lot. Of course, we use the lowercased version of the word.

'word.lower()': word.lower(),

Number

Another binary feature, which simply emits True if the token is a number.

def is_number(s):
    # float() accepts integers, decimals, and scientific notation.
    try:
        float(s)
        return True
    except ValueError:
        return False

'number': is_number(word),

More lowercased suffixes

Adding more lowercased suffixes of different lengths significantly improves the model’s score.

'word[-3:]': word.lower()[-3:],
'word[-2:]': word.lower()[-2:],
'word[-1:]': word.lower()[-1:],

Is title

Many users don’t care about the case of their letters, but some people do. The case of the characters probably correlates with the part of speech in some cases.

'word.istitle()': word.istitle(),

Is upper

This can be useful for recognizing abbreviations or the token I, for example.

'word.isupper()': word.isupper(),
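
Putting the features together, a minimal sketch of a per-token feature function might look like this (the word2features name and signature follow the sklearn-crfsuite tutorial convention; treat the exact composition as illustrative):

def word2features(sent, i):
    # sent is a list of (token, tag) pairs for one tweet; i is the token index.
    word = sent[i][0]
    return {
        'word[-4:]': word[-4:],
        'word[-3:]': word.lower()[-3:],
        'word[-2:]': word.lower()[-2:],
        'word[-1:]': word.lower()[-1:],
        'word.lower()': word.lower(),
        'mention': word.startswith('@') and len(word) > 1,
        'hashtag': word.startswith('#') and len(word) > 1,
        'url': is_url(word),
        'number': is_number(word),
        'word.istitle()': word.istitle(),
        'word.isupper()': word.isupper(),
    }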

Other improvements and features

Some more advanced techniques can be used to improve the model’s accuracy. For example, some attributes of neighboring tokens can be added to the features (if the previous token is a number, the probability that the current token is a noun is higher, maybe :).

The position in the sentence can also affect the results. At the beginning of a sentence, a token is more likely to be a noun or a preposition than another part of speech.
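
A hedged sketch of both ideas, assuming each token’s feature dict is built per tweet with access to the whole sentence sent and the token index i (the helper name is hypothetical):

def add_context_features(features, sent, i):
    # Attributes of the previous token.
    if i > 0:
        prev = sent[i - 1][0]
        features['-1:word.lower()'] = prev.lower()
        features['-1:number'] = is_number(prev)
    else:
        # Position information: this token starts the tweet.
        features['BOS'] = True
    if i == len(sent) - 1:
        features['EOS'] = True  # this token ends the tweet
    return features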

Contribution

This is a study project and it is not ready for any kind of production use, but feel free to contribute via a PR, an issue, a comment, or any other way.