redpony/cdec

tokenize-anything.sh on Italian


Hi,
I wanted to report an error that the tokenize-anything.sh script makes on Italian sentences: it doesn't split "C'è" ("There's").

This also applies to other contractions whose second part is "'è".

Examples of other contractions that should be split, but aren't:

l'uomo
all'interno
nell'obbligo

These involve articles. Before a vowel, the singular definite articles are spelled l'. Combined with prepositions, they yield all', dall', dell', nell', sull'. The feminine indefinite article is realized as un' before a vowel.
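
To make the expected behaviour concrete, here is a minimal Python sketch of the splitting rule described above. It is not the code that tokenize-anything.sh uses; the names ELIDED_PREFIXES and split_elisions are hypothetical, the prefix list only covers the forms mentioned in this issue, and keeping the apostrophe attached to the first token ("l' uomo") is an assumed convention.

```python
import re

# Elided forms mentioned in this issue. Illustrative only, not an
# exhaustive list of Italian elisions.
ELIDED_PREFIXES = ["l", "all", "dall", "dell", "nell", "sull", "un", "c"]

# Longest prefixes first so that "dall'" is tried before "all'" or "l'".
# The lookahead requires a word character after the apostrophe, so a
# trailing apostrophe (as in "po'") is left untouched.
_ELISION_RE = re.compile(
    r"\b("
    + "|".join(sorted(ELIDED_PREFIXES, key=len, reverse=True))
    + r")'(?=\w)",
    re.IGNORECASE,
)


def split_elisions(text: str) -> str:
    """Insert a space after the apostrophe of a known elided prefix."""
    return _ELISION_RE.sub(r"\1' ", text)


if __name__ == "__main__":
    for sample in ["C'è", "l'uomo", "all'interno", "un'amica"]:
        print(split_elisions(sample))
```

With this rule, "C'è" becomes "C' è" and "all'interno" becomes "all' interno", while a phrase such as "un po' di" is unchanged because no word character follows the apostrophe.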