Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, and so on.
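The query expansion described above can be sketched roughly as follows. This is only an illustration, not the code used in the Global Glossary project: the function names build_index and expand_query and the sample pairs are hypothetical, and a real search backend would combine the expanded terms in its own query syntax.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each lemma to the set of tokens listed for it."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index


def expand_query(term, index):
    """Return the search term plus every token recorded under it as a lemma."""
    # index.get() is used so that unknown terms do not create empty entries.
    return sorted({term} | index.get(term, set()))


# Hypothetical sample data in (lemma, token) order, for illustration only.
sample_pairs = [("walk", "walking"), ("walk", "walked"), ("walk", "walks")]
index = build_index(sample_pairs)
expanded = expand_query("walk", index)
print(expanded)
```

A term that does not appear as a lemma is returned unchanged, so the search still runs on the original word.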

The lists are distributed as zipped plain text files. Each line contains one lemma/token pair in this order: lemma, tab character, token. The files are encoded in UTF-8 with Windows-style line breaks.
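A minimal sketch of reading this format in Python, assuming the archive has already been unzipped. The names read_pairs and load_pairs are hypothetical, the "utf-8-sig" codec is only a defensive choice in case a file starts with a byte order mark, and skipping lines without a tab is likewise an assumption rather than something the files are known to need.

```python
import io


def read_pairs(stream):
    """Yield (lemma, token) tuples from a text stream of lemma<TAB>token lines."""
    for line in stream:
        # Strip the Windows-style line break (CR LF) from the end of the line.
        line = line.rstrip("\r\n")
        if not line:
            continue
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # assumption: silently skip lines that contain no tab
        yield lemma, token


def load_pairs(path):
    """Read all pairs from an unzipped list file at a caller-supplied path."""
    # newline="" keeps line endings untouched so read_pairs can strip them.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(read_pairs(handle))


# In-memory sample with CR LF line breaks, standing in for a real file.
sample = io.StringIO("walk\twalking\r\nwalk\twalked\r\n")
pairs = list(read_pairs(sample))
print(pairs)
```

Stripping the trailing carriage return explicitly matters when the files are processed on systems that do not translate Windows-style line breaks; otherwise every token would end in a stray control character.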

Asturian (ast) (108,792 pairs)
Bulgarian (bg) (30,323 pairs)
Catalan (ca) (591,534 pairs)
Czech (cs) (36,400 pairs)
English (en) (41,760 pairs)
Estonian (et) (80,536 pairs)
French (fr) (224,002 pairs)
Galician (gl) (392,856 pairs)
German (de) (358,473 pairs)
Hungarian (hu) (39,898 pairs)
Irish (ga) (415,502 pairs)
Manx Gaelic (gv) (67,177 pairs)
Italian (it) (341,074 pairs)
Persian/Farsi (fa) (6,273 pairs)
Polish (pl) (3,296,232 pairs)
Portuguese (pt) (850,264 pairs)
Romanian (ro) (314,810 pairs)
Russian (ru) (537,810 pairs)
Scottish Gaelic (gd) (51,624 pairs)
Slovak (sk) (858,414 pairs)
Slovene (sl) (99,063 pairs)
Spanish (es) (497,560 pairs)
Swedish (sv) (675,137 pairs)
Ukrainian (uk) (193,703 pairs)
Welsh (cy) (359,224 pairs)

Licence

Available under the Open Database License

Sources

Various Hunspell dictionaries from the OpenOffice.org website
Deutsches Morphologie-Lexikon by Daniel Naber
Lexique by Boris New and Christophe Pallier
e_lemma.txt by Yasumasa Someya
Multext East (only those morphological lexicons that are under a free licence are used)
Morphological dictionaries from FreeLing
SALDO morphological lexicon
Irish National Morphology Database
Various lists by Kevin Scannell
OpenRussian.org
