Presage

This is a fork of Presage, developed further so that it can be incorporated into the Sailfish OS keyboard as a prediction module.

This fork should support the same platforms as the upstream. However, it is developed on Linux and may lack testing on other platforms.

All changes introduced in the fork are listed at https://github.com/sailfish-keyboard/presage/compare/upstream...master

Main changes:

Addition of a MARISA-based predictor. Compared to the SQLite-backed predictor, the MARISA-based predictor is much faster and requires databases that are several times smaller.
Addition of a Hunspell predictor.
Addition of a forget-word API.
Fixes required to make it work on Sailfish OS (loading empty configurations and others).
Conversion of floating-point numbers in the XML configuration is performed in the "C" locale.

See the upstream README for a general introduction and build instructions.

Generation of n-gram database for MARISA-based predictor

There are different ways to generate n-gram databases. One is to use the included text2ngram tool and build the database by processing a text corpus with it. An alternative is to use NLTK or some other package. Finally, you could take an existing database and convert it to a suitable format. Here, the text2ngram and NLTK approaches are covered.

In general, it is advisable to clean the corpus before generating the n-gram database. It is also suggested not to store numbers in that database, since they are probably of little use for text prediction. To check which characters are in your corpus, you could use the utils/charmap.py script.
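
As an illustration of the cleaning step, here is a minimal Python sketch that lists the characters found in a text and drops tokens containing digits. It is not the bundled utils/charmap.py; the function names char_inventory and drop_number_tokens and the sample sentence are made up for this example, and a real corpus would likely need language-specific rules on top of this.

```python
import unicodedata
from collections import Counter


def char_inventory(text):
    """Count how often each character occurs in the text."""
    return Counter(text)


def drop_number_tokens(text):
    """Remove whitespace-separated tokens that contain any digit.

    Note: split/join also collapses line breaks and repeated spaces,
    so apply this line by line if the line structure matters.
    """
    kept = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    return " ".join(kept)


# Hypothetical sample text; replace with the contents of your corpus.
sample = "Call me at 555 or in 2019, naïve café!"
cleaned = drop_number_tokens(sample)

# Print each character with its count and Unicode name, most frequent first.
for ch, n in char_inventory(cleaned).most_common():
    print(f"{ch!r}\t{n}\t{unicodedata.name(ch, 'UNKNOWN')}")
```

Unexpected entries in the printed inventory (control characters, stray symbols, characters from another script) are a hint that the corpus needs more filtering before n-grams are counted.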

n-gram database by text2ngram

To generate an n-gram database for the MARISA-based predictor, first make an SQLite-based n-gram database. For that, use the provided text2ngram utility and build sequential n-gram tables in SQLite format. For example,

for i in 1 2 3; do text2ngram -n "$i" -l -f sqlite -o database_aa.db mytext.filtered; done

will generate a database covering the 1-, 2-, and 3-gram cases.

With the SQLite database ready, run the Python script utils/sqlite2marisa.py to convert the n-gram database. For example,

utils/sqlite2marisa.py database_aa.db database_aa

If needed, the MARISA database can be reduced in size by cutting off n-grams using the threshold command-line option of the converter.

Note that the endianness of the system generating the database and of the device on which you plan to use it should be the same.
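
To see why byte order matters, the following Python sketch prints the byte order of the machine it runs on and shows how the same 32-bit integer is misread when its bytes are interpreted in the other order. It only illustrates the general issue and makes no claim about how the MARISA files are laid out internally.

```python
import struct
import sys

# Byte order of the machine running this script: 'little' or 'big'.
print("byte order of this machine:", sys.byteorder)

value = 1
little = struct.pack("<I", value)  # little-endian bytes: 01 00 00 00
big = struct.pack(">I", value)     # big-endian bytes:    00 00 00 01
native = struct.pack("=I", value)  # whatever this machine uses

# Interpreting little-endian bytes as big-endian yields a different number.
misread = struct.unpack(">I", little)[0]
print("written as", value, "but misread as", misread)
```

If Python is available on both the build machine and the target device, comparing the first printed line on each is a quick way to check that they match.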

n-gram database by NLTK

Natural Language Toolkit (NLTK) is a Python library packaging many related tools. For n-gram database generation, the important steps are to split the text into words, generate all n-grams present in it, and count their occurrences. Since each language has its own character set and tokenization rules, the scripts for n-gram generation have to be adjusted to the corpus and language used.

As an example, utils/process_en.py is the script used to parse an English corpus based on OANC and the OpenSubtitles dump from OPUS (described in Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012). The corpus was generated from the OANC text files and the OpenSubtitles dump parsed by https://github.com/inikdom/opensubtitles-parser, all concatenated together.

For those not familiar with NLTK, I suggest using utils/process_en.py as a base and adjusting it to your language. Please submit your language processing scripts to utils; that way, everyone can learn new approaches and adjust their own processing if needed. In the case of English, an existing word tokenizer was adjusted to keep contracted words together.

Processing of the corpus should generate an n-gram database as a UTF-8 text file with the following line format:

NGRAM word word ... word\tCOUNT

where NGRAM is the n-gram order, 1 (for word frequencies), 2 or more; the words of the n-gram are separated by a space from NGRAM and from each other; and COUNT is the number of times the n-gram occurred in the corpus. Note that COUNT is separated from the last word by a TAB. The order of n-grams in this file doesn't matter.
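
The line format can be produced with a few lines of Python. The sketch below counts 1- to 3-grams in an already tokenized text and formats them as described above. It is a simplified illustration, not a copy of utils/process_en.py: the helper names count_ngrams, format_line and write_ngrams are hypothetical, tokenization is reduced to splitting on whitespace, and n-grams are counted across sentence boundaries.

```python
from collections import Counter


def count_ngrams(tokens, max_n=3):
    """Count all n-grams of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def format_line(ngram, count):
    """Format one n-gram as 'NGRAM word ... word<TAB>COUNT'."""
    return f"{len(ngram)} {' '.join(ngram)}\t{count}"


def write_ngrams(counts, path):
    """Write all n-grams to a UTF-8 text file, one per line."""
    with open(path, "w", encoding="utf-8") as out:
        for ngram, count in counts.items():
            out.write(format_line(ngram, count) + "\n")


# Toy input; a real script would tokenize the cleaned corpus instead.
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens)
for ngram, count in counts.items():
    print(format_line(ngram, count))
```

Calling write_ngrams(counts, "ngrams.txt") would store the same lines in a file; the file name is only an example.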

The generated n-gram text file can be converted into a MARISA database using utils/ngramtxt2marisa. See its help for details.

Packaging for Sailfish OS

As soon as the database is ready, it is easy to package it for Sailfish using the provided script packaging/sailfish-language/package-language.sh. For that, you need a Linux PC with rpmbuild and sed available in the path. Note that rpmbuild is also available for Linux distributions that don't use RPMs for native packaging.

To create an RPM with Presage language support, run

packaging/sailfish-language/package-language.sh Language langcode database-directory version

where

Language: Specify the language in English, starting with a capital letter, e.g. 'Estonian'
langcode: Specify the language code in the same notation as Hunspell, e.g. 'en_US'
database-directory: Path to the directory with the MARISA-formatted database
version: Version of the language package, e.g. '1.0.0'

When finished, the language support will be packaged into an RPM in the current directory. For example, the Estonian database is packaged using

packaging/sailfish-language/package-language.sh Estonian et_EE database_et 1.0.0

Hunspell dictionaries

Until Presage fully supports conversion between encodings, the Hunspell dictionary is expected to be in the same encoding as the input. Hence, for Sailfish, it is recommended to convert the dictionary to UTF-8.

For that, one can use iconv on Linux. First, check the encoding of the dictionary by examining the first line of the affix file. In the case of Estonian (et_EE.aff), it is

SET ISO8859-15

Then run the conversions:

iconv -f ISO-8859-15 -t UTF8 /usr/share/hunspell/et_EE.aff -o et_EE.aff
iconv -f ISO-8859-15 -t UTF8 /usr/share/hunspell/et_EE.dic -o et_EE.dic

and in the new affix file replace the first line with SET UTF-8.
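
Where iconv is not at hand, the same two steps (re-encoding and rewriting the SET line) can be sketched in Python. The function convert_to_utf8 is a hypothetical helper written for this example, and the sample affix content below is made up; a real run would read the bytes from the .aff and .dic files and write the result back to new files.

```python
import re


def convert_to_utf8(data, source_encoding, is_affix=False):
    """Re-encode Hunspell file bytes as UTF-8.

    For affix files, the first SET line is also rewritten to 'SET UTF-8'.
    """
    text = data.decode(source_encoding)
    if is_affix:
        text = re.sub(r"^SET[ \t]+\S+", "SET UTF-8", text,
                      count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


# Made-up affix file content, encoded the way the original file would be.
sample_aff = "SET ISO8859-15\nTRY aeiouõäöü\n".encode("iso-8859-15")

converted = convert_to_utf8(sample_aff, "iso-8859-15", is_affix=True)
print(converted.decode("utf-8"))
```

The dictionary (.dic) file would go through the same function with is_affix left at False, since it has no SET line to update.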

After conversion, it is also possible to use the script packaging/sailfish-language/package-hunspell.sh to provide Hunspell dictionaries. The script packages the affix and dictionary files in a way that makes them simple to use by any application that uses the Hunspell library, including the Presage Hunspell predictor. See the script's help for details.