Farsi-Verb-Tokenizer

Tokenizes Farsi Verbs.

Usage:

perl farsi-verb-tokenizer.perl [-d resource-directory] input-file [pattern_id]

resource-directory: The directory where resource files are located. This directory by default is "./resource".

input-file: A list of sentences to be tokenized.

pattern_id: The id of the transformation to be applied, a number between 0 and the number of transformations (that is the number of lines in "transformation.tsv") minus 1, inclusive.

output: The list of tokenized sentences.

If no pattern_id is passed, all the transformations will be applied to each individual sentence one by one. The transformations in "resource/transformations.tsv" are sorted from largest to smallest, so that if a larger transformation matches (and hence transforms) a substring of a sentence, smaller transformations could no longer affect that substring.

Example:

perl ./script/farsi-verb-tokenizer.perl test-input.txt > test-output.txt

The output should be exactly the same as test-tokenized.txt. You can check that running the following diff command:

diff test-output.txt test-tokenized.txt

mehdi-manshadi/Farsi-Verb-Tokenizer

Farsi-Verb-Tokenizer