
reldi-tokeniser

A tokeniser developed inside the ReLDI project. It currently supports five languages -- Slovene, Croatian, Serbian, Macedonian and Bulgarian -- and two modes -- standard and non-standard text.

Usage

Command line

$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3	kaj
1.1.2.5-7	sad
1.1.3.9-9	s
1.1.4.11-13	tim
1.1.5.14-14	.

1.2.1.15-17	daj
1.2.2.19-20	se
1.2.3.22-27	nasmij
1.2.4.29-31	^_^
1.2.5.32-32	.
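
Each line of the default output pairs an index with a token. Judging from the example above, the index encodes paragraph, sentence and token counters plus 1-based character offsets (paragraph.sentence.token.start-end); note that this interpretation is inferred from the example, not stated by the tool. A minimal parsing sketch:

# Parse one line of the default output format.
# NOTE: the field semantics (paragraph.sentence.token.start-end) are
# inferred from the example above -- verify before relying on them.
def parse_line(line):
    index, token = line.rstrip('\n').split('\t')
    par, sent, tok, span = index.split('.')
    start, end = span.split('-')
    return {'paragraph': int(par), 'sentence': int(sent),
            'token_index': int(tok), 'start': int(start),
            'end': int(end), 'token': token}

print(parse_line('1.1.1.1-3\tkaj'))
# {'paragraph': 1, 'sentence': 1, 'token_index': 1, 'start': 1, 'end': 3, 'token': 'kaj'}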


The language is a positional argument, while tokenisation of non-standard text, tagging and lemmatisation of symbols and punctuation, and different output formats are optional arguments.

$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}

Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and
Bulgarian

positional arguments:
  {sl,hr,sr,mk,bg}   language of the text

optional arguments:
  -h, --help         show this help message and exit
  -c, --conllu       generates CONLLU output
  -b, --bert         generates BERT-compatible output
  -d, --document     passes through ConLL-U-style document boundaries
  -n, --nonstandard  invokes the non-standard mode
  -t, --tag          adds tags and lemmas to punctuations and symbols

Python module

# string mode
import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'

output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)

# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser

reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
# the object interface expects a list of lines, each ending in a newline
list_of_lines = [el + '\n' for el in text.split('\n')]
conllu_output = reldi.run(list_of_lines, mode='object')

The Python module has two mandatory parameters, text and language. The optional parameters are conllu, bert, document, nonstandard and tag.
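
The remaining parameters map onto keyword arguments in the same way. Below is a minimal sketch combining CoNLL-U output with document boundaries in string mode, mirroring the command-line example in the next section; the conllu and document keyword names are assumed from the parameter list above:

import reldi_tokeniser

# two documents separated by CoNLL-U-style '# newdoc id =' lines
text = '''# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš'''

# assumption: conllu/document keyword names taken from the parameter list above;
# conllu=True selects CoNLL-U output, document=True passes the
# '# newdoc id =' lines through, preserving document structure
output = reldi_tokeniser.run(text, 'hr', nonstandard=True, conllu=True, document=True)
print(output)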

CoNLL-U output

The tokeniser can also produce CoNLL-U output (flag -c/--conllu). If the additional -d/--document flag is given, the tokeniser passes through lines starting with # newdoc id = to preserve document structure.

$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	_

# newpar id = 2
# sent_id = 2.1
# text = haha
1	haha	_	_	_	_	_	_	_	_

# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1	štaš	_	_	_	_	_	_	_	_
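
The SpaceAfter=No flags in the MISC column make the tokenisation reversible. A minimal sketch of reconstructing the raw sentence text from CoNLL-U token lines (this follows general CoNLL-U conventions rather than anything specific to this tokeniser):

def detokenise(conllu_lines):
    """Rebuild raw sentence text from CoNLL-U token lines using SpaceAfter=No."""
    pieces = []
    for line in conllu_lines:
        if not line or line.startswith('#'):
            continue  # skip comment and blank lines
        cols = line.split('\t')
        form, misc = cols[1], cols[9]
        pieces.append(form if 'SpaceAfter=No' in misc else form + ' ')
    return ''.join(pieces).rstrip()

# sentence 1.1 from the example above, with tabs written as \t
sentence = [
    '1\tkaj\t_\t_\t_\t_\t_\t_\t_\t_',
    '2\tsad\t_\t_\t_\t_\t_\t_\t_\t_',
    '3\ts\t_\t_\t_\t_\t_\t_\t_\t_',
    '4\ttim\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No',
    '5\t.\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No',
]
print(detokenise(sentence))  # kaj sad s tim.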

Pre-tagging

The tokeniser can also pre-annotate the text on the part-of-speech (UPOS and XPOS) and lemma level (flag -t or --tag) where the tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mails, emoticons, emojis). The default output format for pre-tagging is CoNLL-U.

$ echo -e "kaj sad s tim.daj se nasmij ^_^. haha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	^_^	SYM	Xe	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	_

# sent_id = 1.3
# text = haha
1	haha	_	_	_	_	_	_	_	_
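
Downstream, the pre-tagged rows are easy to separate from the untagged ones, since only tokens with sufficient regex evidence receive a lemma, UPOS and XPOS. A minimal sketch, assuming run() returns the CoNLL-U output as a string when tag is set:

import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^. haha'
# assumption: run() returns the CoNLL-U output as a string when tag=True
conllu = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)

# keep only tokens the tokeniser committed to (UPOS column filled in)
for line in conllu.splitlines():
    if line and not line.startswith('#'):
        cols = line.split('\t')
        if cols[3] != '_':  # UPOS present -> pre-tagged token
            print(cols[1], cols[2], cols[3], cols[4])  # form, lemma, UPOS, XPOS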