Reynir
Natural language processing for Icelandic
Reynir is an experimental project that aims to extract processable information from Icelandic text on the web. It scrapes chunks of text from web pages, tokenizes them, parses them and analyzes the resulting parse trees to obtain statements of fact and relations between stated facts.
Reynir will be most effective for text that is highly factual, i.e. has a relatively high ratio of concrete concepts such as numbers, amounts, dates, person and entity names, etc.
Reynir is innovative in its support for the complex grammar of Icelandic - using cases, genders and number (singular/plural) applied respectively to nouns, verbs, adjectives and prepositions to disambiguate parses. Its Earley parser is fast and compact enough to make real-time while-you-wait analysis of web pages, as well as bulk processing, feasible.
If successful in its initial stages, Reynir may in due course be expanded, for instance:
- to make logical inferences from statements in its database;
- to find statements supporting or refuting a thesis; and/or
- to discover contradictions between statements.
Implementation
Reynir is written in Python 3.4, apart from the web front-end which has small amounts of JavaScript.
Reynir works in stages, roughly as follows:
- Web scraper, built on BeautifulSoup
- Tokenizer, relying on the BÍN database of Icelandic word forms
- Parser, using an Earley algorithm to parse text according to an unconstrained context-free grammar that may be ambiguous
- Parse forest analyzer and information extractor
Reynir contains a small web server that allows the user to type in any URL and have Reynir scrape it, tokenize it and display the result as a web page. The server runs on the Flask framework.
Reynir uses the official BÍN (Beygingarlýsing íslensks nútímamáls) database of Icelandic word forms to identify and tokenize words, and find their grammatical roots and forms. The database has been downloaded from the official BÍN website and stored in PostgreSQL.
The tokenizer divides text chunks into sentences and recognizes entities such as dates, numbers, amounts and person names, as well as common abbreviations and punctuation.
Grammar rules are laid out in a separate text file, Reynir.grammar
. The standard
Backus-Naur form has been
augmented with repeat specifiers for right-hand-side tokens (*
for 0..n instances,
+
for 1..n instances, or ?
for 0..1 instances). Also, the grammar allows for
compact specification of rules with variants, for instance due to cases, numbers and genders.
Thus, a single rule (e.g. NounPhrase/case/gender -> Adjective/case noun/case/gender
)
is automatically expanded into multiple rules (8 in this case, 4 cases x 2 genders) with
appropriate substitutions for right-hand-side tokens depending on their local variants.
The parser is an Earley parser as enhanced by Scott et al referencing Tomita. It parses ambiguous grammars without restriction and returns a compact Shared Packed Parse Forest (SPPF) of parse trees. If a parse fails, it identifies the token at which no parse was available.
File details
main.py
: Web serversettings.py
: Management of global settings and configuration datatokenizer.py
: Tokenizer, designed as a pipeline of Python generatorsgrammar.py
: Parsing of.grammar
files, grammar constructsparser.py
: Earley parser generic base classesbinparser.py
: Parser related subclasses for BIN (Icelandic word) tokensptest.py
: Parser test moduleReynir.conf
: Editable configuration file for the tokenizer and parserReynir.grammar
: A context-free grammar specification for Icelandic written in BNF with extensions for repeating constructs (*
,+
) and optional constructs (?
)
Copyright and licensing
The intent is to release Reynir under a GNU GPL license once the code stabilizes. However, while Reynir is still in early stages of development the code is copyright (C) 2015 by Vilhjalmur Thorsteinsson, all rights reserved.