GrammaticalFramework/gf-core

Spaces required for languages which do not have them

s5bug opened this issue · 4 comments

s5bug commented

In the REPL, say I run i -path=alltenses -retain alltenses/AllEng.gfo. Then I can parse with p "cheese is small" and get a result.

If I want to parse the same phrase in Japanese, with i -path=alltenses -retain alltenses/AllJpn.gfo, then p "チーズは小さい" does not work. I have to give the parser p "チーズ は 小さい" to get the output I want.

The input I am receiving does not have spaces in it. Is there any way to make this work without partitioning the text? Is this an issue with the RGL rather than GF? Is it an issue with using the REPL?

Hi @s5bug,

To answer the most immediate question: the spaces are a feature of the Japanese RGL.

More broadly, GF grammars are not good with spaces. While theoretically you could write a grammar where "チーズは小さい" is one token, in practice that would be very silly.

The usual options are:

  • Preprocessing your input before parsing, postprocessing after linearisation
  • Using the &+ token (see Angelov 2015) in the grammar and using exclusively the C-shell version of the GF executable. This part of my blog post has a brief explanation of the &+ token and the different runtimes, but don't hesitate to ask if the existing documentation isn't clear enough.

s5bug commented

Unfortunately I'm developing on Windows, so -cshell is unavailable to me to test with.

What do you mean by preprocessing? If you mean manual preprocessing, that won't be an option for me as my goal is to have this built into an autonomous service.

By preprocessing I mean that you can write a script in any other programming language that tokenises the Japanese input before passing it to the GF service. I'd imagine that NLTK has a Japanese tokenizer, but if your corpus is really simple, even a crude regex approach could probably work: for example, split on both sides of は, and split wherever a kanji follows a kana.
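To make the regex idea concrete, here is a minimal Python sketch of those two rules. The function name naive_tokenise and the Unicode ranges chosen for kana and kanji are my own assumptions for illustration, not anything GF or the RGL provides. It will also split words that merely contain は, so it is only suitable for very simple input.

```python
import re

# Assumed ranges: hiragana + katakana blocks (U+3040-U+30FF, includes ー)
# and the main CJK Unified Ideographs block (U+4E00-U+9FFF).
KANA = r"\u3040-\u30ff"
KANJI = r"\u4e00-\u9fff"


def naive_tokenise(text: str) -> str:
    """Insert spaces using two crude rules (hypothetical helper).

    1. Put a space on both sides of every は.
    2. Put a space wherever a kanji directly follows a kana.
    """
    # Rule 1: isolate は. This is naive: it also splits words containing は.
    text = text.replace("は", " は ")
    # Rule 2: zero-width match between a kana and a following kanji.
    text = re.sub(rf"(?<=[{KANA}])(?=[{KANJI}])", " ", text)
    # Collapse any repeated whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    # The result is what you would then hand to the GF parser, e.g. p "...".
    print(naive_tokenise("チーズは小さい"))
```

On the sentence from this issue, the sketch produces the spaced form that the Japanese RGL expects. For anything beyond toy input, a real morphological tokenizer is the safer choice.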

s5bug commented

Alright, I think I'm able to use Apache's OpenNLP then.