YodaQA in non-English languages
pasky opened this issue · 1 comments
We have recently been getting frequent enquiries about porting YodaQA to languages other than English. It's not insurmountable, as there are just a few language-specific components in YodaQA. This issue is meant to give an overview of what needs to be done (akin to #17 for domain porting) as well as possibly track progress on making YodaQA more language-independent (isolating everything language-specific in a better-defined way and improving the porting docs).
One option is to just translate the question to English and translate the answer back. :) For some applications, this may be feasible. We won't discuss this option further.
I'll edit the comment below later in case I forget something. For a general intro to how YodaQA works, the best resource right now is the slides linked from https://ailao.eu/yodaqa/science.html - we assume familiarity with them.
First, it's important to come up with a set of gold standard questions in your language - that is, examples of typical user questions plus their correct answers. This matters mainly for clarifying your expectations of the QA system, evaluating how good it is, and finding the next weak points. We also use such a set in English to train YodaQA's machine learning classifiers, but for that the bare minimum is ~400 training questions, which is a lot of work. On the other hand, it should be fine not to retrain YodaQA at first during porting and just rely on the pretrained classifiers; then you need just a bunch of examples to start with (say, 50+).
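To make the "evaluating how good it is" part concrete, here is a minimal sketch of scoring a QA system against a gold standard set. It assumes each gold item is a question paired with a regex that matches acceptable answers, and that the system returns a ranked list of candidate answers. The names `evaluate` and `toy_answer_fn`, and the two Czech sample questions, are made up for illustration; this is not YodaQA's own evaluation tooling.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Assumption: a gold item is (question, regex matching acceptable answers).
GoldItem = Tuple[str, str]


def evaluate(gold: Iterable[GoldItem],
             answer_fn: Callable[[str], List[str]]) -> Tuple[float, float]:
    """Return (accuracy-at-1, mean reciprocal rank) over the gold set."""
    correct_at_1 = 0
    reciprocal_ranks = 0.0
    total = 0
    for question, answer_pattern in gold:
        total += 1
        pattern = re.compile(answer_pattern, re.IGNORECASE)
        # answer_fn returns candidates ordered from best to worst.
        for rank, candidate in enumerate(answer_fn(question), start=1):
            if pattern.search(candidate):
                if rank == 1:
                    correct_at_1 += 1
                reciprocal_ranks += 1.0 / rank
                break  # only the best-ranked correct answer counts
    if total == 0:
        return 0.0, 0.0
    return correct_at_1 / total, reciprocal_ranks / total


# Hypothetical stand-in for the QA system; replace with a call to your port.
def toy_answer_fn(question: str) -> List[str]:
    canned = {
        "Kdo napsal R.U.R.?": ["Karel Čapek", "Josef Čapek"],
        "Jaké je hlavní město Česka?": ["Brno", "Praha"],
    }
    return canned.get(question, [])


gold_set = [
    ("Kdo napsal R.U.R.?", r"Karel Čapek"),
    ("Jaké je hlavní město Česka?", r"Praha|Prague"),
]
accuracy, mrr = evaluate(gold_set, toy_answer_fn)
```

Tracking both numbers is handy during porting: accuracy-at-1 tells you what the user sees, while the reciprocal rank shows whether the right answer is at least being generated and just ranked poorly.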
We chiefly use Stanford Parser - if it supports your language, that greatly improves the situation. Another question is whether it supports deep parsing incl. constituents or only part-of-speech tagging. Either way, you'll probably need to review the question analysis classes FocusGenerator and SVGenerator, which contain grammatical rules for finding the question focus and selection verb. Basically, you'll probably want to rewrite the rules in these two classes, but don't worry - it's pretty easy to come up with some rules based on parses of your gold standard questions (you did collect 50+, right?) - even without deep parsing, just based on part-of-speech tags and word positions.
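As a rough illustration of what rules "just based on part-of-speech tags and word positions" can look like, here is a small sketch of a focus finder. It is not the logic in FocusGenerator; the function name `find_focus`, the three rules, and the use of Penn Treebank style tags are all assumptions made for the example, and you would swap in your tagger's tagset and rules derived from your own gold standard questions.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumption: Penn Treebank style tags; adjust to your tagger's tagset.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
WH_DETERMINER_TAGS = {"WDT", "WP$"}  # e.g. "which city", "whose book"
WH_PRONOUN_TAGS = {"WP", "WRB"}      # e.g. "who", "where", "when", "how"


def find_focus(tokens: List[Token]) -> Optional[str]:
    """Pick a question focus using only tags and word positions."""
    for i, (word, tag) in enumerate(tokens):
        if tag in WH_DETERMINER_TAGS:
            # Rule 1: "which [adjectives] NOUN..." -> last noun of the noun run.
            j = i + 1
            while j < len(tokens) and tokens[j][1].startswith("JJ"):
                j += 1
            head = None
            while j < len(tokens) and tokens[j][1] in NOUN_TAGS:
                head = tokens[j][0]
                j += 1
            if head is not None:
                return head
        if tag in WH_DETERMINER_TAGS or tag in WH_PRONOUN_TAGS:
            # Rule 2: otherwise the question word itself is the focus.
            return word
    # Rule 3: no question word at all ("Name the capital...") -> first noun.
    for word, tag in tokens:
        if tag in NOUN_TAGS:
            return word
    return None


which_question = [("Which", "WDT"), ("Czech", "JJ"), ("city", "NN"),
                  ("hosts", "VBZ"), ("the", "DT"), ("festival", "NN"),
                  ("?", ".")]
who_question = [("Who", "WP"), ("wrote", "VBD"), ("R.U.R.", "NNP"), ("?", ".")]
imperative = [("Name", "VB"), ("the", "DT"), ("capital", "NN"),
              ("of", "IN"), ("France", "NNP")]

focus_which = find_focus(which_question)
focus_who = find_focus(who_question)
focus_imperative = find_focus(imperative)
```

Running something like this over all your gold standard questions and eyeballing the misses is a quick way to find the next rule worth adding.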
For LAT (lexical answer type) handling, we use WordNet - it'd be good to have a port to your language. It's not 100% necessary, but if it's not available, another way of dealing with word similarity must be introduced for good functionality. You'll also need to edit the LATByFocus class to replace the English words (synsets) with your language's, and LATByQuantity to do the same thing.
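The kind of replacement meant here can be pictured as a per-language lookup table from question words to answer types. The sketch below is only that picture: the table contents, the coarse string labels (the real classes work with WordNet synsets), and the function name `lat_for_focus` are assumptions for illustration, not what LATByFocus actually contains.

```python
from typing import Dict, Optional

# Hypothetical per-language table: question word -> coarse answer type.
# A real port would map to synsets in a WordNet for the target language.
QUESTION_WORD_LATS: Dict[str, Dict[str, str]] = {
    "en": {"who": "person", "where": "location", "when": "time"},
    "cs": {"kdo": "person", "kde": "location", "kdy": "time"},
}


def lat_for_focus(focus: str, language: str) -> Optional[str]:
    """Return the answer type implied by a question-word focus, if any."""
    table = QUESTION_WORD_LATS.get(language, {})
    return table.get(focus.lower())


czech_who = lat_for_focus("Kdo", "cs")
english_where = lat_for_focus("Where", "en")
wrong_language = lat_for_focus("kdo", "en")
```

Keeping the words in a table like this, rather than inline in the code, is also one way to isolate the language-specific part so that the next port only has to add a new entry.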
We also rely on DBpedia (which you can get in most languages), Freebase (which should have some labels in many languages, but not in the endpoint we run!) and Wikipedia (obviously available in most languages). You can use only a subset of these resources too. If you want to answer questions based on texts, not just RDF knowledge bases, you'll need a parser that can output constituents, or rely on a named entity detector.
For entity linking, we use DBpedia in combination with a dataset that's part of the label-lookup repo - you'd need to replace that with your list of Wikipedia article labels. I'm not sure whether an equivalent of the crosswikis dataset (which we also use for label lookup) is available for other languages too.
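To give an idea of what a replacement label list has to support, here is a minimal sketch of looking up a question mention in a list of Wikipedia article labels, with diacritics-insensitive exact matching and a fuzzy fallback. It uses Python's standard `difflib` as a stand-in and does not reproduce how the label-lookup repo works; `normalize`, `build_index`, `lookup`, and the 0.8 cutoff are assumptions for illustration.

```python
import difflib
import unicodedata
from typing import Dict, List


def normalize(label: str) -> str:
    """Lowercase and strip diacritics so 'Čapek' and 'capek' compare equal."""
    decomposed = unicodedata.normalize("NFKD", label.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(article_labels: List[str]) -> Dict[str, str]:
    """Map normalized labels back to the original article labels."""
    return {normalize(label): label for label in article_labels}


def lookup(index: Dict[str, str], mention: str,
           cutoff: float = 0.8, limit: int = 3) -> List[str]:
    """Return article labels matching the mention, exact match first."""
    key = normalize(mention)
    if key in index:
        return [index[key]]
    # Fuzzy fallback: the most similar normalized labels above the cutoff.
    close = difflib.get_close_matches(key, list(index), n=limit, cutoff=cutoff)
    return [index[k] for k in close]


index = build_index(["Karel Čapek", "Praha", "Brno"])
exact_hit = lookup(index, "karel capek")
fuzzy_hit = lookup(index, "Prahaa")
no_hit = lookup(index, "xyz")
```

A linear fuzzy scan like this is fine for a toy list but will not scale to a full set of Wikipedia titles, so a real port would need a proper index behind the same kind of interface.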
Actually, as a result of a recent one-day hackathon, we have created a very basic port of YodaQA to Czech. This work is in the d/movies branch, so looking at the diff can give you some idea about the essentials - it's not that big a diff!
This localization does not use Wikipedia and Freebase, only DBpedia. One important divergence from the above is that we used external (unfortunately not openly available) tools for entity linking and part-of-speech tagging (deep parsing is not available for Czech). It's still a work in progress.