Add Korean, Indonesian, and Hebrew support
Support the above languages in Patapsco
I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline; perhaps Moses would work, but I'm not sure.
If I had a Korean IR test collection I wouldn't be asking this question ;-)
Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look and she thought it was reasonable. UD tokenization stats here: https://explosion.ai/blog/ud-benchmarks-v3-2
So short answer is that you can set the language code and Patapsco should just work for Korean.
One caveat: when Patapsco runs with multiple processes, there is a race condition the first time it tries to download a model. I need to restructure how models are downloaded automatically in the multiprocessing setting.
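For anyone hitting this in the meantime, here is a rough sketch of one common way to guard a first-time download across processes: an atomically created lock file plus a completion marker. This is not how Patapsco does it today; `ensure_model`, the `.complete` marker, and the `download` callback are hypothetical names made up for illustration.

```python
import os
import time
from pathlib import Path


def ensure_model(model_dir, download, poll_seconds=0.1, timeout=600.0):
    """Run download(model_dir) at most once, even if many processes call this.

    Hypothetical helper: `download` is any callable that fetches the model
    into the given directory. A process killed mid-download leaves a stale
    lock file behind; waiters then give up after `timeout` seconds.
    """
    model_dir = Path(model_dir)
    done_marker = model_dir / ".complete"
    lock_path = model_dir.with_name(model_dir.name + ".lock")
    model_dir.parent.mkdir(parents=True, exist_ok=True)

    deadline = time.monotonic() + timeout
    while not done_marker.exists():
        try:
            # O_EXCL makes file creation atomic, so exactly one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Another process holds the lock; wait and re-check the marker.
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {lock_path}")
            time.sleep(poll_seconds)
            continue
        try:
            # Re-check under the lock in case the download just finished.
            if not done_marker.exists():
                model_dir.mkdir(parents=True, exist_ok=True)
                download(model_dir)
                done_marker.touch()
        finally:
            os.close(fd)
            os.unlink(lock_path)
    return model_dir
```

A simpler workaround that avoids locking entirely is to run once with a single process so the model is already cached before starting a multi-process run.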
Sorry @isoboroff - forgot to tag you in my response
Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches.
NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.