Add Korean, Indonesian, and Hebrew support

Question

Opened this issue 2 years ago · 4 comments

Support the above languages in Patapsco

Answer 1 · 2022-11-22T18:15:41.000Z

I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline, and perhaps Moses would work, but I'm not sure.

If I had a Korean IR test collection I wouldn't be asking this question ;-)

Answer 2 · 2022-11-22T18:49:45.000Z

Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look and she thought it was reasonable. UD tokenization stats here: https://explosion.ai/blog/ud-benchmarks-v3-2

So the short answer is that you can set the language code and Patapsco should just work for Korean.

I will note that there is an issue when running Patapsco with multiple processes: the first time it tries to download a model, the processes can race each other. I need to restructure how the models get downloaded automatically in the multiprocessing setting.
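
Until that is restructured, one generic way to avoid this kind of race is to serialize the download behind a file lock, so only the first process fetches the model and the rest wait and then reuse it. The sketch below is not Patapsco code: `ensure_model`, `model_dir`, and the `download` callable are hypothetical names for illustration, and it assumes a Unix system because it relies on `fcntl`.

```python
import fcntl
import os
from pathlib import Path
from typing import Callable


def ensure_model(model_dir: Path, download: Callable[[Path], None]) -> Path:
    """Download a model into model_dir at most once, even across processes.

    `download` is a hypothetical callable that writes model files into the
    directory it is given; plug in whatever actually fetches the model.
    """
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    lock_path = model_dir.parent / (model_dir.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another process may have finished
            # the download while this one was waiting.
            if not model_dir.exists():
                tmp_dir = model_dir.parent / (model_dir.name + ".partial")
                tmp_dir.mkdir(exist_ok=True)
                download(tmp_dir)
                # Rename last so a half-written model is never visible.
                os.rename(tmp_dir, model_dir)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return model_dir
```

A simpler workaround in the same spirit is to run once with a single process so the model is already cached before starting the multiprocess job.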

Answer 3 · 2022-11-22T19:19:10.000Z

Sorry @isoboroff - forgot to tag you in my response

Answer 4 · 2022-11-23T13:10:34.000Z

Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches.

NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.