Support languages that need a tokenizer (Chinese, Japanese, etc.)
eromoe opened this issue · 7 comments
I think iepy needs a common interface for embedding a tokenizer, to support languages like Chinese and Japanese.
There is an old IE project with a GUI named GATE; it contains a pre-trained model and dataset, which may be helpful:
https://gate.ac.uk/sale/tao/splitch15.html#sec:misc-creole:language-plugins:chinese
Hello. The preprocessing pipeline can be customized to introduce a different tokenizer. See for instance:
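For illustration, a custom tokenizer for Chinese might look roughly like the sketch below. This is not the example originally linked above: it uses jieba for word segmentation, and it assumes (from reading the iepy source; details may vary by version) that a runner is a callable receiving an IEDocument and that set_tokenization_result() expects a list of (offset, token) pairs.

```python
# Sketch only: jieba is an assumed third-party segmenter, and the
# set_tokenization_result() contract is inferred from the iepy source.
import jieba

from iepy.preprocess.pipeline import BasePreProcessStepRunner, PreProcessSteps


class JiebaTokenizer(BasePreProcessStepRunner):
    step = PreProcessSteps.tokenization

    def __init__(self, override=False):
        self.override = override

    def __call__(self, doc):
        # Skip documents that are already tokenized, unless overriding.
        if not self.override and doc.was_preprocess_step_done(self.step):
            return
        # jieba.tokenize() yields (word, start, end) triples over the raw text.
        tokens = [(start, word) for word, start, end in jieba.tokenize(doc.text)]
        doc.set_tokenization_result(tokens)
        doc.save()
```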
Hello @francolq,
I have seen how to customise it in the docs:
```python
pipeline = PreProcessPipeline([
    CustomTokenizer(),
    CustomSentencer(),
    CustomLemmatizer(),
    CustomPOSTagger(),
    CustomNER(),
    CustomSegmenter(),
], docs)
pipeline.process_everything()
```
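With a custom tokenizer like the JiebaTokenizer sketched in the comment above, the assembly would look the same. The imports below are assumptions based on the scripts a generated iepy project contains; the exact paths may differ across versions:

```python
import iepy
iepy.setup(__file__)  # as generated project scripts do, to configure the instance

from iepy.data.db import DocumentManager  # assumed location of the docs manager
from iepy.preprocess.pipeline import PreProcessPipeline

docs = DocumentManager()  # iterates over the documents in the iepy database
pipeline = PreProcessPipeline([
    JiebaTokenizer(),  # the sketch from the earlier comment
    # ... custom sentencer, lemmatizer, POS tagger, NER and segmenter here
], docs)
pipeline.process_everything()
```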
Then I looked into the code; preprocess.tokenizer.TokenizeSentencerRunner does not seem to be used anywhere. And I found:
- one pipeline may have multiple runners
- one runner may or may not have a step

As I see it, it is not as simple as just adding a tokenizer, since some runners depend on each other. It is hard to customise without knowing the input and output format of each runner and step, and the design principles behind the runner API. (Currently I have to read the code and try to understand what it does, but due to knowledge and language limitations I may get stuck in some places.) I would like to help make iepy compatible with CJK languages if anyone could provide the API principles for writing runners. @machinalis @jmansilla
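To make the dependency between runners concrete, here is a sketch of a sentencer that consumes the tokenizer's output. The boundary format (a list of token indices starting at 0 and ending at len(doc.tokens)) is my reading of set_sentencer_result() in the iepy source, and the punctuation set is only a placeholder:

```python
from iepy.preprocess.pipeline import BasePreProcessStepRunner, PreProcessSteps

SENTENCE_ENDERS = {"。", "！", "？"}  # placeholder set of Chinese sentence enders


class ChineseSentencer(BasePreProcessStepRunner):
    step = PreProcessSteps.sentencer

    def __init__(self, override=False):
        self.override = override

    def __call__(self, doc):
        # This runner is "relative": it can only work after tokenization.
        if not doc.was_preprocess_step_done(PreProcessSteps.tokenization):
            return
        if not self.override and doc.was_preprocess_step_done(self.step):
            return
        boundaries = [0]
        for i, token in enumerate(doc.tokens):
            if token in SENTENCE_ENDERS:
                boundaries.append(i + 1)
        # The last boundary must point one past the final token.
        if boundaries[-1] != len(doc.tokens):
            boundaries.append(len(doc.tokens))
        doc.set_sentencer_result(boundaries)
        doc.save()
```

The general pattern seems to be: check preconditions, check whether the step was already done, compute, store the result via the matching set_<step>_result() setter, and save the document.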
@eromoe Right now I want to customize iepy for Chinese; could you give me a hand?
@YanWenqiang Sorry, I only needed the annotator and object binding from iepy. Since it was not easy to integrate Chinese, I have built my own by now.
@eromoe All right, thanks a lot. I have now run into the same trouble; I really need someone who could help me.