localminimum/QANet

Can it support Chinese?


I just changed nlp = spacy.blank("en") to nlp = spacy.blank("zh").
Is that OK?
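
For reference, the change in question is a one-line swap in the preprocessing script. A minimal sketch; depending on the spaCy version, the "zh" blank pipeline may need an external Chinese segmenter installed (jieba in spaCy v2):

```python
import spacy

# Swap the blank English pipeline for the blank Chinese one.
# In spaCy v2 this delegates segmentation to jieba, so install it first:
#   pip install jieba
nlp = spacy.blank("zh")

doc = nlp("今天天气很好。")  # hypothetical example sentence
print([token.text for token in doc])
```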

Hi, I haven't tried it on Chinese data, but I'm assuming a special preprocessing step is required, since Chinese text isn't easily separable into words and characters. The GloVe word vectors should also be replaced with Chinese word embeddings.
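
For instance, if the Chinese vectors come in the same whitespace-separated text format as GloVe, a loader could look like this (a sketch; load_word_vectors, path, and vocab are hypothetical names, not part of this repo):

```python
import numpy as np

def load_word_vectors(path, vocab):
    # Read GloVe-style text vectors: each line is "word v1 v2 ... vN".
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in vocab:
                embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings
```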

Yes, spaCy only supports word segmentation for Chinese. It doesn't support other basic NLP functions, such as POS tagging and NER; StanfordCoreNLP, however, can do these.
I wonder whether this model uses any features that need NLP functions beyond word segmentation. If so, I suppose I should switch to StanfordCoreNLP.
And yes, I used Chinese word2vec vectors from here.
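
For segmentation plus POS/NER, a sketch with the third-party stanfordcorenlp Python wrapper might look like this (the CoreNLP install path and the example sentence are assumptions):

```python
from stanfordcorenlp import StanfordCoreNLP  # pip install stanfordcorenlp

# Path to an unpacked CoreNLP release with the Chinese models jar (assumed).
nlp = StanfordCoreNLP(r'/path/to/stanford-corenlp-full', lang='zh')

sentence = '今天天气很好。'
print(nlp.word_tokenize(sentence))  # word segmentation
print(nlp.pos_tag(sentence))        # part-of-speech tags
print(nlp.ner(sentence))            # named entities
nlp.close()
```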

@zhongyuchen It's probably better to use fastText than word2vec :) https://fasttext.cc/docs/en/crawl-vectors.html
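
A sketch of loading those vectors with the official fasttext Python bindings (the local file name is an assumption based on their naming scheme):

```python
import fasttext  # pip install fasttext

# Assumes cc.zh.300.bin has been downloaded from the crawl-vectors
# page linked above.
model = fasttext.load_model("cc.zh.300.bin")

# fastText composes vectors from character n-grams, so even unseen words
# get a vector, unlike plain word2vec lookups.
vec = model.get_word_vector("天气")
print(vec.shape)  # (300,)
```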

I wonder if there are any Chinese datasets in the same format available for use.
I got the DuReader dataset from here, but found that its answers are not spans of the context.
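
For reference, the SQuAD format that this repo's preprocessing expects marks each answer as an exact span of the context via a character offset. A simplified sketch of one QA pair (the Chinese text is made up):

```python
# The real SQuAD files nest these under "data" -> "paragraphs" -> "qas".
example = {
    "context": "巴黎是法国的首都。",
    "question": "法国的首都是哪里？",
    "answers": [{"text": "巴黎", "answer_start": 0}],
}

# DuReader answers are free-form text instead, so they would have to be
# aligned to a context span (or the data loader rewritten) before training.
```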

@yangyuji12 You could try translating SQuAD paragraphs and questions into Chinese and then finding the answer span in the translated paragraph (which could be the most difficult part). The authors describe this approach in the "Data augmentation by backtranslation" section of the paper.
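
A sketch of that hardest step: after backtranslation, pick the context window that best matches the translated answer. The paper scores candidate spans with a character-level 2-gram overlap; this cruder version uses plain character overlap over fixed-length windows (the function name and example strings are hypothetical):

```python
def best_span(paragraph, answer):
    # Slide a window the length of the answer over the paragraph and keep
    # the window whose characters overlap the answer's most.
    n = len(answer)
    best, best_score = (0, n), -1.0
    for start in range(max(1, len(paragraph) - n + 1)):
        window = paragraph[start:start + n]
        shared = sum(min(window.count(c), answer.count(c)) for c in set(answer))
        score = shared / n
        if score > best_score:
            best, best_score = (start, start + n), score
    return best

paragraph = "巴黎是法国的首都，也是最大的城市。"  # hypothetical translation
answer = "法国首都"                               # hypothetical translated answer
start, end = best_span(paragraph, answer)
print(paragraph[start:end])
```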