拼音转汉字, convert pinyin to 汉字 using deep networks
As a light-weight example, training data are downloaded from the AI shell speech recognition corpus, found in http://openslr.org/33/. The transcripts rather than the audio data are used. A copy of the transcript file is found in the ./data folder
- The project uses pytorch (>=1.3.0) and torchtext. Python3.7 is recommended
- git clone https://github.com/ranchlai/pinyin2hanzi.git
- cd pinyin2hanzi
- virtualenv -p python3.7 py37
- source py37/bin/activate
- pip install -r requirements.txt
- Run process_ai_shell_transcript_sd.py to convert ai-shell transcripts from 汉字 to 带声调的拼音(pinyin with tones)
- Run train.py to train the model, or you can download the pretrained model and put it to the "model" subfolder
- Run inference_sd.py to do inference
Pretrained model using AI-shell transcript file can be downloaded from gooole drive
néng gòu yíng de bǐ sài zhēn de hěn kāi xīn
能够赢得比赛真的很开心
yě qǔ de le sān xiàn piāo hóng de chéng jī
也取得了三线飘红的成绩
guó yǒu qǐ yè bù bì yě wú xū jiè rù yíng lì xìng qiáng de shāng pǐn fáng kāi fā
国有企业不必也无需介入盈利性强的商品房开发
sī fǎ jiàn dìng jī gòu shì dú lì fǎ rén
司法鉴定机构是独立法人
bǎo bǎo zhòng wǔ diǎn èr jīn
宝宝重五点二斤
Some of the code borrowed from https://github.com/bentrevett/pytorch-seq2seq
Model architecture was designed by myself. It is very likely that the model looks partly the same as some existing works, I apologized for not citing them. Please let me know if you think I should cite some papers.