This project releases open-source natural language models from the Joint Lab of BAAI and JDAI. Unlike other open-source Chinese NLP models, we focus on basic models for dialogue systems, especially in the E-commerce domain. Our corpus is very large: we currently train on 42 GB of Customer Service Dialogue Data (CSDD), which contains about 1.2 billion sentences.
We provide a pre-trained BERT model and word embeddings. The tables below show the data we use.
Task | Data Source | Sentences | Sentence Pairs |
---|---|---|---|
BERT Pre-Training | Customer Service Dialogue Data (CSDD) | 1.2B | - |
FAQ | LCQMC, CSDD | - | 0.88M for fine-tuning, 80K for test |
Task | Data Source | Tokens | Vocabulary Size |
---|---|---|---|
Word Embedding Pre-Training | CSDD | 9B | 1M |
Download links for the models are listed below.
Model | Data Source | Link |
---|---|---|
BAAI-JDAI-BERT, Chinese | CSDD | JD-BERT for TensorFlow |
BAAI-JDAI-WordEmbedding | CSDD | JD-WORD-EMBEDDING with 300d |
The JD-BERT.tar.gz file contains the following items:
|—— BAAI-JDAI-BERT
|—— bert_model.ckpt.* # pre-trained weights
|—— bert_config.json # hyperparameters of the model
|—— vocab.txt # vocabulary for WordPiece
|—— JDAI-BERT.md & INTRO.md # summary and details
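As a quick sanity check after extraction, the released checkpoint and vocabulary can be inspected with standard TensorFlow utilities. The snippet below is a minimal sketch; the paths are assumptions based on the archive layout above.

```python
import tensorflow as tf

BERT_DIR = "BAAI-JDAI-BERT"  # assumed extraction path of JD-BERT.tar.gz

# List the tensors stored in the pre-trained checkpoint.
for name, shape in tf.train.list_variables(f"{BERT_DIR}/bert_model.ckpt"):
    print(name, shape)

# vocab.txt holds one WordPiece token per line.
with open(f"{BERT_DIR}/vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print("vocabulary size:", len(vocab))
```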
The JD-WORD-EMBEDDING.tar.gz file contains the following items:
|—— BAAI-JDAI-WORD-EMBEDDING
|—— JDAI-Word-Embedding.txt # word vectors; each line is a word followed by its whitespace-separated values
|—— JDAI-WORD-EMBEDDING.md & INTRO.md # summary and details
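The plain-text vectors can be loaded with a few lines of Python. The sketch below assumes each line holds a word followed by its 300 values and that the file path matches the layout above.

```python
import numpy as np

def load_vectors(path, dim=300):
    """Parse the whitespace-separated word-vector file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:  # skip a possible "count dim" header or malformed line
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

vecs = load_vectors("BAAI-JDAI-WORD-EMBEDDING/JDAI-Word-Embedding.txt")
print("loaded", len(vecs), "vectors")
```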
Masking | Dataset | Sentences | Training Steps | Device | Init Checkpoint | Init LR |
---|---|---|---|---|---|---|
WordPiece | CSDD | 1.2B | 1M | P40×4 | Google's BERT-Base weights | 1e-4 |
- Before pre-training, we preprocess the data with our own preprocessor, which applies some generalization steps.
- The init checkpoint we use is Google's BERT-Base <12-layer, 768-hidden, 12-heads, 110M parameters>, and our bert_config.json and vocab.txt are identical to Google's original settings.
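Because the configuration mirrors Google's BERT-Base, this can be checked directly against the released bert_config.json; the key names below follow Google's BERT config format, and the path is an assumption.

```python
import json

with open("BAAI-JDAI-BERT/bert_config.json") as f:
    cfg = json.load(f)

assert cfg["num_hidden_layers"] == 12     # 12-layer
assert cfg["hidden_size"] == 768          # 768-hidden
assert cfg["num_attention_heads"] == 12   # 12-heads
print("vocab size:", cfg["vocab_size"])
```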
We fine-tune on the training data of LCQMC, a QQ (question-question) dataset, and a QA (question-answer) dataset, training for just 2 epochs with an initial learning rate of 2e-5 on each dataset respectively. The QQ and QA datasets are extracted from other parts of CSDD.
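For reference, a comparable fine-tuning run can be launched with run_classifier.py from Google's BERT repository. The sketch below is an assumption about the exact setup: the task name, data directory, batch size, and sequence length are placeholders, while the learning rate and epoch count follow the settings above.

```python
import subprocess

BERT_DIR = "BAAI-JDAI-BERT"
subprocess.run([
    "python", "run_classifier.py",        # from github.com/google-research/bert
    "--task_name=LCQMC",                  # hypothetical; needs a matching DataProcessor
    "--do_train=true", "--do_eval=true",
    "--data_dir=./lcqmc",                 # placeholder data directory
    f"--vocab_file={BERT_DIR}/vocab.txt",
    f"--bert_config_file={BERT_DIR}/bert_config.json",
    f"--init_checkpoint={BERT_DIR}/bert_model.ckpt",
    "--max_seq_length=128",
    "--train_batch_size=32",
    "--learning_rate=2e-5",               # init learning rate for fine-tuning
    "--num_train_epochs=2.0",             # 2 epochs per dataset
    "--output_dir=./lcqmc_output",
], check=True)
```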
We evaluate our pre-trained model on the FAQ task with the test data of LCQMC, the QQ dataset, and the QA dataset.
Model | LCQMC | QQ Test | QA Test |
---|---|---|---|
BERT-wwm | 88.7 | 80.9 | 86.6 |
Our BERT | 88.6 | 81.9 | 87.5 |
LCQMC, QQ Test and QA Test are test sets containing 5K question pairs, 21K question pairs, and 54K question & answer pairs, respectively.
Window Size | Dynamic Window | Sub-sampling | Low-frequency Word Threshold | Iterations | Negative Sampling (SGNS) | Dim |
---|---|---|---|---|---|---|
5 | Yes | 1e-5 | 10 | 10 | 5 | 300 |
- Before training, we preprocess and segment the corpus into words with our own tools.
- We train the vectors with the Skip-Gram model with negative sampling (SGNS).
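For comparison, an equivalent Skip-Gram configuration expressed with gensim would look roughly like the sketch below; we trained with our own tools, so the corpus path and the use of gensim are assumptions, but the hyperparameters mirror the table above.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt (hypothetical): one pre-segmented sentence per line, tokens separated by spaces.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    sg=1,              # Skip-Gram with negative sampling (SGNS)
    vector_size=300,   # Dim
    window=5,          # Window Size (gensim shrinks it dynamically per token)
    sample=1e-5,       # Sub-sampling threshold
    min_count=10,      # Low-frequency word threshold
    negative=5,        # Negative samples
    epochs=10,         # Iterations over the corpus
)
model.wv.save_word2vec_format("JDAI-Word-Embedding.txt", binary=False)
```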
We show the top-3 similar words for a few sample words below, measured by cosine distance between word vectors (a query sketch follows the table).
Input Word | 口红 (lipstick) | 西屋 (Westinghouse) | 花花公子 (PLAYBOY) | 蓝月亮 (Blue Moon) | 联想 (Lenovo) | 骆驼 (Camel) |
---|---|---|---|---|---|---|
Similar 1 | 唇釉 | 典泛 | PLAYBOY | 威露士 | 宏碁 | CAMEL |
Similar 2 | 唇膏 | 法格 | 富贵鸟 | 增艳型 | 15IKB | 骆驼牌 |
Similar 3 | 纪梵希 | HS1250 | 霸王车 | 奥妙 | 14IKB | 健足乐 |
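The neighbours above can be reproduced with a simple cosine-similarity query. The sketch below uses gensim's KeyedVectors, assuming the file follows the standard word2vec text format; if it has no header line, pass no_header=True (gensim >= 4.0) or fall back to the load_vectors helper shown earlier.

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "BAAI-JDAI-WORD-EMBEDDING/JDAI-Word-Embedding.txt", binary=False)

for word in ["口红", "西屋", "联想"]:
    # most_similar ranks the vocabulary by cosine similarity.
    print(word, wv.most_similar(word, topn=3))
```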