|-bert-base-chinese : pre-trained BERT model files (PyTorch)
| |--config.json : BERT configuration file
| |--pytorch_model.bin : BERT model weights
| |--vocab.txt : BERT vocabulary
|-data : datasets
| |--news : data for one domain
| |---train.txt
| |---dev.txt
| |---test.txt
| |---class.txt : label format used by the Span method (entity categories)
|-multi_domain : multi-domain NER
| |--data_processor.py : dataset construction
| |--model.py : models (SPAN, CRF, SoftMax)
| |--multi_domain_ner.py : main entry point (training, validation, testing, etc.)
| |--utils.py : basic utility functions
python --3.6
torch --1.4.0
torchvision --0.5.0
tensorboard --2.3.0
tensorboardX --2.1
transformers --3.1.0
tqdm --4.49.0
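As a convenience, the pinned versions above can be compared with the current environment. The following is a minimal sketch, not part of the project: `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and the comparison is an exact string match, so a local build suffix on an installed package would be reported as a mismatch.

```python
# Versions copied from the requirements list above.
PINNED = {
    "torch": "1.4.0",
    "torchvision": "0.5.0",
    "tensorboard": "2.3.0",
    "tensorboardX": "2.1",
    "transformers": "3.1.0",
    "tqdm": "4.49.0",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        # Available in the standard library from Python 3.8.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for older interpreters such as Python 3.6.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:  # exact string comparison (assumption)
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print("{}: wanted {}, found {}".format(name, wanted, found))
```

Passing a custom `lookup` keeps the comparison testable without installing anything.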
Argument | Description | Notes |
-h, --help | show this help message and exit | |
--data_dir DATA_DIR | The data folder path. | Root directory of the datasets |
--domain DOMAIN | The domain names (multiple domains separated by *) | Domain names, which are also the folder names (separated by *) |
--train | Training | Run training |
--dev | Development. | Run validation |
--test | Testing. | Run testing |
--output_dir OUTPUT_DIR | The output folder path. | Output folder |
--model MODEL | The model path. | Path of the model used for validation and testing |
--architecture {span} | The model architecture of neural network and what decoding method is adopted. | Model options: {span,crf} |
--train_batch_size TRAIN_BATCH_SIZE | The number of sentences contained in a batch during training. | Sentences per batch during training |
--test_batch_size TEST_BATCH_SIZE | The number of sentences contained in a batch during testing. | Sentences per batch during validation or testing |
--epochs EPOCHS | Total number of training epochs to perform. | Maximum number of training epochs |
--learning_rate LEARNING_RATE | The initial learning rate for Adam. | Learning rate |
--crf_lr CRF_LR | The initial learning rate of CRF layer. | Learning rate of the CRF layer |
--dropout DROPOUT | What percentage of neurons are discarded in the fully connected layers (0 ~ 1). | Dropout rate of the fully connected layers |
--max_len MAX_LEN | The maximum length of a sentence. | Maximum sentence length (a longer sentence is cut at the characters given by --split) |
--keep_last_n_checkpoints KEEP_LAST_N_CHECKPOINTS | Keep the last n checkpoints. | Number of most recent epoch checkpoints to keep |
--warmup_proportion WARMUP_PROPORTION | Proportion of training to perform linear learning rate warmup for. | Warmup |
--split SPLIT | Characters that segment a sentence. | Characters at which a sentence may be cut (such as punctuation) |
--tensorboard_dir TENSORBOARD_DIR | The data address of the tensorboard. | TensorBoard path |
--domain_loss_rate DOMAIN_LOSS_RATE | Weight of domain loss. | Weight of the domain classifier loss |
--domain_ner_loss_rate DOMAIN_NER_LOSS_RATE | Weight of domain NER loss. | Weight of the integrated SPAN loss |
--bert_config_file BERT_CONFIG_FILE | The config json file corresponding to the pre-trained BERT model. This specifies the model architecture. | Pre-trained BERT model |
--cpu | Whether to use the CPU. If not set and CUDA is available, the GPU is used. | Set to run on the CPU |
--seed SEED | Random seed for initialization. | Random seed |
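To illustrate how a few of these options might be consumed, here is a minimal sketch. It is not the project's actual parser: the names `build_parser` and `split_sentence`, every default value, and the greedy cutting strategy are assumptions made for illustration. Only the flag names and help strings are taken from the table.

```python
import argparse


def build_parser():
    """Hypothetical subset of the parser; all defaults are assumptions."""
    parser = argparse.ArgumentParser(description="multi-domain NER (sketch)")
    parser.add_argument("--data_dir", default="data",
                        help="The data folder path.")
    parser.add_argument("--domain", default="news",
                        help="The domain names (multiple domains separated by *)")
    parser.add_argument("--train", action="store_true", help="Training")
    parser.add_argument("--dev", action="store_true", help="Development.")
    parser.add_argument("--test", action="store_true", help="Testing.")
    parser.add_argument("--max_len", type=int, default=128,
                        help="The maximum length of a sentence.")
    parser.add_argument("--split", default="，。！？；",
                        help="Characters that segment a sentence.")
    return parser


def split_sentence(text, max_len, split_chars):
    """Cut text into pieces of at most max_len characters.

    Each cut is placed right after the last split character inside the
    current window; if the window holds none, the cut is made at max_len.
    """
    pieces = []
    while len(text) > max_len:
        window = text[:max_len]
        # Index of the last split character in the window, or -1 if none.
        cut = max((window.rfind(ch) for ch in split_chars), default=-1)
        end = cut + 1 if cut >= 0 else max_len
        pieces.append(text[:end])
        text = text[end:]
    if text:
        pieces.append(text)
    return pieces


if __name__ == "__main__":
    args = build_parser().parse_args()
    domains = args.domain.split("*")  # one folder under data_dir per domain
    print(domains, args.train, args.dev, args.test)
```

When passing several domains on a command line, quoting the value (for example `"news*finance"`, where `finance` is a made-up second domain) keeps the shell from treating `*` as a wildcard.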
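--train, --dev, and --test select the run mode:
- With --train alone, training runs for the fixed number of epochs and the final model is saved as checkpoint-last.kpl
- With both --train and --dev, the model with the highest score on the development set is additionally saved as checkpoint-best.kpl
- --test runs testing; it uses checkpoint-best.kpl if that file exists, otherwise checkpoint-last.kpl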
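The checkpoint preference used for testing can be sketched as follows. This is an illustration only: `pick_checkpoint` is a hypothetical helper, and it assumes both checkpoint files sit directly inside one folder, which the description above does not state.

```python
import os


def pick_checkpoint(folder):
    """Return the preferred checkpoint path in folder, or None if none exists.

    checkpoint-best.kpl is preferred; checkpoint-last.kpl is the fallback.
    """
    for name in ("checkpoint-best.kpl", "checkpoint-last.kpl"):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            return path
    return None
```

Returning `None` rather than raising leaves it to the caller to decide how to report a missing model.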