/vip-chatbot

Task-based Dialogue System



1. Rasa Overview

  • The task-based dialogue system is built through secondary development on top of the Rasa framework.

  • (1) Visit the official Rasa website for an overview of Rasa;

  • (2) Get to know Rasa's two base modules: Rasa-nlu and Rasa-core;

  • (3) Installing Rasa: installation is straightforward on Linux and macOS; on Windows it requires compilation and is considerably more involved.

    pip install rasa_core==0.9.8
    pip install -U scikit-learn sklearn-crfsuite
    pip install git+https://github.com/mit-nlp/MITIE.git
    pip install jieba
    
  • (4) The Rasa dialogue pipeline:

    language: "zh"
    
    pipeline:
      - name: "nlp_mitie"  # 命名实体识别,词向量训练
        model: "data/total_word_feature_extractor.dat"  # 加载通过mitie预训练的词向量模型
      - name: "tokenizer_jieba"  # 结巴分词
        dictionary_path: "nlu_data/jieba_dictionary.txt"  # jieba自定义词典
      - name: "ner_mitie"  # 实体识别
      - name: "ner_synonyms"  # 同义词替换
      - name: "intent_entity_featurizer_regex"  # 额外的正则特征
      - name: "intent_featurizer_mitie"  # 意图特征提取(通过词向量,把每个词的词向量相加后取平均,作为句子特征的表示,作为sk-learn的输入)
      - name: "intent_classifier_sklearn"  # 意图识别分类器

2. Project Setup

  • 2.1 Project layout

    vip-chatbot
      |——consolution
          |——answer  # mapping files for the QA library
          |    |——qa.json  # normal QA: maps each action to its answer
          |    |——qa_by_entity.json  # single-turn fallback: maps an entity to related questions and answers
          |    |——qa_by_intent.json  # single-turn fallback: maps an intent to related questions and answers
          |——core_data
          |    |——domain.yml  # defines the intents, entities, slots, actions, and templates
          |    |——story.md  # story scripts pairing intents with actions
          |——models  # models saved after training
          |    |——nlu  # trained rasa-nlu intent-classification model
          |    |——dialogue  # trained rasa-core model
          |——nlu_data
          |    |——chatito  # sentence templates used to generate rasa-nlu-format training data
          |    |——train_data  # generated training data for the rasa-nlu intent classifier
          |         |——rasa_dataset_training.json  # JSON samples generated by chatito; also defines the synonyms
          |         |——regex.json  # regex definitions for the extra regex features
          |——static  # web front end of the consultation bot
          |——bot.py  # rasa-nlu / rasa-core training and the interface that runs the dialogue system
          |——myregex_entity_extractor.py  # custom entity-extractor component
          |——pipeline_config.yml  # rasa-nlu pipeline definition
          |——webchat.py  # python script that launches the web chatbot
          |——vip_action.py  # runs all the actions and finds the best answer
    
    
  • 2.2 Preparing the Rasa-nlu training data

    • (1) Define the intents, e.g. card application method (banka_fangshi), business queries (chaxun_work), and scope of use (use_fanwei).
    • (2) Write the training-data rule files: follow the format of the files under vip-chatbot/consolution/nlu_data/chatito. Each rule file consists of intent sentence patterns plus synonym lists, and their combinations are expanded to generate rasa-format training samples in bulk.
    • (3) Install Node.js: download it from the Node.js website, click through the installer, and restart the terminal; the npx command is then available.
    • (4) Generate the training data: cd into vip-chatbot/consolution/nlu_data and run npx chatito chatito --format=rasa; the rasa training file rasa_dataset_training.json is produced under ./nlu_data. Move it into vip-chatbot/consolution/nlu_data/train_data.
    • (5) Create the extra regex features: follow the format of vip-chatbot/consolution/nlu_data/train_data/regex.json; these regex features enrich the feature representation used for intent classification.
    • (6) The training data is now ready and training can begin; a quick sanity check of the generated file is sketched below.
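
    A minimal sketch for inspecting the generated file, assuming the standard rasa_nlu
    JSON layout (a rasa_nlu_data object holding common_examples):

      import json
      from collections import Counter

      # Load the chatito-generated training file (path from step (4)).
      with open("train_data/rasa_dataset_training.json", encoding="utf-8") as f:
          data = json.load(f)["rasa_nlu_data"]

      examples = data["common_examples"]
      print(len(examples), "examples")
      # How many samples did each intent receive?
      print(Counter(e["intent"] for e in examples))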
  • 2.3 Preparing the Rasa-core training data

    • domain.yml: defines the slots, intents, entities, actions, and fixed response templates (used for greetings or multi-turn replies)
      slots:
        slot_name_1:
          type: text
        slot_name_2:
          type: text
      intents:
        - intent_name_1
        - intent_name_2
      entities:
        - entity_name_1
        - entity_name_2
      templates:
        utter_greet:
          - "Hello"
          - "Hi"
        utter_goodbye:
          - "再见,为您服务很开心^_^"
          - "Bye,下次再见"
      actions:
        - action_name_1
        - action_name_2
    • story.md: builds the conversation training data from intents and actions.

  ## story greet  (story name; not used in training, but the official docs note it is shown while debugging)
      * greet
        - utter_greet
    
      ## story goodbye
      * goodbye
        - utter_goodbye
    
      ## story greet goodbye
      * greet
        - utter_greet
      * goodbye
        - utter_goodbye
    
      ## story inform num
  * inform_num{"num":"1"}  (the entity carried by this message)
    - Numaction

    • vip_action.py: the strategy file that maps each predicted action to its answer lookup (a hedged sketch of such an action follows this list).
    • myregex_entity_extractor.py: regex extraction for the slot entities.
    • With this, the training data is ready and training can begin.
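
    For orientation, a minimal sketch of what an action in vip_action.py could look like,
    assuming the pre-SDK rasa_core Action API; the class, action name, and qa.json lookup
    are illustrative, not the repo's actual implementation:

      import json
      from rasa_core.actions import Action

      class ActionBankaFangshi(Action):
          """Answers 'how to get a card' by looking the action name up in qa.json."""

          def name(self):
              # must match an action name declared in domain.yml
              return "Bankafangshi"

          def run(self, dispatcher, tracker, domain):
              with open("answer/qa.json", encoding="utf-8") as f:
                  qa = json.load(f)
              # qa.json maps the action name directly to its answer text (case 1 in 2.4)
              dispatcher.utter_message(qa[self.name()])
              return []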
  • 2.4 Preparing the QA library files:

    • qa.json: maps each action to its answer.
      # 1. an action maps directly to one answer (example: how to get a card)
      "Bankafangshi":"提供个人身份证原件和电话号码等信息,即可在官网办理会员卡。"
      # 2. the different entities under an action each map to their own answer (example: account queries)
      "Chaxunwork":{
        "订单":"在XX卡小程序上点击办卡进度即可查看订单。",
        "余额":"在微信公众号,选“其他-个人中心-我的会员卡”-绑定你的会员卡后首页点击会员卡—“账单查询”按钮,进入账单查询界面即可查询余额。"
      }
    • qa_by_entity.json, qa_by_intent.json: when the intent confidence falls below the threshold, fallback QA is triggered: the prepared candidate questions are sent to the user, who picks one and gets its answer. This is one way to compensate for missing intents or weak classification; entity-related questions are preferred, intent-related ones come second. (Write these two files after the intents and entities have been designed; a sketch of the selection logic follows.)
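
    A minimal sketch of that selection logic, assuming qa_by_entity.json and
    qa_by_intent.json each map a name to a list of {question, answer} pairs (the exact
    schema is an assumption):

      import json

      def fallback_candidates(entities, intent, max_questions=3):
          """Collect candidate questions, preferring entity matches over intent matches."""
          with open("answer/qa_by_entity.json", encoding="utf-8") as f:
              by_entity = json.load(f)
          with open("answer/qa_by_intent.json", encoding="utf-8") as f:
              by_intent = json.load(f)

          candidates = []
          # entity-related questions take priority ...
          for ent in entities:
              candidates.extend(by_entity.get(ent, []))
          # ... intent-related ones come second
          candidates.extend(by_intent.get(intent, []))
          return candidates[:max_questions]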

3. Model Training

  • Rasa-nlu: training the intent-classification model:

      def train_nlu():
          from rasa_nlu.training_data.loading import load_data  # newer API; merges every file in the directory
          from rasa_nlu.config import RasaNLUModelConfig  # newer API
          from rasa_nlu.model import Trainer
          from rasa_nlu.config import load

          training_data = load_data("nlu_data/train_data")
          # load() returns a RasaNLUModelConfig; note that RasaNLUModelConfig itself is built
          # from the parsed contents of the config file (a dict), not from a file name
          trainer = Trainer(load("pipeline_config.yml"))
          trainer.train(training_data)
          # path under which the intent-classification model is persisted
          model_directory = trainer.persist("models/", project_name="nlu",
                                            fixed_model_name="model_ner_reg_all")

          return model_directory
  • Rasa-core: training the action-prediction model:

      def train_dialogue(domain_file="core_data/domain.yml",
                         model_path="models/core/dialogue",
                         training_data_file="core_data/story.md",
                         max_history=3):
          from rasa_core.agent import Agent
          from rasa_core.policies.keras_policy import KerasPolicy
          from rasa_core.featurizers import (MaxHistoryTrackerFeaturizer,
                                             BinarySingleStateFeaturizer)
          from rasa_core.policies.fallback import FallbackPolicy
          # agent = Agent(domain_file,
          #               policies=[MemoizationPolicy(max_history=2), MobilePolicy()])
          agent = Agent(domain_file, policies=[
              KerasPolicy(MaxHistoryTrackerFeaturizer(BinarySingleStateFeaturizer(),
                                                      max_history=max_history)),
              FallbackPolicy(fallback_action_name='action_default_fallback',
                             core_threshold=0.3,
                             nlu_threshold=0.3)])
          # if a data path is given, load_data is called automatically
          agent.train(
              training_data_file,
              epochs=200,
              batch_size=16,
              augmentation_factor=50,
              validation_split=0.2
          )

          agent.persist(model_path)
          return agent
  • Running the demo:

    $ python webchat.py
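
    webchat.py serves the web front end in static/. For a quick console test without the
    web layer, a minimal sketch, assuming the rasa_core 0.x Agent/interpreter API and the
    model paths used above:

      from rasa_core.agent import Agent
      from rasa_core.interpreter import RasaNLUInterpreter

      # Wire the trained NLU model into the dialogue agent.
      interpreter = RasaNLUInterpreter("models/nlu/model_ner_reg_all")
      agent = Agent.load("models/core/dialogue", interpreter=interpreter)

      # handle_text() returns the bot replies for one user message.
      for reply in agent.handle_text("怎么办理会员卡"):  # hypothetical user query
          print(reply["text"])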
    

4. Intent-Classification Training in Detail

  • 4.1 Training driver and data handling: rasa_nlu/model.py
    def train(self, data, **kwargs):
        # type: (TrainingData) -> Interpreter
        """Trains the underlying pipeline using the provided training data."""
        # store the training data
        self.training_data = data
        # kwargs is the dict built from any key=value arguments passed in
        context = kwargs  # type: Dict[Text, Any]
        # gather each component's contribution to the shared pipeline context
        for component in self.pipeline:
            updates = component.provide_context()
            if updates:
                context.update(updates)

        # Before the training starts: check that all arguments are provided
        if not self.skip_validation:
            components.validate_arguments(self.pipeline, context)

        # data gets modified internally during the training - hence the copy
        working_data = copy.deepcopy(data)
        # train each pipeline component in order
        for i, component in enumerate(self.pipeline):
            logger.info("Starting to train component {}"
                        "".format(component.name))
            component.prepare_partial_processing(self.pipeline[:i], context)
            updates = component.train(working_data, self.config,
                                      **context)
            logger.info("Finished training component.")
            if updates:
                context.update(updates)

        return Interpreter(self.pipeline, context)

    # From the nlp_mitie component: exposes the MITIE feature extractor and the word-vector
    # file pretrained on Chinese Wikipedia (nlu_data/total_word_feature_extractor.dat)
    # to the downstream components via the shared context
    def provide_context(self):
        # type: () -> Dict[Text, Any]
        return {"mitie_feature_extractor": self.extractor,
                "mitie_file": self.component_config.get("model")}
  • 4.2 The customized pipeline components

    language: "zh"

    pipeline:
      - name: "nlp_mitie"  # initialize MITIE
        model: "nlu_data/yue_total_word_feature_extractor.dat"
      - name: "tokenizer_jieba"
        dictionary_path: "nlu_data/jieba_dictionary.txt"
      - name: "ner_mitie"
      - name: "myregex_entity_extractor.MyRegexEntityExtractor"  # custom regex entity extractor
      - name: "ner_synonyms"
      - name: "intent_entity_featurizer_regex"
      - name: "intent_featurizer_mitie"
      - name: "intent_classifier_sklearn"
  • 4.3 NER (named-entity recognition) training component; searches for the optimal penalty parameter C: rasa_nlu/extractors/mitie_entity_extractor.py
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig) -> None
        import mitie
        # load the word-vector file pretrained on (Chinese) Wikipedia
        model_file = kwargs.get("mitie_file")
        if not model_file:
            raise Exception("Can not run MITIE entity extractor without a "
                            "language model. Make sure this component is "
                            "preceeded by the 'nlp_mitie' component.")
        # initialize the MITIE NER trainer from the pretrained feature extractor
        trainer = mitie.ner_trainer(model_file)
        # single training thread by default
        trainer.num_threads = kwargs.get("num_threads", 1)
        found_one_entity = False

        # filter out pre-trained entity examples
        # iterate over the entity-annotated examples in the training data
        filtered_entity_examples = self.filter_trainable_entities(
                training_data.training_examples)

        for example in filtered_entity_examples:
            sample = self._prepare_mitie_sample(example)

            found_one_entity = sample.num_entities > 0 or found_one_entity
            trainer.add(sample)

        # Mitie will fail to train if there is not a single entity tagged
        if found_one_entity:
            self.ner = trainer.train()

    # prepare the entity annotations needed for training
    def filter_trainable_entities(self, entity_examples):
        # type: (List[Message]) -> List[Message]
        """Filters out untrainable entity annotations.

        Creates a copy of entity_examples in which entities that have
        `extractor` set to something other than self.name (e.g. 'ner_crf')
        are removed."""
        # collects each example's entity annotations (value, type) with their start/end positions
        filtered = []
        # iterate over every training example from the JSON file
        for message in entity_examples:
            entities = []
            # gather the entity annotations of this example
            for ent in message.get("entities", []):
                extractor = ent.get("extractor")
                if not extractor or extractor == self.name:
                    entities.append(ent)
            # write back the filtered entity list
            data = message.data.copy()
            data['entities'] = entities
            # e.g. for the utterance ‘我要上海明天的天气’ the (location, date) annotations are: {'intent': 'weather_address_date-time', 'entities': [{'start': 2, 'end': 4, 'value': '上海', 'entity': 'address'}, {'start': 4, 'end': 6, 'value': '明天', 'entity': 'date-time'}]}
            filtered.append(
                Message(text=message.text,
                        data=data,
                        output_properties=message.output_properties,
                        time=message.time))

        return filtered

    def _prepare_mitie_sample(self, training_example):
        import mitie
        # the raw training utterance, e.g. ‘我要上海明天的天气’
        text = training_example.text
        # its token list after segmentation: ['我要','上海','明天','的','天气']
        tokens = training_example.get("tokens")
        sample = mitie.ner_training_instance([t.text for t in tokens])
        # iterate over the annotated entities (location and date): {'start': 2, 'end': 4, 'value': '上海', 'entity': 'address'}, {'start': 4, 'end': 6, 'value': '明天', 'entity': 'date-time'}
        for ent in training_example.get("entities", []):
            try:
                # if the token is not aligned an exception will be raised
                start, end = MitieEntityExtractor.find_entity(
                        ent, text, tokens)
            except ValueError as e:
                logger.warning("Example skipped: {}".format(str(e)))
                continue
            try:
                # mitie will raise an exception on malicious
                # input - e.g. on overlapping entities
                sample.add_entity(list(range(start, end)), ent["entity"])
            except Exception as e:
                logger.warning("Failed to add entity example "
                               "'{}' of sentence '{}'. Reason: "
                               "{}".format(str(e), str(text), e))
                continue
        return sample

    def train(self):
        if self.size == 0:
            raise Exception("You can't call train() on an empty trainer.")
        # Make the type be a c_void_p so the named_entity_extractor constructor will know what to do.
        # trains the extractor; MITIE internally grid-searches the optimal penalty parameter C (this train() lives in MITIE's python wrapper)
        obj = ctypes.c_void_p(_f.mitie_train_named_entity_extractor(self.__obj))
        if obj is None:
            raise Exception("Unable to create named_entity_extractor.  Probably ran out of RAM")
        return named_entity_extractor(obj)
  • 4.4 Synonym-replacement training component: rasa_nlu/extractors/entity_synonyms.py
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData) -> None
        # read the synonym definitions from the JSON training data into self.synonyms
        for key, value in list(training_data.entity_synonyms.items()):
            self.add_entities_if_synonyms(key, value)
        # also map each annotated entity's surface text to its canonical value
        for example in training_data.entity_examples:
            for entity in example.get("entities", []):
                entity_val = example.text[entity["start"]:entity["end"]]
                self.add_entities_if_synonyms(entity_val,
                                              str(entity.get("value")))
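
    At inference time the component simply rewrites extracted entity values through the
    collected mapping. A toy illustration (the synonym pairs are hypothetical):

      # what ner_synonyms effectively does after entity extraction
      synonyms = {"vip卡": "会员卡", "贵宾卡": "会员卡"}  # hypothetical pairs

      def replace_synonyms(entities):
          for ent in entities:
              value = str(ent["value"]).lower()
              if value in synonyms:
                  ent["value"] = synonyms[value]  # normalize to the canonical value
          return entities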
  • 4.5 Custom regex feature-enhancement component: rasa_nlu/featurizers/regex_featurizer.py
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None

        # load the custom regex features defined in regex.json
        for example in training_data.regex_features:
            self.known_patterns.append(example)

        for example in training_data.training_examples:
            updated = self._text_features_with_regex(example)
            example.set("text_features", updated)
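
    Each known pattern contributes one binary dimension appended to the example's text
    features. A minimal sketch, assuming regex.json entries shaped like
    {"name": ..., "pattern": ...} (the rasa_nlu regex_features format):

      import re
      import numpy as np

      # one binary feature per known pattern: does the pattern occur in the text?
      known_patterns = [{"name": "num", "pattern": r"\d+"}]  # hypothetical entry

      def regex_feature_vector(text):
          return np.array([1.0 if re.search(p["pattern"], text) else 0.0
                           for p in known_patterns])

      # these values are appended to the dense MITIE sentence features
      print(regex_feature_vector("我要1张会员卡"))  # -> [1.]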
  • 4.6 Text-feature vectorization component: rasa_nlu/featurizers/mitie_featurizer.py
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None

        mitie_feature_extractor = self._mitie_feature_extractor(**kwargs)
        for example in training_data.intent_examples:
            # build the dense feature vector for this example
            features = self.features_for_tokens(example.get("tokens"),
                                                mitie_feature_extractor)
            example.set("text_features",
                        self._combine_with_existing_text_features(
                                example, features))
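
    features_for_tokens averages the MITIE word vectors of the tokens into one sentence
    vector (as noted in the pipeline comments in section 1). A minimal sketch of that
    computation, assuming a loaded MITIE total_word_feature_extractor:

      import numpy as np
      import mitie

      # load the pretrained word-vector file (path as in the pipeline config)
      extractor = mitie.total_word_feature_extractor("data/total_word_feature_extractor.dat")

      def sentence_features(tokens):
          # average the per-token word vectors into one fixed-size sentence vector
          vecs = [np.array(extractor.get_feature_vector(t)) for t in tokens]
          return np.mean(vecs, axis=0)

      print(sentence_features(["我要", "上海", "明天", "的", "天气"]).shape)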
  • 4.7 Intent-classifier training component: rasa_nlu/classifiers/sklearn_intent_classifier.py
    def train(self, training_data, cfg, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
        """Train the intent classifier on a data set."""
        # number of threads; passed to GridSearchCV as n_jobs, so raising it parallelizes the search
        num_threads = kwargs.get("num_threads", 1)
        # collect the intent label of every training example
        labels = [e.get("intent")
                  for e in training_data.intent_examples]
        # at least two different intent classes are required; otherwise warn and skip
        if len(set(labels)) < 2:
            logger.warn("Can not train an intent classifier. "
                        "Need at least 2 different classes. "
                        "Skipping training of intent classifier.")
        else:
            # encode the string labels as integers
            y = self.transform_labels_str2num(labels)

            # stack the per-example text-feature vectors into the training matrix X
            X = np.stack([example.get("text_features")
                          for example in training_data.intent_examples])
            # build the grid-search classifier
            self.clf = self._create_classifier(num_threads, y)
            # run the training
            self.clf.fit(X, y)

    def _create_classifier(self, num_threads, y):
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC
        # candidate values of the penalty parameter C, default [1, 2, 5, 10, 20, 100]
        C = self.component_config["C"]
        # linear kernel by default
        kernels = self.component_config["kernels"]
        # dirty str fix because sklearn is expecting
        # str not instance of basestr...
        tuned_parameters = [{"C": C,
                             "kernel": [str(k) for k in kernels]}]

        # aim for 5 examples in each fold
        cv_splits = self._num_cv_splits(y)
        # return the grid-search estimator
        return GridSearchCV(SVC(C=1,
                                probability=True,
                                class_weight='balanced'),
                            param_grid=tuned_parameters,
                            n_jobs=num_threads,
                            cv=cv_splits,
                            scoring='f1_weighted',
                            verbose=1)

    def _num_cv_splits(self, y):
        folds = self.component_config["max_cross_validation_folds"]
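        # e.g. with the default max_cross_validation_folds = 5 and a rarest intent class
        # of 12 examples: max(2, min(5, 12 // 5)) = 2 folds (about 5 examples per fold)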
        return max(2, min(folds, np.min(np.bincount(y)) // 5))