MatchZoo

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

MatchZoo is a toolkit for text matching. It was developed to facilitate the design, comparison, and sharing of deep text matching models. It implements a number of deep matching methods, such as DRMM, MatchPyramid, MV-LSTM, aNMM, DUET, ARC-I, ARC-II, DSSM, and CDSSM, behind a unified interface (collection of papers: awesome-neural-models-for-semantic-match). Tasks that MatchZoo can be applied to include document retrieval, question answering, conversational response ranking, and paraphrase identification. We are always happy to receive code contributions, suggestions, and comments from MatchZoo users.

Tasks                     | Text 1   | Text 2     | Objective
Paraphrase Identification | string 1 | string 2   | classification
Textual Entailment        | text     | hypothesis | classification
Question Answer           | question | answer     | classification/ranking
Conversation              | dialog   | response   | classification/ranking
Information Retrieval     | query    | document   | ranking

We're actively developing MatchZoo 2.0 with everything improved. Stay tuned! See branch 2.0 for early access.

Installation

MatchZoo is still under development. Before the first stable release (1.0), please clone the repository and run

git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install

When run from the main directory, this installs the dependencies automatically.

For usage examples, you can run

python matchzoo/main.py --phase train --model_file examples/toy_example/config/arci_ranking.config
python matchzoo/main.py --phase predict --model_file examples/toy_example/config/arci_ranking.config

Overview

The architecture of the MatchZoo toolkit is shown in the following figure.

[Figure: architecture of the MatchZoo toolkit]

There are three major modules in the toolkit: data preparation, model construction, and training and evaluation. These three modules are organized as a data-flow pipeline.

Data Preparation

The data preparation module converts the datasets of different text matching tasks into a unified format that serves as the input of deep matching models. Users provide datasets that contain pairs of texts along with their labels, and the module produces the following files.

  • Word Dictionary: records the mapping from each word to a unique identifier called wid. Words that are too frequent (e.g. stopwords), too rare or noisy (e.g. fax numbers) can be filtered out by predefined rules.
  • Corpus File: records the mapping from each text to a unique identifier called tid, along with the sequence of word identifiers contained in that text. Note that each text is truncated or padded to a fixed length chosen by the user.
  • Relation File: stores the relationship between two texts, with each line containing a pair of tids and the corresponding label.
  • Detailed Input Data Format: a detailed explanation of the input data format can be found in MatchZoo/data/toy_example/readme.md.
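
As a rough illustration of the three outputs described above, the following sketch builds a word dictionary, a corpus mapping, and a relation list from a few labelled text pairs. The helper name `prepare`, the `T0`-style tids, the padding id 0, and the frequency thresholds are assumptions made for this example; they are not MatchZoo's actual preprocessing API or on-disk file layout.

```python
from collections import Counter


def prepare(pairs, max_len=5, min_freq=1, max_freq=None, pad_wid=0):
    """Build (word_dict, corpus, relations) from (text1, text2, label) triples.

    Hypothetical helper for illustration only; not part of MatchZoo.
    """
    # Count word frequencies over every text so that rare or overly
    # frequent words can be filtered out by simple threshold rules.
    freq = Counter()
    for text1, text2, _ in pairs:
        freq.update(text1.lower().split())
        freq.update(text2.lower().split())

    # Word dictionary: word -> wid. Id 0 is reserved for padding (assumption).
    word_dict = {}
    for word in sorted(freq):
        count = freq[word]
        if count < min_freq or (max_freq is not None and count > max_freq):
            continue
        word_dict[word] = len(word_dict) + 1

    corpus = {}       # tid -> fixed-length list of wids
    text_to_tid = {}  # raw text -> tid, so identical texts share one tid
    relations = []    # (tid1, tid2, label)

    def tid_of(text):
        if text not in text_to_tid:
            tid = "T%d" % len(text_to_tid)
            text_to_tid[text] = tid
            wids = [word_dict[w] for w in text.lower().split() if w in word_dict]
            wids = wids[:max_len]                      # truncate long texts
            wids += [pad_wid] * (max_len - len(wids))  # pad short texts
            corpus[tid] = wids
        return text_to_tid[text]

    for text1, text2, label in pairs:
        relations.append((tid_of(text1), tid_of(text2), label))
    return word_dict, corpus, relations


pairs = [
    ("how to learn python", "python tutorial for beginners", 1),
    ("how to learn python", "best pizza in town", 0),
]
word_dict, corpus, relations = prepare(pairs, max_len=5)
for tid1, tid2, label in relations:
    print(tid1, corpus[tid1], tid2, corpus[tid2], label)
```

Because the shared question appears in both pairs, it receives a single tid, and both relation entries point to it.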

Model Construction

In the model construction module, we use the Keras library to help users build deep matching models conveniently, layer by layer. Keras provides a set of common layers widely used in neural models, such as convolutional, pooling, and dense layers. To further facilitate the construction of deep text matching models, we extend Keras with layer interfaces designed specifically for text matching.

Moreover, the toolkit has implemented two schools of representative deep text matching models, namely representation-focused models and interaction-focused models [Guo et al.].
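
To make the distinction between the two schools concrete, here is a minimal sketch in plain Python. A representation-focused model first encodes each text into a single vector and then compares the two vectors; an interaction-focused model first builds a word-by-word interaction matrix and then aggregates it. The toy embeddings, the mean-pooling encoder, and the max-then-average aggregation are assumptions chosen for brevity; in actual models both steps are learned neural networks, and none of these function names come from MatchZoo.

```python
import math

# Toy 2-dimensional word embeddings (made-up values for illustration).
EMB = {
    "cheap": [1.0, 0.0],
    "inexpensive": [0.9, 0.1],
    "flights": [0.0, 1.0],
    "tickets": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representation_score(text1, text2):
    """Representation-focused: encode each text independently, then compare."""
    def encode(words):
        vecs = [EMB[w] for w in words]
        # Mean of the word embeddings stands in for a learned encoder.
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return cosine(encode(text1), encode(text2))


def interaction_matrix(text1, text2):
    """Interaction-focused, step 1: word-by-word similarity matrix."""
    return [[cosine(EMB[w1], EMB[w2]) for w2 in text2] for w1 in text1]


def interaction_score(text1, text2):
    """Interaction-focused, step 2: aggregate the matrix into one score.

    Here: best match per word of text1, averaged (a stand-in for the
    convolutional or pooling layers a real model would learn).
    """
    matrix = interaction_matrix(text1, text2)
    return sum(max(row) for row in matrix) / len(matrix)


query = ["cheap", "flights"]
doc = ["inexpensive", "tickets"]
print(representation_score(query, doc))
print(interaction_matrix(query, doc))
print(interaction_score(query, doc))
```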

Training and Evaluation

To train the deep matching models, the toolkit provides a variety of objective functions for regression, classification, and ranking. For example, the ranking-related objective functions include several well-known pointwise, pairwise, and listwise losses. Users are free to choose among these objective functions for optimization in the training phase. Once a model has been trained, the toolkit can be used to produce a matching score, predict a matching label, or rank target texts (e.g., documents) against an input text.
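
The difference between pointwise, pairwise, and listwise objectives can be sketched as follows. These are textbook forms (squared error, hinge loss with a margin, and a ListNet-style softmax cross-entropy) written in plain Python; which exact losses MatchZoo ships and how they are named is not stated here, so the function names and the default margin are assumptions for illustration.

```python
import math


def pointwise_squared_loss(score, label):
    """Pointwise: each (text pair, label) is scored independently."""
    return (score - label) ** 2


def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Pairwise: a relevant text should outscore an irrelevant one by a margin."""
    return max(0.0, margin - pos_score + neg_score)


def listwise_softmax_loss(scores, labels):
    """Listwise: compare the whole score list with the whole label list.

    Cross-entropy between softmax(labels) and softmax(scores).
    """
    def softmax(xs):
        m = max(xs)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


# One query with three candidate texts; only the first is relevant.
labels = [1, 0, 0]
good_scores = [2.0, 0.0, -1.0]   # relevant text ranked first
bad_scores = [-1.0, 0.0, 2.0]    # relevant text ranked last

print(pointwise_squared_loss(0.8, 1.0))
print(pairwise_hinge_loss(good_scores[0], good_scores[1]))
print(listwise_softmax_loss(good_scores, labels))
print(listwise_softmax_loss(bad_scores, labels))
```

The pairwise loss is zero once the relevant text leads by at least the margin, and the listwise loss is lower for the ordering that places the relevant text first.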

Benchmark Results:

Here, we use two representative datasets to show how MatchZoo is used for ranking and classification. For the ranking task, we use the WikiQA dataset; for the classification task, we use the QuoraQP dataset.

WikiQA for Ranking

WikiQA is a popular benchmark dataset for answer sentence selection in question answering. We provide a script that downloads the dataset and converts it into the MatchZoo data format. The models directory contains a configuration for each model on the WikiQA dataset.

Take DRMM as an example. In the training phase, you can run

python matchzoo/main.py --phase train --model_file examples/wikiqa/config/drmm_wikiqa.config

In the testing phase, you can run

python matchzoo/main.py --phase predict --model_file examples/wikiqa/config/drmm_wikiqa.config

We have compared the models listed below; the results are as follows.

Models       | NDCG@3 | NDCG@5 | MAP
DSSM         | 0.5439 | 0.6134 | 0.5647
CDSSM        | 0.5489 | 0.6084 | 0.5593
ARC-I        | 0.5680 | 0.6317 | 0.5870
ARC-II       | 0.5647 | 0.6176 | 0.5845
MV-LSTM      | 0.5818 | 0.6452 | 0.5988
DRMM         | 0.6107 | 0.6621 | 0.6195
K-NRM        | 0.6268 | 0.6693 | 0.6256
aNMM         | 0.6160 | 0.6696 | 0.6297
DUET         | 0.6065 | 0.6722 | 0.6301
MatchPyramid | 0.6317 | 0.6913 | 0.6434
DRMM_TKS     | 0.6458 | 0.6956 | 0.6586
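
For readers unfamiliar with the metrics in the table, the following sketch shows one common way to compute MAP and NDCG@k from per-query candidate scores and binary relevance labels. The gain formula `2^rel - 1`, the handling of queries without relevant answers, and the function names are assumptions for illustration; MatchZoo's own evaluation code may differ in these details.

```python
import math


def average_precision(ranked_labels):
    """Average precision for one query; labels are given in ranked order."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(ranked_labels[:k], start=1)
    )


def ndcg_at_k(ranked_labels, k):
    """DCG normalised by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal else 0.0


def evaluate(queries, k=3):
    """queries: one list of (score, label) candidates per query.

    Returns (MAP, mean NDCG@k) over all queries.
    """
    aps, ndcgs = [], []
    for candidates in queries:
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        ranked_labels = [label for _, label in ranked]
        aps.append(average_precision(ranked_labels))
        ndcgs.append(ndcg_at_k(ranked_labels, k))
    return sum(aps) / len(aps), sum(ndcgs) / len(ndcgs)


# Two toy queries with model scores and gold labels (made-up values).
queries = [
    [(0.9, 0), (0.8, 1), (0.1, 0)],
    [(0.7, 1), (0.2, 0)],
]
map_score, ndcg3 = evaluate(queries, k=3)
print(map_score, ndcg3)
```
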

The training loss of each model is shown in the following figure.

[Figure: training loss of each model on WikiQA]

The MAP of each model on the test set is shown in the following figure.

[Figure: test MAP of each model on WikiQA]

Here, DRMM_TKS is a variant of DRMM for short text matching. Specifically, the matching histogram is replaced by a top-k max-pooling layer, and the remaining parts are unchanged.
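
The following sketch contrasts the two ways of summarising the similarities between one query term and all document terms. The histogram function is a simplified stand-in for DRMM's matching histogram (equal-width bins over [-1, 1], raw counts), and the pooling function keeps the k largest similarities, as the top-k max-pooling layer is described above. Bin layout, padding value, and function names are assumptions for this example, not the toolkit's implementation.

```python
def matching_histogram(sims, bins=5):
    """Count similarities in [-1, 1] into equal-width bins (simplified)."""
    counts = [0] * bins
    for s in sims:
        # Map s from [-1, 1] to a bin index; s == 1.0 falls in the last bin.
        idx = min(int((s + 1.0) / 2.0 * bins), bins - 1)
        counts[idx] += 1
    return counts


def top_k_max_pooling(sims, k=3, pad=0.0):
    """Keep the k largest similarities, padding when fewer are available."""
    top = sorted(sims, reverse=True)[:k]
    return top + [pad] * (k - len(top))


# Similarity matrix: one row per query term, one column per document term
# (made-up values for illustration).
sim_matrix = [
    [0.1, 0.9, 0.4, 0.7],
    [-0.2, 0.3, 1.0, 0.0],
]

histograms = [matching_histogram(row, bins=4) for row in sim_matrix]
pooled = [top_k_max_pooling(row, k=2) for row in sim_matrix]
print(histograms)
print(pooled)
```

Both summaries have a fixed length per query term regardless of document length. With short texts, a histogram has few values to count, whereas the pooled vector keeps the strongest match signals directly.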

QuoraQP for Classification

QuoraQP (Quora Question Pairs) is a text matching competition from Kaggle in which the goal is to predict whether a given pair of questions have the same meaning. We provide a script that downloads the dataset and converts it into the MatchZoo data format. The models directory contains a configuration for each model on the QuoraQP dataset.

Take MatchPyramid as an example. In the training phase, you can run

python matchzoo/main.py --phase train --model_file examples/QuoraQP/config/matchpyramid_quoraqp.config

In the testing phase, you can run

python matchzoo/main.py --phase predict --model_file examples/QuoraQP/config/matchpyramid_quoraqp.config

The training loss of each model is shown in the following figure.

[Figure: training loss of each model on QuoraQP]

The precision of each model on the test set is shown in the following figure.

[Figure: test precision of each model on QuoraQP]

Model Details:

  1. DRMM

This model is an implementation of "A Deep Relevance Matching Model for Ad-hoc Retrieval".

  • model file: models/drmm.py
  • model config: models/drmm_ranking.config

  2. MatchPyramid

This model is an implementation of "Text Matching as Image Recognition".

  • model file: models/matchpyramid.py
  • model config: models/matchpyramid_ranking.config

  3. ARC-I

This model is an implementation of "Convolutional Neural Network Architectures for Matching Natural Language Sentences".

  • model file: models/arci.py
  • model config: models/arci_ranking.config

  4. DSSM

This model is an implementation of "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data".

  • model file: models/dssm.py
  • model config: models/dssm_ranking.config

  5. CDSSM

This model is an implementation of "Learning Semantic Representations Using Convolutional Neural Networks for Web Search".

  • model file: models/cdssm.py
  • model config: models/cdssm_ranking.config

  6. ARC-II

This model is an implementation of "Convolutional Neural Network Architectures for Matching Natural Language Sentences".

  • model file: models/arcii.py
  • model config: models/arcii_ranking.config

  7. MV-LSTM

This model is an implementation of "A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations".

  • model file: models/mvlstm.py
  • model config: models/mvlstm_ranking.config

  8. aNMM

This model is an implementation of "aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model".

  • model file: models/anmm.py
  • model config: models/anmm_ranking.config

  9. DUET

This model is an implementation of "Learning to Match Using Local and Distributed Representations of Text for Web Search".

  • model file: models/duet.py
  • model config: models/duet_ranking.config

  10. K-NRM

This model is an implementation of "End-to-End Neural Ad-hoc Ranking with Kernel Pooling".

  • model file: models/knrm.py
  • model config: models/knrm_ranking.config

  11. CONV-KNRM

This model is an implementation of "Convolutional neural networks for soft-matching n-grams in ad-hoc search".

  • model file: models/convknrm.py
  • model config: models/convknrm.config

  12. Models under development

Match-SRNN, DeepRank, ...

Citation

@article{fan2017matchzoo,
  title={MatchZoo: A Toolkit for Deep Text Matching},
  author={Fan, Yixing and Pang, Liang and Hou, JianPeng and Guo, Jiafeng and Lan, Yanyan and Cheng, Xueqi},
  journal={arXiv preprint arXiv:1707.07270},
  year={2017}
}

Project Organizers

  • Jiafeng Guo
    • Institute of Computing Technology, Chinese Academy of Sciences
    • HomePage
  • Yanyan Lan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • HomePage
  • Xueqi Cheng
    • Institute of Computing Technology, Chinese Academy of Sciences
    • HomePage

Environment

  • python 2.7+
  • tensorflow 1.2+
  • keras 2.0.6+
  • nltk 3.2.2+
  • tqdm 4.19.4+
  • h5py 2.7.1+

Development Teams

  • Yixing Fan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Google Scholar
  • Liang Pang
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Google Scholar
  • Liu Yang
    • Center for Intelligent Information Retrieval, University of Massachusetts Amherst
    • HomePage

Acknowledgements

We would like to express our appreciation to the following people for contributing source code to MatchZoo: Yixing Fan, Liang Pang, Liu Yang, Wang Bo, Yukun Zheng, Lijuan Chen, Jianpeng Hou, Zhou Yang, Niuguo Cheng, and others.

Feedback and Join Us

Feel free to post any questions or suggestions on GitHub Issues, and we will reply to you there. You can also suggest new deep text matching models to add to MatchZoo, or apply to join us in developing MatchZoo.

Update on 12/10/2017: We have applied for another WeChat ID: CLJ_Keep. Anyone who wants to join the WeChat group can add this WeChat ID as a friend. Please tell us your name, company or school, and city when you send the request. After you add "CLJ_Keep" as a WeChat friend, she will invite you to join the MatchZoo WeChat group. "CLJ_Keep" is a member of the MatchZoo team.

Update on 04/07/2018: We have created a Google discussion group, MatchZoo Discuss, to better support Q&A discussions among our users. You can post any questions or suggestions about the MatchZoo toolkit there. The developers and other experienced users from our community will reply to your questions.