Existing work on Knowledge Graph based Multilingual Question Answering (KG-MLQA) mainly focuses on the semantic parsing of multilingual questions but ignores the combination of multilingual knowledge. As a result, QA systems remain limited to monolingual resources and cannot cover all questions. Through a semi-automatic template synthesis process, we present MLPQ, a parallel path question answering dataset based on bilingual knowledge graphs extracted from DBpedia, which contains 827k questions and covers three language pairs (Chinese/English, Chinese/French, and English/French). Each question in MLPQ involves two or three relations and requires integrating information from bilingual knowledge graphs. Based on MLPQ, we propose the first QA task over multilingual KGs, named Cross-lingual Path Question (CLPQ). We use popular path question answering and multiple-knowledge question answering models to establish two baselines for MLPQ. Experiments show that existing QA models cannot answer CLPQ questions precisely. This work may further promote the development of multilingual KGQA and information retrieval.
MLPQ contains a total of 827k questions covering three language pairs (Chinese/English, Chinese/French, and English/French). Answering each question requires 2-hop or 3-hop cross-lingual path inference.
We establish MLPQ through a four-step semi-automatic process:
- Triple pair selection: obtain the candidate triple pairs (2-hop and 3-hop) based on the Inter-Language Links (ILLs) of DBpedia;
- Template construction: build single-hop templates and synthesize them into multi-hop templates;
- Diversification: increase template diversity through paraphrasing;
- Question building: generate questions by filling topic entities into the templates.
For a more detailed explanation, please refer to our paper.
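The steps above can be illustrated with a minimal sketch. It is not the code used to build MLPQ: the toy triples, the ILL mapping, the `{e}` placeholder convention, the "What is ...?" wrapper, and every function name below are assumptions made up for illustration. The paraphrasing step is omitted.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy data invented for this sketch (not taken from MLPQ or a DBpedia dump).
EN_KG: List[Triple] = [("Inception", "director", "Christopher_Nolan")]
ZH_KG: List[Triple] = [("克里斯托弗·诺兰", "出生地", "伦敦")]
ILLS: Dict[str, str] = {"Christopher_Nolan": "克里斯托弗·诺兰"}

# Hypothetical single-hop templates, phrased in the question language (English).
SINGLE_HOP_EN: Dict[str, str] = {
    "director": "the director of {e}",
    "出生地": "the birthplace of {e}",
}


def select_triple_pairs(src_kg, tgt_kg, ills):
    """Step 1: pair triples whose bridge entity is linked across languages."""
    by_subject: Dict[str, List[Triple]] = {}
    for triple in tgt_kg:
        by_subject.setdefault(triple[0], []).append(triple)
    pairs = []
    for first in src_kg:
        linked = ills.get(first[2])  # object of hop 1, mapped to the other KG
        for second in by_subject.get(linked, []):
            pairs.append((first, second))
    return pairs


def compose_template(relations, single_hop):
    """Step 2: nest single-hop templates into one multi-hop template."""
    phrase = "{e}"
    for rel in relations:
        phrase = single_hop[rel].replace("{e}", phrase)
    return "What is " + phrase + "?"


def build_questions(pairs, single_hop):
    """Step 4: fill the topic entity into the composed template."""
    questions = []
    for first, second in pairs:
        template = compose_template([first[1], second[1]], single_hop)
        text = template.replace("{e}", first[0].replace("_", " "))
        questions.append(
            {"question": text, "path": [first, second], "answer": second[2]}
        )
    return questions


if __name__ == "__main__":
    for q in build_questions(select_triple_pairs(EN_KG, ZH_KG, ILLS), SINGLE_HOP_EN):
        print(q["question"], "->", q["answer"])
```

In this toy setup the question is asked in English while the answer entity lives in the Chinese graph, which is what makes the path cross-lingual: the second hop can only be followed after crossing the ILL.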
Number of triple pairs extracted from DBpedia to generate questions of CLPQ (Top-200 relations of each language):
Language pair | Direction | 2-hop | 3-hop |
---|---|---|---|
en-zh | en→zh | 2743557 | 5022783 |
en-zh | zh→en | 9415 | 32895 |
zh-fr | zh→fr | 8618 | 20443 |
zh-fr | fr→zh | 506695 | 769711 |
en-fr | en→fr | 533786 | 1708099 |
en-fr | fr→en | 10816641 | 4868011 |
Statistics of each subset of MLPQ ("Q" means "questions", "Lan" means "language"):
Lan. pair | Type | Q. Lan. | #Q | #Ent | #Rel |
---|---|---|---|---|---|
en-zh | 2-hop | en | 65741 | 49016 | 89 |
en-zh | 2-hop | zh | 129535 | 49225 | 89 |
en-zh | 3-hop | en | 97107 | 68143 | 198 |
en-zh | 3-hop | zh | 85167 | 21368 | 147 |
zh-fr | 2-hop | zh | 59650 | 22690 | 86 |
zh-fr | 2-hop | fr | 27850 | 22532 | 86 |
zh-fr | 3-hop | zh | 34796 | 12938 | 146 |
zh-fr | 3-hop | fr | 35076 | 30732 | 189 |
en-fr | 2-hop | en | 71605 | 52913 | 77 |
en-fr | 2-hop | fr | 49061 | 53012 | 77 |
en-fr | 3-hop | en | 93249 | 62988 | 185 |
en-fr | 3-hop | fr | 78229 | 72505 | 172 |
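As a quick sanity check, the per-subset question counts in the table above can be summed to confirm the 827k figure. The snippet below only re-enters the #Q column; the dictionary layout and variable names are our own choice for illustration.

```python
from collections import defaultdict

# #Q column of the statistics table, keyed by (language pair, type, question language).
SUBSET_QUESTION_COUNTS = {
    ("en-zh", "2-hop", "en"): 65741,
    ("en-zh", "2-hop", "zh"): 129535,
    ("en-zh", "3-hop", "en"): 97107,
    ("en-zh", "3-hop", "zh"): 85167,
    ("zh-fr", "2-hop", "zh"): 59650,
    ("zh-fr", "2-hop", "fr"): 27850,
    ("zh-fr", "3-hop", "zh"): 34796,
    ("zh-fr", "3-hop", "fr"): 35076,
    ("en-fr", "2-hop", "en"): 71605,
    ("en-fr", "2-hop", "fr"): 49061,
    ("en-fr", "3-hop", "en"): 93249,
    ("en-fr", "3-hop", "fr"): 78229,
}

# Overall total and a breakdown per language pair.
total = sum(SUBSET_QUESTION_COUNTS.values())
per_pair = defaultdict(int)
for (pair, _hops, _lang), count in SUBSET_QUESTION_COUNTS.items():
    per_pair[pair] += count

if __name__ == "__main__":
    print("total:", total)
    for pair, count in per_pair.items():
        print(pair, count)
```

The total comes to 827,066 questions, with en-zh contributing 377,550, en-fr 292,144, and zh-fr 157,372.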
- The datasets are available in two formats. One is in RDF format, the other is in a custom format similar to the datasets used in IRN.
- All the datasets are in the datasets directory. For an explanation of the file naming conventions and our custom format, please refer to that directory.
- We establish two baseline models for MLPQ, based on the popular multi-hop reasoning model IRN and a multiple-KG QA model, each combined with a representative Cross-lingual Entity Alignment (CLEA) model.
- The two baselines are called MIRN and CL-MKQA respectively.
- The baseline code is in the baselines directory. To try these baselines, please refer to that directory.
In this slightly improved version, we corrected many grammatical errors and added the RDF version of all the datasets.
- Currently the MLPQ version is 1.1. We expect to further the work and provide datasets of higher quality and more variety in the future.
- Because the generation of MLPQ is semi-automatic and relies on manually crafted templates and machine translation to some degree, there might be some minor problems in the text. We have tried to improve the quality of MLPQ by post-editing, and very few problems should remain. However, if you find any errors in the dataset, please contact us.
For now, MLPQ mainly contains 2-hop and 3-hop path questions. In the future, we plan to adopt paraphrase generation based on web resources to create a wider variety of question expressions. Path questions are only one subset of complex questions; we also plan to update and augment fact-oriented complex questions with property information and to explore aggregation-type complex questions.
This project is licensed under the GPLv3 License; see the LICENSE file for details.