
IndicLink is a Multilingual Fact Linking (MFL) dataset of sentences, each paired with the set of WikiData facts (subject; relation; object) that it contains. IndicLink covers English and six Indian languages: Hindi, Telugu, Tamil, Urdu, Gujarati and Assamese. The correct facts are chosen from an oracle of 4.7 million WikiData facts with fact labels/descriptions available in these seven languages. The dataset is intended only as a test set for evaluating models trained for the MFL task. For more details, please see https://arxiv.org/abs/2109.14364


IndicLink - KG Fact Linking Evaluation Dataset for Indian Languages

IndicLink is a KG Fact Linking dataset of sentences and their corresponding linked KG facts from WikiData. The dataset was collected as part of the publication Multilingual Fact Linking (AKBC'21). The sentences are available in English and six Indian languages: Hindi, Telugu, Tamil, Urdu, Gujarati and Assamese. The descriptions of the language-agnostic WikiData facts are also shared for these languages. The English sentences and their facts are taken from the test set of WebRED, and the sentences were translated into the Indian languages by professional translators.
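As a rough illustration of the structure described above, the following Python sketch models a sentence together with its linked (subject; relation; object) facts and scores a ranked list of predicted facts. The class names, field names, the example sentence, and the choice of a precision-at-1 check are all assumptions made for illustration; they do not describe the released file format or the official evaluation script.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Fact:
    """A KG fact as a (subject; relation; object) triple (hypothetical layout)."""
    subject: str
    relation: str
    obj: str


@dataclass(frozen=True)
class LinkedSentence:
    """A sentence in one language together with its gold linked facts."""
    language: str
    text: str
    gold_facts: FrozenSet[Fact]


def precision_at_1(example: LinkedSentence, ranked_facts: List[Fact]) -> float:
    """Return 1.0 if the top-ranked predicted fact is a gold fact, else 0.0."""
    if not ranked_facts:
        return 0.0
    return 1.0 if ranked_facts[0] in example.gold_facts else 0.0


# Hypothetical example; the sentence and fact labels are illustrative only.
gold = Fact("Barack Obama", "place of birth", "Honolulu")
example = LinkedSentence("en", "Barack Obama was born in Honolulu.", frozenset({gold}))
predictions = [gold, Fact("Barack Obama", "occupation", "politician")]
print(precision_at_1(example, predictions))
```

Averaging such a per-sentence score over the test examples of one language would give a per-language figure; any other ranking metric could be substituted in the same place.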

If you use this dataset, please cite:

    title={Multilingual Fact Linking}, 
    author={Keshav Kolluru and Martin Rezk and Pat Verga and William Cohen and Partha Talukdar},
    journal={Automated Knowledge Base Construction (AKBC)}

The distribution of examples across languages is as follows:

    IndicLink        English  Hindi  Telugu  Tamil  Urdu  Gujarati  Assamese
    #Test Examples   1002     889    888     881    1001  881       887
    #KG Facts        4.6M     230K   145K    248K   361K  91K       257

The oracle set of 4.6 million KG facts corresponds to all facts that occur in the 1 million most frequent WikiData entities. Their textual descriptions are always available in English but only sparsely available in other languages.
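To make the sparsity concrete, the short Python sketch below computes the share of the 4.6 million oracle facts that have a description in each language, using the "#KG Facts" counts listed above. The variable and function names are illustrative. Assamese is left out because its count is listed without a unit suffix.

```python
# Size of the oracle fact set (facts with English descriptions).
ORACLE_FACTS = 4_600_000

# Facts with a description per language, copied from the "#KG Facts" row.
described_facts = {
    "Hindi": 230_000,
    "Telugu": 145_000,
    "Tamil": 248_000,
    "Urdu": 361_000,
    "Gujarati": 91_000,
}


def coverage(n_described: int, n_oracle: int = ORACLE_FACTS) -> float:
    """Fraction of oracle facts that have a description in a given language."""
    return n_described / n_oracle


for language, count in described_facts.items():
    print(f"{language}: {coverage(count):.1%}")
```

With these counts, each listed language covers roughly 2% to 8% of the oracle facts.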


This data is licensed by Google LLC under a Creative Commons Attribution 4.0 International License. Users will be allowed to modify and repost it, and we encourage them to analyze and publish research based on the data.