LAGOS-AND: A Python repository from carmanzhang

Dataset

The dataset is available here and comes in two forms. The first is the full-name-block form, which arranges the dataset by ORCID iD and shared full name. The second is the pairwise form, in which ambiguous authors are arranged in pairs so that classifier-based models can use them to capture the similarity between two authors.
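
The relationship between the two forms can be illustrated with a minimal sketch. It assumes that a pair is built from two records in the same full-name block and that the pair is labeled 1 when both records carry the same ORCID iD. The actual layout of the pairwise files is not described here, so the tuple layout, the function name block_to_pairs, and the toy records below are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy records: (block_fullname, author_group_orcid, doi).
# All values are invented for illustration and are not taken from the dataset.
records = [
    ("li zhang", "0000-0000-0000-0001", "10.1000/a"),
    ("li zhang", "0000-0000-0000-0001", "10.1000/b"),
    ("li zhang", "0000-0000-0000-0002", "10.1000/c"),
    ("wei lu", "0000-0000-0000-0003", "10.1000/d"),
]


def block_to_pairs(rows):
    """Pair up records that share a block; label 1 when the ORCID iDs match."""
    # Collect the records of each full-name block.
    blocks = {}
    for fullname, orcid, doi in rows:
        blocks.setdefault(fullname, []).append((orcid, doi))

    # Enumerate every unordered pair inside each block.
    pairs = []
    for fullname, members in blocks.items():
        for (orcid_a, doi_a), (orcid_b, doi_b) in combinations(members, 2):
            same_author = int(orcid_a == orcid_b)
            pairs.append((fullname, doi_a, doi_b, same_author))
    return pairs


pairs = block_to_pairs(records)
```

A block with a single record (the "wei lu" block in the toy data) yields no pairs, since there is nothing to disambiguate within it.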

Our dataset vs. existing datasets

Creating a new dataset is painful. In author name disambiguation (AND) research, all existing datasets were created by human annotators, and most are either limited in scale or biased. Our dataset overcomes these problems. Building it requires no human intervention. Moreover, by using two comprehensive resources, the publishing history of a specific author (querying DOIs by ORCID iD) and the authors of a specific paper (querying ORCID iDs by DOI) can be identified easily and credibly. With the large number of records in these credible resources, a large-scale dataset can be built. More importantly, the dataset considers more realistic aspects than existing datasets. It passed a series of rigorous gold-standard validations, the two most important of which concern synonym patterns and domains: the dataset shows a degree of last-name variation similar to that of the entire MAG and covers a similarly wide range of research domains.

Dataset Structure

The block-based dataset contains the following fields:

Field	Data Type
block_fullname	String
author_group_orcid	String
author_group_idx_in_block	Int
citation_idx_in_author_group	Int
doi	String
pid	Int
author_position	Int
author_name	String
author_affiliation	String
coauthors	String Array
coauthor_affliations	String Array
venue	String
pub_year	Int
paper_title	String
paper_abstract	String

"block_fullname" is the credible full name (CFN) taken from the ORCID system and identifies the block. Because a block can contain more than one author, "author_group_orcid" holds the ORCID iD of a specific author in the block and identifies the group of citations (CG) written by that author, while "author_group_idx_in_block" gives the order of the CGs within a block. Similarly, "citation_idx_in_author_group" gives the order of a citation within its CG. "pid" is the paper ID in Microsoft Academic and Microsoft Academic Graph (MAG). "author_position" is identified by heuristics.

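As a rough illustration of how the block, CG, and citation fields nest, the sketch below regroups a few records into blocks and citation groups. It assumes the records have already been parsed into Python dictionaries keyed by the field names above. The on-disk file format and the starting value of the index fields are not stated here, so the zero-based indices, the helper name group_blocks, and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows using a subset of the fields listed above; values are invented.
rows = [
    {"block_fullname": "li zhang", "author_group_orcid": "0000-0000-0000-0002",
     "author_group_idx_in_block": 1, "citation_idx_in_author_group": 0, "pid": 30},
    {"block_fullname": "li zhang", "author_group_orcid": "0000-0000-0000-0001",
     "author_group_idx_in_block": 0, "citation_idx_in_author_group": 1, "pid": 20},
    {"block_fullname": "li zhang", "author_group_orcid": "0000-0000-0000-0001",
     "author_group_idx_in_block": 0, "citation_idx_in_author_group": 0, "pid": 10},
]


def group_blocks(rows):
    """Return {block_fullname: [(orcid, [pid, ...]), ...]} ordered by the index fields."""
    # Sort by block, then by CG order in the block, then by citation order in the CG.
    ordered = sorted(
        rows,
        key=lambda r: (
            r["block_fullname"],
            r["author_group_idx_in_block"],
            r["citation_idx_in_author_group"],
        ),
    )
    # Dictionaries keep insertion order, so the sorted order is preserved.
    blocks = defaultdict(dict)
    for r in ordered:
        groups = blocks[r["block_fullname"]]
        groups.setdefault(r["author_group_orcid"], []).append(r["pid"])
    return {name: list(groups.items()) for name, groups in blocks.items()}


blocks = group_blocks(rows)
```

Each block then maps to an ordered list of CGs, and each CG to an ordered list of paper IDs, which mirrors the block, CG, citation hierarchy that the index fields encode.
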
Citation

If you use the dataset, method, or model, please consider citing it.

@article{zhang2021lagos,
    title={LAGOS-AND: A Large, Gold Standard Dataset for Scholarly Author Name Disambiguation},
    author={Zhang, Li and Lu, Wei and Yang, Jinqing},
    journal={arXiv preprint arXiv:2104.01821},
    year={2021}
}
