List of manually annotated author name disambiguation datasets

1. REXA: https://github.com/tapilab/rexa-coref-data

Aron Culotta and Pallika Kanani and Robert Hall and Michael Wick and Andrew McCallum. Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function. Sixth International Workshop on Information Integration on the Web (IIWeb-07), 2007.

2. Aminer: https://www.aminer.cn/disambiguation

http://arnetminer.org/lab-datasets/disambiguation/rich-author-disambiguation-data.zip

Jie Tang and Alvis C.M. Fong and Bo Wang and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. 
IEEE Transactions on Knowledge and Data Engineering, 2012.

3. PubMed: https://github.com/Yonsei-TSMM/author_name_disambiguation

Song, M., Kim, E.H.J., Kim, H.J. Exploring Author Name Disambiguation on PubMed-scale, Journal of Informetrics.

4. PENN: http://clgiles.ist.psu.edu/data/nameset_author-disamb.tar.zip

Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method.
Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, 334-343.

5. QIAN: https://github.com/yaya213/DBLP-Name-Disambiguation-Dataset

Qian, Y., Zheng, Q., Sakai, T., Ye, J., & Liu, J. (2015). Dynamic author name disambiguation for growing digital libraries.
Information Retrieval Journal, 18(5), 379-412

The data consist of 6,783 name record instances in 574 ambiguous name groups. They were created by combining
other labeled datasets (including PENN and AMINER), then de-duplicating the records and correcting errors.

6. KISTI: http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/DBLP.tar.gz/at_download/file

Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a large-scale test set for author disambiguation.
Information Processing & Management, 47(3), 452-465.

This labeled dataset was created by researchers at the Korea Institute of Science and Technology Information (KISTI) in collaboration with Kyungsung University in Korea. It is a collection of 41,673 name record instances extracted from 37,613 DBLP-indexed publications. A total of 6,921 unique authors were identified by manual disambiguation using web query results from Google.

7. MEDLINE: https://github.com/amorgani/AND

Dina Vishnyakova, Raul Rodriguez-Esteban, Fabio Rinaldi, A new approach and gold standard toward author disambiguation in MEDLINE, Journal of the American Medical Informatics Association.

8. zbMATH: https://zenodo.org/record/161333#.Xmn9B6gzY2w

This data set contains disambiguated publication data from zbMATH (www.zbmath.org) for use in author name disambiguation (AND). It covers 28321 publications with 33810 authorship records, authored by 2946 distinct authors. Authorship records have been manually annotated with author identifiers.

Mark-Christoph Müller, Florian Reitz, and Nicolas Roy (2017): "Data Sets for Author Name Disambiguation: An Empirical Analysis and a New Resource", Scientometrics.