Documentation on how to access and use the Entity TTS synthetic dataset.

MIT License

Entity Synthetic Dataset

The Entity Synthetic Dataset is a multi-speaker, multi-locale (en-*) synthetic TTS dataset of entities collected from NELL [1] and YAGO [2]. It was built for the paper "Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems".

Please use the dataset for research or non-commercial purposes only.

Get the data

The dataset is available on both OneDrive and BaiduCloud, with the scripts in txt files and the synthetic audio in zip files. Choose whichever source is more convenient for you.
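Once the files are downloaded, they can be unpacked and read with the Python standard library. The following is a minimal sketch, not an official loader: the file names in the usage comments are hypothetical, and the internal layout of the txt scripts is not described here, so the helper simply returns their non-empty lines.

```python
import zipfile
from pathlib import Path


def extract_audio(zip_path, out_dir):
    """Extract a downloaded audio archive and return the extracted file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    # Directory entries in a zip end with "/"; keep only regular files.
    return [out_dir / name for name in names if not name.endswith("/")]


def read_script_lines(txt_path):
    """Return the non-empty lines of a script file, stripped of whitespace."""
    with open(txt_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical usage (these file names are assumptions, not the real ones):
#   audio_files = extract_audio("entity_audio.zip", "data/audio")
#   script_lines = read_script_lines("entity_scripts.txt")
```

Inspect a few lines of a script file and a few extracted file names first to see how scripts map to audio files before building a training pipeline on top of them.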

Changes

August 2022: Updated the entity synthetic dataset and examples.

References

[1] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, “Never-ending learning,” in Proc. AAAI, 2015.

[2] T. P. Tanon, G. Weikum, and F. Suchanek, “YAGO 4: A reason-able knowledge base,” in Extended Semantic Web Conference, 2020.