Hindi-Aesthetics-Corpus

This corpus consists of novels and short stories written in Hindi. We scraped the texts from three sources: http://hindisamay.com, an e-library maintained by Mahatma Gandhi Antarrashtriya Hindi Vishwa Vidyalaya (Mahatma Gandhi International Hindi University), Wardha; http://premchand.co.in, a website dedicated to the stories of the popular novelist Premchand; and the Bhandarkar Oriental Research Institute’s Digital Library (http://borilib.com). As a preprocessing step, we split the text into sentences and deleted special characters, English tokens and Latin numbers.
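The preprocessing described above could be approximated as follows. This is a minimal sketch, not the script used to build the corpus: the function name `preprocess`, the choice of sentence delimiters (danda, double danda, `?`, `!`), and the rule of keeping only characters from the Devanagari Unicode block are all assumptions made for illustration.

```python
import re

# Assumption: a sentence ends at a danda (।), double danda (॥), '?' or '!'.
SENTENCE_END = re.compile(r"[।॥?!]+")

# Assumption: anything outside the Devanagari block (U+0900-U+097F), the
# zero-width (non-)joiners and whitespace counts as a special character,
# an English token or a Latin digit, and is dropped.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\u200C\u200D\s]+")


def preprocess(text):
    """Split raw text into sentences and keep only Devanagari content."""
    sentences = []
    for raw in SENTENCE_END.split(text):
        cleaned = NON_DEVANAGARI.sub(" ", raw)   # drop Latin letters, digits, symbols
        cleaned = " ".join(cleaned.split())      # collapse leftover whitespace
        if cleaned:                              # skip sentences that became empty
            sentences.append(cleaned)
    return sentences


sample = "यह 1 किताब है। Hello world! क्या आप ठीक हैं?"
result = preprocess(sample)
```

Under these assumptions, a sentence made up entirely of English tokens disappears, and Devanagari digits are kept because only Latin numbers are mentioned as removed.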

Metadata:

Unique word count: 145,508

Unique lemma count: 118,266
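For readers who want to recompute a figure like the unique word count, one possible approach is sketched below. It assumes that the corpus is available as a list of cleaned sentences and that words are separated by whitespace. Neither the tokenization used for the reported counts nor the lemmatizer behind the lemma count is specified here, so the hypothetical helper `unique_word_count` may not reproduce the numbers above exactly.

```python
def unique_word_count(sentences):
    """Count distinct whitespace-separated tokens across all sentences."""
    vocabulary = set()
    for sentence in sentences:
        vocabulary.update(sentence.split())
    return len(vocabulary)


example_count = unique_word_count(["यह किताब है", "यह कलम है"])
```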

978 articles, including novels, short stories and non-fictional texts, were collated from the sources mentioned above. Metadata could not be found for 164 of these articles. The majority of the works are associated with authors whose native state is Uttar Pradesh, a state in northern India.

State-wise distribution of authors

To cite this work: Wairagade-Venugopal, G., Saini, J. R., & P., Dhanya (2020). Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List. International Journal of Advanced Computer Science and Applications, 11(1).