This is a collection of Malayalam content gathered from various sources, then curated and processed for general-purpose use.
The text corpus contains running text from various freely licensed sources:
- The whole content of Malayalam Wikipedia extracted on January 1, 2019
- News and articles from various sources; the source is mentioned in the respective files
- Total size: 251 MB
- 8,60,159 lines
- 98,15,533 words
- 10,11,11,885 characters
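
The figures above use Indian digit grouping (8,60,159 is 860,159). As a rough sketch, counts like these could be reproduced with a few lines of Python. How the original statistics were computed is not stated, so the whitespace-based word split and the code-point character count below are assumptions, and `corpus_stats` and `indian_grouping` are hypothetical helper names.

```python
def corpus_stats(lines):
    """Return (lines, words, characters) for an iterable of text lines.

    Assumptions: words are whitespace-separated tokens, and characters
    are Unicode code points excluding the trailing newline.
    """
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line.rstrip("\n"))
    return n_lines, n_words, n_chars


def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    # After the last three digits, group the rest in pairs from the right.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


# Usage with a file (the path is a placeholder, not a file in this corpus):
# with open("corpus.txt", encoding="utf-8") as f:
#     print([indian_grouping(x) for x in corpus_stats(f)])
```

Counting code points means a Malayalam conjunct or a consonant with a vowel sign counts as several characters, so results may differ from a grapheme-based count.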
The word corpus contains:
- Classified lexicon prepared for the Malayalam Morphology Analyser project
- Unique words extracted from Malayalam Wikipedia, Wiktionary, etc.
- 14,27,392 words
- If you know of or have a text collection with a compatible license (CC BY-SA), we can add it to this collection. Just create an issue to let us know, and we will help. We are looking for content on diverse topics.
- We are also collecting person names, place names, etc. in Malayalam. You can see the existing words by browsing the words folder. If you would like to expand that collection, create an issue with details or create a merge request.
Make sure to respect the copyright of the content. We are trying to provide a corpus of freely licensed content.
- Related project: Malayalam content from the Common Crawl dataset: https://github.com/qburst/common-crawl-malayalam
License: Creative Commons Attribution-ShareAlike 3.0, https://creativecommons.org/licenses/by-sa/3.0/