smc/corpus

Malayalam Corpus by Swathanthra Malayalam Computing

This is a collection of Malayalam content gathered from various sources, then curated and processed for general-purpose use.

Contents (as of March 4, 2019)

The text corpus contains running text from various freely licensed sources.

The whole content of Malayalam Wikipedia extracted on January 1, 2019
News and articles from various sources; the source is mentioned in the respective files:
251 MB
8,60,159 lines
98,15,533 words
10,11,11,885 characters
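
Figures like the ones above can be recomputed with a short script. The following is a minimal sketch, not the tooling used to produce the numbers in this README. It assumes the text files are UTF-8 encoded and end in `.txt`, that words are whitespace-separated tokens, and that characters are Unicode code points; the function names `corpus_stats` and `directory_stats` are hypothetical. A different definition of "word" or "character" will give different totals.

```python
from pathlib import Path


def corpus_stats(text):
    """Return (lines, words, characters) for a string of running text.

    Assumptions: words are whitespace-separated tokens and characters
    are Unicode code points (so a Malayalam conjunct counts as several).
    """
    return len(text.splitlines()), len(text.split()), len(text)


def directory_stats(root):
    """Sum the statistics over every .txt file below `root` (assumed layout)."""
    totals = (0, 0, 0)
    for path in sorted(Path(root).rglob("*.txt")):
        stats = corpus_stats(path.read_text(encoding="utf-8"))
        totals = tuple(t + s for t, s in zip(totals, stats))
    return totals


# Small in-memory sample so the sketch runs without the corpus checked out.
sample = "മലയാളം ഒരു ഭാഷയാണ്.\nകേരളം മനോഹരമാണ്.\n"
lines, words, characters = corpus_stats(sample)
print(f"{lines} lines, {words} words, {characters} characters")
```

To run it on a local checkout, call `directory_stats` with the path to the folder that holds the text files.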

The word corpus contains

Classified lexicon prepared for the Malayalam Morphology Analyser project
Unique words extracted from Malayalam Wikipedia, Wiktionary etc.
14,27,392 words
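
One way to build a unique word list from running text is sketched below. This is an illustration only and not necessarily how the word corpus here was produced. It assumes a word is a run of characters from the Malayalam Unicode block (U+0D00 to U+0D7F), optionally containing the zero-width joiner and non-joiner that appear inside some Malayalam words; the name `unique_words` is hypothetical.

```python
import re

# Assumption: a "word" is a run of Malayalam-block characters, with
# ZWNJ (U+200C) and ZWJ (U+200D) allowed inside the run.
MALAYALAM_RUN = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def unique_words(text):
    """Return the sorted set of Malayalam words found in `text`."""
    words = set()
    for match in MALAYALAM_RUN.findall(text):
        # Drop joiners left dangling at the edges of a run.
        word = match.strip("\u200C\u200D")
        if word:
            words.add(word)
    return sorted(words)


sample = "കേരളം, കേരളം! Kerala മലയാളം 2019"
for word in unique_words(sample):
    print(word)
```

Punctuation, Latin text and ASCII digits are skipped because they fall outside the matched range; Malayalam digits and signs inside the block are kept, which may or may not be wanted for a lexicon.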

Contributing

If you know of or have a text collection with a compatible license (CC BY-SA), we can add it to this collection. Just create an issue and let us know about it. We will help. We are looking for content on diverse topics.
We are also collecting person names, place names, etc. in Malayalam. You can see the existing words by browsing the words folder. If you would like to expand that collection, create an issue with details or create a merge request.

Make sure to respect the copyright of the content. We are trying to provide a corpus of freely licensed content.

Other sources

Malayalam content from the Common Crawl dataset: https://github.com/qburst/common-crawl-malayalam

License

Creative Commons Attribution-ShareAlike https://creativecommons.org/licenses/by-sa/3.0/