
Cholloadai corpus and model details

GNU General Public License v3.0 (GPL-3.0)

சொல்லோடை-2021 (Cholloadai-2021)

Cholloadai-2021.txt

cholloadai-2021.txt is available on archive.org. The corpus contains more than 72 million lines of Tamil phrases. To ease download and processing on low-power computers, the corpus is split into 72 files of one million lines each.
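Because the corpus ships as separate files, it can be processed one file at a time without holding everything in memory. The following is a minimal sketch of that idea; the directory layout, the `*.txt` pattern, and the function names `iter_corpus_lines` and `count_lines` are assumptions made for illustration, not part of the published corpus tooling.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_corpus_lines(shard_paths: Iterable[Path]) -> Iterator[str]:
    """Yield stripped, non-empty lines from each file in turn."""
    for path in shard_paths:
        # Only one file is open at a time, so memory use does not grow
        # with the total size of the corpus.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def count_lines(shard_dir: str, pattern: str = "*.txt") -> int:
    """Count non-empty lines across all files matching `pattern`.

    The directory and pattern are hypothetical; adjust them to wherever
    the downloaded files actually live.
    """
    shards = sorted(Path(shard_dir).glob(pattern))
    return sum(1 for _ in iter_corpus_lines(shards))
```

Since `iter_corpus_lines` is a generator, downstream steps such as tokenization or vocabulary counting can consume it lazily in the same streaming fashion.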

Model

Tamil language model - skipgram: the cholloadai branch can handle larger datasets. It uses the PyTorch DataLoader utilities and reads data directly from disk through mmap instead of loading the whole dataset into main memory (RAM).
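The mmap-based loading described above can be illustrated with a small, standard-library-only sketch. It is not the code from the cholloadai branch: the class name `MmapLineDataset` and its line-per-sample layout are assumptions. The class exposes `__len__` and `__getitem__`, which is the interface a map-style PyTorch dataset needs, so it could be adapted to subclass `torch.utils.data.Dataset` and be passed to a `DataLoader`.

```python
import mmap


class MmapLineDataset:
    """Random access to the lines of a text file via mmap (illustrative only).

    Only the byte offsets of line starts are kept in memory; the text
    itself is paged in from disk by the operating system on demand.
    Note: mmap cannot map an empty file, so `path` must be non-empty.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Record the byte offset at which each line begins.
        self._offsets = [0]
        pos = self._mm.find(b"\n")
        while pos != -1:
            self._offsets.append(pos + 1)
            pos = self._mm.find(b"\n", pos + 1)
        # A trailing newline would otherwise create a phantom empty last line.
        if self._offsets[-1] >= len(self._mm):
            self._offsets.pop()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        if not 0 <= index < len(self._offsets):
            raise IndexError(index)
        start = self._offsets[index]
        end = self._mm.find(b"\n", start)
        if end == -1:  # last line without a trailing newline
            end = len(self._mm)
        return self._mm[start:end].decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()
```

The offset index costs one integer per line, which stays small for a file of a million lines, while the line contents never need to fit in RAM all at once.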

Scraping pipeline

Text Preprocessing