hlp-ai/mt-data

Python, Apache-2.0

mt-data

Massively crawling and extracting bitext from the Web

How to use

Given a CommonCrawl archive ID, get the WET file list.
Download each WET file, unzip it, extract metadata, and dump language info for each page in the WET file.
Calculate the length of text in each language for each domain.
Get the multilingual domains for a given set of languages.
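
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the scripts shipped in this repository: the function names (`wet_paths_url`, `domain_lang_lengths`, `multilingual_domains`), the `min_len` threshold, the use of the URL hostname as the "domain", and the sample pages are all assumptions. The WET path listing URL follows Common Crawl's public `crawl-data/<archive ID>/wet.paths.gz` layout.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def wet_paths_url(archive_id):
    """Step 1: URL of the gzipped WET file list for an archive ID (hypothetical helper)."""
    return f"https://data.commoncrawl.org/crawl-data/{archive_id}/wet.paths.gz"


def domain_lang_lengths(pages):
    """Step 3: sum text length per language for each domain.

    `pages` is an iterable of (url, language, text_length) tuples, i.e. the
    per-page language info assumed to have been dumped in step 2.
    The hostname is used as the domain here, which is an assumption.
    """
    stats = defaultdict(lambda: defaultdict(int))
    for url, lang, text_len in pages:
        host = urlsplit(url).hostname
        if host:
            stats[host][lang] += text_len
    return stats


def multilingual_domains(stats, langs, min_len=1):
    """Step 4: domains with at least `min_len` characters in every requested language."""
    return sorted(
        domain
        for domain, by_lang in stats.items()
        if all(by_lang.get(lang, 0) >= min_len for lang in langs)
    )


# Made-up per-page records standing in for the output of step 2.
pages = [
    ("https://example.com/en/about", "eng", 1200),
    ("https://example.com/zh/about", "zho", 900),
    ("https://example.org/index.html", "eng", 500),
]
stats = domain_lang_lengths(pages)
print(multilingual_domains(stats, ["eng", "zho"]))  # prints ['example.com']
```

Raising `min_len` is one simple way to drop domains that carry only a few stray pages in one of the languages; the right threshold depends on the crawl and is not specified here.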