
MT Data

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

mt-data

Massively crawling and extracting bitext from the Web.

How to use

  1. Given a CommonCrawl archive ID, get the WET file list.
  2. Download each WET file, unzip it, extract metadata, and dump language info for each page in the WET files.
  3. Calculate the total text length per language for each domain.
  4. Get the multilingual domains for a given set of languages.
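
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: all function names are hypothetical, "domain" is assumed to mean the URL hostname, and the per-page language label and text length are assumed to come from whatever step 2 dumps. The only external fact relied on is CommonCrawl's public layout, where each crawl publishes a `wet.paths.gz` file listing relative WET paths.

```python
import gzip
from collections import defaultdict
from urllib.parse import urlsplit

# Public CommonCrawl data endpoint; paths in wet.paths.gz are relative to it.
CC_BASE = "https://data.commoncrawl.org/"


def wet_paths_url(crawl_id):
    """Step 1: URL of the gzipped WET file list for a crawl ID such as 'CC-MAIN-2023-50'."""
    return f"{CC_BASE}crawl-data/{crawl_id}/wet.paths.gz"


def parse_wet_paths(gz_bytes):
    """Step 1 (cont.): turn downloaded wet.paths.gz bytes into full WET file URLs."""
    lines = gzip.decompress(gz_bytes).decode("utf-8").splitlines()
    return [CC_BASE + line.strip() for line in lines if line.strip()]


def domain_language_lengths(pages):
    """Step 3: sum text length per (domain, language).

    `pages` is an iterable of (url, lang, text_length) tuples, assumed to be
    the per-page output of step 2. The hostname stands in for the domain here.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for url, lang, length in pages:
        domain = urlsplit(url).hostname
        if domain:  # skip records whose URL has no hostname
            totals[domain][lang] += length
    return totals


def multilingual_domains(totals, langs, min_chars=1):
    """Step 4: domains with at least `min_chars` of text in every requested language."""
    return sorted(
        domain
        for domain, by_lang in totals.items()
        if all(by_lang.get(lang, 0) >= min_chars for lang in langs)
    )


if __name__ == "__main__":
    # Toy records standing in for the language info dumped in step 2.
    pages = [
        ("https://example.com/en/", "en", 1200),
        ("https://example.com/de/", "de", 900),
        ("https://example.org/", "en", 500),
    ]
    totals = domain_language_lengths(pages)
    print(multilingual_domains(totals, ["en", "de"]))  # ['example.com']
```

The `min_chars` threshold is an assumed knob: raising it would filter out domains that contain only a token amount of text in one of the languages, which are unlikely to yield useful bitext.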