HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training
Pinned Repositories
data-analytics-tool
Data Analytics Tool
HPLT-MT-Models
This repository contains the configuration and scripts for HPLT MT model releases.
ia-download
Internet Archive downloader
monolingual-multilingual-instruction-tuning
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
monotextor-slurm
Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
OpusPocus
Marian machine translation training pipeline for thousands of models
OpusTrainer
Curriculum training
sacremoses
Python port of Moses tokenizer, truecaser and normalizer
warc2text-runner
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
HPLT - High Performance Language Technologies' Repositories
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
hplt-project/OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
hplt-project/OpusTrainer
Curriculum training
hplt-project/data-analytics-tool
Data Analytics Tool
hplt-project/monolingual-multilingual-instruction-tuning
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
hplt-project/HPLT-MT-Models
This repository contains the configuration and scripts for HPLT MT model releases.
hplt-project/warc2text-runner
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
hplt-project/ia-download
Internet Archive downloader
hplt-project/monotextor-slurm
Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
hplt-project/OpusPocus
Marian machine translation training pipeline for thousands of models
hplt-project/bitextor-mt-models
hplt-project/HPLT-WP4
Information and pipelines on WP4: language models training
hplt-project/MT-winterschool-2023
hplt-project/release2_inspection
hplt-project/document-aligner
TF-IDF-based document aligner from Bitextor
hplt-project/bitextor-slurm
Scripts for running bitextor jobs
hplt-project/cc-download
hplt-project/clianer
A lightweight command-line frontend to OpusCleaner
hplt-project/OPUS-MT-dashboard
hplt-project/OpusFilter
OpusFilter - Parallel corpus processing toolkit
hplt-project/paracrawl-dashboard
Makeshift interface for managing Paracrawl processing and exploring its outputs