Pinned Repositories
cirrus-scripts
Scripts for running bitextor/paracrawl/europat jobs on cirrus.ac.uk
corset
Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data.
DataCollection
Data collection, alignment and TAUS repository
embedding
Mine parallel corpora with embeddings
europat-scripts
Scripts for obtaining patent data
extractor
human-evaluations
Results of the human evaluation
keops
Tool for manual evaluation of parallel sentences.
synthesis
Data synthesis by contextualizing glossary translations
tmxutil
Tools to generate & filter Europat tmx files.
ParaCrawl's Repositories
paracrawl/extractor
paracrawl/corset
Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data.
paracrawl/keops
Tool for manual evaluation of parallel sentences.
paracrawl/DataCollection
Data collection, alignment and TAUS repository
paracrawl/cirrus-scripts
Scripts for running bitextor/paracrawl/europat jobs on cirrus.ac.uk
paracrawl/synthesis
Data synthesis by contextualizing glossary translations
paracrawl/human-evaluations
Results of the human evaluation
paracrawl/embedding
Mine parallel corpora with embeddings
paracrawl/europat-scripts
Scripts for obtaining patent data
paracrawl/tmxutil
Tools to generate & filter Europat tmx files.
paracrawl/giashard
Sharding program for Paracrawl
paracrawl/opus-train
Automate download and training with OPUS corpora
paracrawl/b64filter
Program for operating on files that hold one base64-encoded document per line
paracrawl/Domain_Adaptation
InDomain detection is a tool designed to extract in-domain data from large collections of data.
paracrawl/giawarc
Processing utilities for Internet Archive
paracrawl/targeted-crawling
paracrawl/corpus-issues
Open any Paracrawl corpus-related issue here
paracrawl/go-warc
A Go library for working with WARC files from Common Crawl
paracrawl/multilingual-ted
paracrawl/url_language_analysis