Pinned Repositories
bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims to detect noisy sentence pairs in a parallel corpus.
bicleaner-ai
Bicleaner fork that uses neural networks
bifixer
Tool to fix bitexts and tag near-duplicates for removal
biroamer
Utility that helps you ROAM (Random, Omit, Anonymize and Mix) your parallel corpus.
bitextor
Bitextor generates translation memories from multilingual websites
bleualign-cpp
monocleaner
neural-document-aligner
Document aligner which uses neural technologies to search matches across bilingual documents
pdf-extract
PDF parser and converter to HTML
warc2text
Extracts plain text, language identification and other metadata from WARC records
Bitextor Team's Repositories
bitextor/bitextor
Bitextor generates translation memories from multilingual websites
bitextor/bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims to detect noisy sentence pairs in a parallel corpus.
bitextor/pdf-extract
PDF parser and converter to HTML
bitextor/bicleaner-ai
Bicleaner fork that uses neural networks
bitextor/bifixer
Tool to fix bitexts and tag near-duplicates for removal
bitextor/warc2text
Extracts plain text, language identification and other metadata from WARC records
bitextor/biroamer
Utility that helps you ROAM (Random, Omit, Anonymize and Mix) your parallel corpus.
bitextor/bleualign-cpp
bitextor/monocleaner
bitextor/neural-document-aligner
Document aligner which uses neural technologies to search matches across bilingual documents
bitextor/bicleaner-data
Repository for data models, dictionaries and other resources for Bicleaner
bitextor/monotextor
bitextor/python-pdfextract
Python interface to pdf-extract for HTML extraction from PDF
bitextor/bicleaner-ai-data
Repository of Bicleaner AI models
bitextor/bitextor-data
Repository for data models, dictionaries and other resources for Bitextor
bitextor/bicleaner-hardrules
Pre-filtering step for bicleaner
bitextor/bitextor-neural
Bitextor Neural generates translation memories from multilingual websites using state-of-the-art Machine Learning tools
bitextor/prevertical2text
Extracts plain text, language identification and other metadata from Spiderling prevertical files
bitextor/vecalign
Improved Sentence Alignment in Linear Time and Space
bitextor/loomchild-segment-py
Python module to interface with Java Loomchild sentence segmenter
bitextor/monocleaner-data
Monocleaner models repository
bitextor/bicleaner-ai-glove
Fork of glove-python to distribute binary builds
bitextor/bitextor-testing-output
Repository for storing testing outputs from Bitextor
bitextor/cld2
Compact Language Detector 2
bitextor/deferred-crawling
Reconstructs sentences using deferred crawling standoff annotations from Bitextor
bitextor/fastText
Library for fast text representation and classification.
bitextor/hunalign
Sentence aligner
bitextor/python-apachetika
Python interface to Apache Tika for HTML extraction from PDF
bitextor/scrawl
Playwright-based web crawler