Chiron is a tool for aligning pre-modern and literary texts with translations in multiple languages.
- Create an annotated dataset using LaBSE and Vecalign.
- Code and data saved in chironata
- LaBSE, Feng et al. (2020)
- For embedding sentences
- Associated file: build_labse_embeds.py, using Hugging Face implementation
- Input: text file to embed. Output: 768-dimensional LaBSE embeddings in a binary file.
- LaBSE paper: Feng, F., Yang, Y., Cer, D.M., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. Annual Meeting of the Association for Computational Linguistics.
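The embedding step can be sketched roughly as follows. This is a minimal illustration, not the contents of build_labse_embeds.py. It assumes the `sentence-transformers` package and the `sentence-transformers/LaBSE` checkpoint on Hugging Face, and it assumes the binary output is a headerless run of float32 values. The function names `read_lines` and `embed_to_file` are hypothetical.

```python
def read_lines(path):
    """Return the lines of a UTF-8 text file (one text unit per line), newlines removed."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def embed_to_file(text_path, out_path, model_name="sentence-transformers/LaBSE"):
    """Embed every line of text_path and write the vectors to out_path as raw float32.

    Assumption: the model name and the raw float32 layout are illustrative choices,
    not confirmed details of build_labse_embeds.py.
    """
    # Imported here so read_lines() stays usable without the large dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    lines = read_lines(text_path)
    # encode() returns a NumPy array with one row per input line.
    embeddings = model.encode(lines, convert_to_numpy=True)
    embeddings.astype("float32").tofile(out_path)
    return embeddings.shape
```

Calling `embed_to_file("source.txt", "source.emb")` (hypothetical file names) downloads the model on first use and should return a shape whose second value is 768.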
- Vecalign, Thompson (2019)
- For aligning two texts embedded at the sentence level
- Associated files: overlap.py, vecalign.py, score.py
- Vecalign GitHub: https://github.com/thompsonb/vecalign
- Vecalign paper: Thompson, B. (2019). Vecalign: Improved Sentence Alignment in Linear Time and Space. Conference on Empirical Methods in Natural Language Processing.
- Build overlaps files (from vecalign) for source text and translation
- File to run: overlap.py
- Input: source text or translation segmented at the sentence level
- Output: "concatenations of consecutive sentences" as explained on Vecalign's GitHub.
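To make "concatenations of consecutive sentences" concrete, here is a simplified sketch of the idea. It is not Vecalign's overlap.py, which may handle details such as padding and very long lines differently. The function name `make_overlaps` and its `num_overlaps` parameter are hypothetical.

```python
def make_overlaps(sentences, num_overlaps=4):
    """Return every concatenation of 1..num_overlaps consecutive sentences, deduplicated."""
    seen = {}  # a dict keeps insertion order, so the output order is deterministic
    for size in range(1, num_overlaps + 1):
        for start in range(len(sentences) - size + 1):
            seen[" ".join(sentences[start:start + size])] = None
    return list(seen)


sentences = ["First sentence.", "Second sentence.", "Third sentence."]
for line in make_overlaps(sentences, num_overlaps=2):
    print(line)
```

With three sentences and `num_overlaps=2`, this prints five lines: the three sentences, then the two adjacent pairs.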
- Build LaBSE embeddings of the overlaps files
- File to run: build_labse_embeds.py
- Input: overlaps text file
- Output: 768-dimensional LaBSE embeddings in a binary file, with one embedding per sentence concatenation in the overlaps file
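A quick sanity check on the embeddings file could look like the sketch below. It assumes the binary file is a headerless sequence of float32 values, which should be verified against build_labse_embeds.py. The helper names are hypothetical, and only the standard library is used.

```python
from array import array


def read_embeddings(path, dim=768):
    """Read a headerless float32 file and return a list of dim-sized vectors."""
    values = array("f")  # "f" is a 4-byte C float on common platforms
    with open(path, "rb") as handle:
        values.frombytes(handle.read())
    if len(values) % dim != 0:
        raise ValueError(f"{len(values)} floats is not a multiple of {dim}")
    return [values[i:i + dim].tolist() for i in range(0, len(values), dim)]


def check_one_embedding_per_line(overlaps_path, embeddings_path, dim=768):
    """Return True when the embedding count equals the line count of the overlaps file."""
    with open(overlaps_path, encoding="utf-8") as handle:
        n_lines = sum(1 for _ in handle)
    return len(read_embeddings(embeddings_path, dim)) == n_lines
```

If `check_one_embedding_per_line` returns False, the overlaps file and the embeddings file are probably out of sync and should be rebuilt together.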
- Align source text and translation (from vecalign)
- File to run: vecalign.py
- Input: LaBSE embeddings
- Output: sentence alignments written to stdout. For a detailed description of the results' format, see Vecalign's GitHub.
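Once the stdout of the alignment run has been captured, it can be parsed for further processing. The sketch below assumes each line has the form `[source indices]:[target indices]:cost`, as in the examples on Vecalign's GitHub. The function names are hypothetical and the sample lines are made up, not real results.

```python
import ast


def parse_alignment_line(line):
    """Parse one line such as '[8]:[9, 10]:0.25' into (src_ids, tgt_ids, cost)."""
    src_text, tgt_text, cost_text = line.strip().split(":")
    return ast.literal_eval(src_text), ast.literal_eval(tgt_text), float(cost_text)


def parse_alignments(text):
    """Parse captured alignment output, skipping blank lines."""
    return [parse_alignment_line(line) for line in text.splitlines() if line.strip()]


# Made-up sample output for illustration only.
sample = "[0]:[0]:0.156\n[1]:[1, 2]:0.310\n[]:[3]:0.000\n"
for src_ids, tgt_ids, cost in parse_alignments(sample):
    print(src_ids, tgt_ids, cost)
```

An empty list on one side, as in the last sample line, marks a sentence with no counterpart in the other text.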
- Score the alignments
- File to run: score_all.py
- Includes three scoring functions:
- Vecalign's original strict scores (Precision, Recall, F1); Vecalign's original lax scores are not included.
- Chiron's new lax scores (Precision, Recall, F1)
- Chiron's new strict score (Accuracy only)
- Example file: score_vec_rslts_chapter_level.ipynb
- Example based on aligning Thucydides' The Peloponnesian War against a French translation
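As a rough illustration of what a strict score measures, the sketch below counts a predicted alignment as correct only when it matches a gold alignment exactly. It is a generic exact-match computation written for this README, not a port of score_all.py, and it does not reproduce Chiron's lax scores or its accuracy score. The function name `strict_scores` is hypothetical.

```python
def strict_scores(gold, predicted):
    """Precision, recall and F1 where a predicted alignment counts only on exact match."""
    # Normalize to hashable pairs and drop alignments that are empty on both sides.
    gold_set = {(tuple(s), tuple(t)) for s, t in gold if s or t}
    pred_set = {(tuple(s), tuple(t)) for s, t in predicted if s or t}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: each alignment is (source sentence indices, target sentence indices).
gold = [([0], [0]), ([1], [1, 2])]
predicted = [([0], [0]), ([1], [1]), ([2], [2])]
precision, recall, f1 = strict_scores(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case only `([0], [0])` matches exactly, so one of three predictions is correct and one of two gold alignments is recovered.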
- Caroline Craig, Kartik Goyal, Gregory R. Crane, Farnoosh Shamsian, and David A. Smith. Testing the limits of neural sentence alignment models on classical Greek and Latin texts and translations. In Computational Humanities Research Conference (CHR), 2023.
- Code and data available in align_texts_projects
- To use LaBSE, see instructions on Hugging Face
- To use Vecalign, see list of dependencies on Vecalign's GitHub