gregorycrane/chiron

Chiron

Chiron is a tool for aligning pre-modern and literary texts with translations in multiple languages.

Chironata (in progress)

Create an annotated dataset using LaBSE and Vecalign.
Code and data saved in chironata

Pipeline models

LaBSE, Feng et al. (2020)

For embedding sentences
Associated file: build_labse_embeds.py, which uses the Hugging Face implementation of LaBSE
Input: text file to embed. Output: 768-dimensional LaBSE embeddings written to a binary file.
LaBSE paper: Feng, F., Yang, Y., Cer, D.M., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. Annual Meeting of the Association for Computational Linguistics.

Vecalign, Thompson and Koehn (2019)

For aligning two texts embedded at the sentence level
Associated files: overlap.py, vecalign.py, score.py
Vecalign GitHub: https://github.com/thompsonb/vecalign
Vecalign paper: Thompson, B., & Koehn, P. (2019). Vecalign: Improved Sentence Alignment in Linear Time and Space. Conference on Empirical Methods in Natural Language Processing.

Pipeline steps

Build overlaps files (from vecalign) for source text and translation

File to run: overlap.py
Input: source text or translation segmented at the sentence level
Output: "concatenations of consecutive sentences" as explained on Vecalign's GitHub.
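As an illustration of what "concatenations of consecutive sentences" means, here is a minimal sketch in plain Python. It is not Vecalign's overlap.py: the function name build_overlaps and the max_size parameter are hypothetical, and the real script has its own options and edge-case handling that are not reproduced here.

```python
def build_overlaps(sentences, max_size=4):
    """Return the unique concatenations of 1..max_size consecutive sentences.

    Hypothetical helper for illustration only; Vecalign's overlap.py is the
    script actually used in the pipeline.
    """
    overlaps = set()
    for size in range(1, max_size + 1):
        # Slide a window of `size` sentences over the document.
        for start in range(len(sentences) - size + 1):
            overlaps.add(" ".join(sentences[start:start + size]))
    # Sort so the output is deterministic.
    return sorted(overlaps)


if __name__ == "__main__":
    for line in build_overlaps(["A.", "B.", "C."], max_size=2):
        print(line)
```

Each line of the result is one candidate span that the aligner could match against a span in the other language, which is why every concatenation needs its own embedding in the next step.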

Build LaBSE embeddings of the overlaps files

File to run: build_labse_embeds.py
Input: overlaps text file
Output: binary file of 768-dimensional LaBSE embeddings, with one embedding per sentence concatenation (line) in the overlaps file
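The sketch below shows one way such a binary file could be written and read back using only the standard library. It assumes the file is a headerless sequence of native-order 32-bit floats, one 768-value row per overlaps line; that layout is an assumption to check against build_labse_embeds.py and Vecalign's documentation. The function names are hypothetical.

```python
from array import array

EMBED_DIM = 768  # LaBSE embedding size stated in this README


def write_embeddings(path, vectors, dim=EMBED_DIM):
    """Write vectors as raw float32 rows (assumed layout, no header)."""
    with open(path, "wb") as f:
        for vec in vectors:
            if len(vec) != dim:
                raise ValueError(f"expected {dim} values, got {len(vec)}")
            array("f", vec).tofile(f)


def read_embeddings(path, dim=EMBED_DIM):
    """Read raw float32 rows back into a list of lists."""
    data = array("f")
    with open(path, "rb") as f:
        data.frombytes(f.read())
    if len(data) % dim != 0:
        raise ValueError("file size is not a multiple of the embedding size")
    return [data[i:i + dim].tolist() for i in range(0, len(data), dim)]
```

A quick sanity check under this assumption: the file size in bytes should equal the number of lines in the overlaps file times 768 times 4.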

Align source text and translation (from vecalign)

File to run: vecalign.py
Input: source text and translation, their overlaps files, and the corresponding LaBSE embeddings
Output: sentence alignments written to stdout. For a detailed description of the results' format, see Vecalign's GitHub.
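For downstream processing, the stdout output can be parsed into Python structures. The sketch below assumes each line has the form `[source indices]:[target indices]:cost`, for example `[6]:[6, 7, 8]:0.507506`, as shown in Vecalign's README; confirm the format there before relying on it. The function names are hypothetical.

```python
import ast


def parse_alignment_line(line):
    """Parse one assumed '[src]:[tgt]:cost' line into (src, tgt, cost)."""
    src_part, tgt_part, cost_part = line.strip().split(":")
    # literal_eval safely turns '[6, 7, 8]' into a list of ints.
    return ast.literal_eval(src_part), ast.literal_eval(tgt_part), float(cost_part)


def parse_alignments(text):
    """Parse every non-empty line of the aligner's output."""
    return [parse_alignment_line(line) for line in text.splitlines() if line.strip()]
```

An empty list on either side, such as `[]:[3]:0.0`, would then represent a sentence with no counterpart in the other text.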

Evaluation

Using sentence-level ground truth

File to run: score_all.py
Includes three scoring functions:
- Vecalign's original strict scores (Precision, Recall, F1). Vecalign's original lax scores are not included.
- Chiron's new lax scores (Precision, Recall, F1)
- Chiron's new strict score (Accuracy only)
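To make the idea of a strict score concrete, here is a simplified sketch of exact-match precision, recall, and F1 over alignment pairs. It is an illustration only: it does not reproduce score_all.py or Vecalign's score.py (for instance, any special handling of unaligned sentences is left out), and it does not attempt Chiron's new lax or accuracy scores, whose definitions are in the code and paper. The function name strict_scores is hypothetical.

```python
def strict_scores(predicted, gold):
    """Exact-match precision, recall and F1 over (src, tgt) alignment pairs.

    Simplified illustration: a predicted alignment counts as correct only if
    the same pair of index lists appears in the gold standard.
    """
    pred = {(tuple(src), tuple(tgt)) for src, tgt in predicted}
    ref = {(tuple(src), tuple(tgt)) for src, tgt in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [([0], [0]), ([1], [1, 2]), ([2], [3])]
    predicted = [([0], [0]), ([1], [1]), ([2], [3])]
    print(strict_scores(predicted, gold))
```

In the usage example, the second predicted alignment covers only part of the gold target span, so under exact matching it counts as an error for both precision and recall.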

Chapter-level evaluation if sentence-level ground truth is not available

Example file: score_vec_rslts_chapter_level.ipynb
Example based on aligning Thucydides' The Peloponnesian War against a French translation

Testing Chiron

Caroline Craig, Kartik Goyal, Gregory R. Crane, Farnoosh Shamsian, and David A. Smith. Testing the limits of neural sentence alignment models on classical Greek and Latin texts and translations. In Computational Humanities Research Conference (CHR), 2023.
Code and data available in align_texts_projects

Installation

To use LaBSE, see instructions on Hugging Face
To use Vecalign, see list of dependencies on Vecalign's GitHub