This repository contains the experiments for validating compressibility-based language complexity metrics on indigenous South American languages. We use parallel data from the Bible and divide our analysis into two subsets: the languages for which we have at least 90% of the Bible verses (d90) and the set of verses common to all languages (dall).
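The two subsets can be illustrated with a minimal sketch. It is not the code used in this repository: the function name `split_subsets`, the toy verse identifiers, and the choice to measure coverage against a reference verse inventory are all assumptions made for illustration.

```python
def split_subsets(verses_by_language, all_verses, threshold=0.9):
    """Return (d90_languages, dall_verses) for a toy parallel corpus.

    verses_by_language: dict mapping a language name to the set of verse
        identifiers available for that language (hypothetical layout).
    all_verses: the reference set of verse identifiers (assumed here to be
        the full verse inventory against which coverage is measured).
    """
    total = len(all_verses)
    # d90: languages that cover at least `threshold` of the reference verses.
    d90 = sorted(
        lang
        for lang, verses in verses_by_language.items()
        if len(verses & all_verses) / total >= threshold
    )
    # dall: verses that are present in every language.
    dall = set(all_verses)
    for verses in verses_by_language.values():
        dall &= verses
    return d90, dall


# Toy example with ten made-up verse identifiers.
all_verses = {f"v{i}" for i in range(1, 11)}
verses_by_language = {
    "lang_a": set(all_verses),                  # 10 of 10 verses
    "lang_b": all_verses - {"v10"},             # 9 of 10 verses
    "lang_c": {f"v{i}" for i in range(1, 6)},   # 5 of 10 verses
}
d90, dall = split_subsets(verses_by_language, all_verses)
```

In this toy case, d90 keeps `lang_a` and `lang_b`, while dall keeps only the five verses that all three languages share.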
Here you will find the metric implementations, the complexity values obtained by computing the metrics, and the notebooks for validating the metrics from those complexity values.
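For readers new to the idea, a compressibility-based measure can be sketched as the ratio between the compressed and the uncompressed size of a text. The snippet below is only a generic illustration using zlib from the Python standard library. It is not one of the metrics implemented in src, and the function name `compression_ratio` is hypothetical.

```python
import random
import string
import zlib


def compression_ratio(text, encoding="utf-8"):
    """Compressed size divided by raw size, in bytes (lower = more compressible)."""
    raw = text.encode(encoding)
    if not raw:
        raise ValueError("text must not be empty")
    # Level 9 asks zlib for its strongest compression setting.
    return len(zlib.compress(raw, 9)) / len(raw)


# A highly repetitive string versus pseudo-random letters of the same length.
repetitive = "abc " * 500
rng = random.Random(0)  # fixed seed so the example is reproducible
varied = "".join(rng.choices(string.ascii_lowercase, k=2000))

ratio_repetitive = compression_ratio(repetitive)
ratio_varied = compression_ratio(varied)
```

The repetitive string yields a much smaller ratio than the pseudo-random one, which is the intuition behind treating compressibility as a proxy for regularity in text.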
We are not making the source texts in the indigenous languages available, because the dataset is proprietary and we do not have permission to share it. However, you can use the code to run our experiments on your own dataset and to reproduce our analysis from the complexity values we computed on the original texts.
- notebooks contains the notebooks for processing the dataset and analyzing the metric values obtained.
- src contains the source code.
- src/experiments.py runs the experiments on the given dataset.
- requirements.txt lists the requirements needed to run the notebooks and the programs in src.
- shell.nix provides a nix-shell setup (an alternative to requirements.txt).
- results contains the metric values obtained from our experiments on the original data, which you can use to reproduce our analysis with ./notebooks/Propositions.ipynb.
- Install requirements
- Put your data in a ./dataset directory
- Run the notebook Create_Dataset.ipynb (adjust the path so that it points to your dataset)
- Run the experiments program:
python src/experiments.py filename encoding percent runs seed output
- Copy and adapt ./notebooks/Propositions.ipynb (or edit it in place) to use the results from the previous step
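If you prefer to launch the experiments program from Python, the call can be sketched as below. The helper `build_experiment_command` is hypothetical, and every argument value shown is a placeholder. Only the order of the six positional arguments comes from the usage line above. Their exact meaning and accepted formats are defined by src/experiments.py.

```python
import sys


def build_experiment_command(filename, encoding, percent, runs, seed, output):
    """Build the argument list in the documented positional order."""
    return [
        sys.executable,        # the Python interpreter currently in use
        "src/experiments.py",
        str(filename),
        str(encoding),
        str(percent),
        str(runs),
        str(seed),
        str(output),
    ]


# Placeholder values only; replace them with ones that fit your dataset.
cmd = build_experiment_command(
    filename="dataset/my_data",   # hypothetical path
    encoding="utf-8",             # hypothetical value
    percent=90,                   # hypothetical value
    runs=10,                      # hypothetical value
    seed=42,                      # hypothetical value
    output="my_results",          # hypothetical path
)
print(" ".join(cmd))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.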