Deciphering simple substitution ciphers with multi-task training of transformers with language loss
- First clone the repository. Navigate to the destination folder and run the following in a Colab notebook:
!git clone https://github.com/philipgeorge94/s2s-decipherment-multilingual.git
- If you're not running this on Google Colab, install the dependencies from a terminal, conda environment, PyCharm venv, etc. These include, but are not limited to, the following four:
$ pip install transformers --quiet
$ pip install datasets transformers[SentencePiece] --quiet
$ pip install pyter3 --quiet
$ pip install torchmetrics --quiet
- Open baseline_language_loss.ipynb and run the 4 cells in the 'Dependencies' section
- Other instructions are provided in the Notebook comments in each cell, but in brief:
- Experiment settings (cipher length, model/task type, and space-encoding scheme) must be set in the initial cell
- Train, val, and test data must either be generated afresh or loaded from ./master_data/, depending on whether the particular experiment has been run before
- The remaining cells can be run one after another
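The generate-or-load step above follows a standard cache pattern. Below is a minimal sketch of that idea; the function names, file format, and the way substitution keys are drawn are illustrative assumptions, not the notebook's actual code:

```python
import os
import pickle
import random
import string


def make_cipher_pairs(texts):
    """Encipher each plaintext with a freshly drawn random substitution key.

    Characters outside a-z (e.g. spaces) are passed through unchanged;
    how spaces are encoded is one of the experiment settings.
    """
    pairs = []
    for text in texts:
        # A substitution key is a random permutation of the alphabet.
        key = dict(zip(string.ascii_lowercase,
                       random.sample(string.ascii_lowercase, 26)))
        cipher = "".join(key.get(ch, ch) for ch in text.lower())
        pairs.append((cipher, text.lower()))
    return pairs


def load_or_generate(cache_path, texts):
    """Load cached cipher/plaintext pairs, or generate and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    pairs = make_cipher_pairs(texts)
    with open(cache_path, "wb") as f:
        pickle.dump(pairs, f)
    return pairs
```

On a second run with the same settings, `load_or_generate` returns the cached pairs instead of re-enciphering, which is why previously run experiments can skip data generation.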
Note: Other files may be present in the folders, but those are not strictly necessary for the project
- baseline_language_loss.ipynb contains the main code for experiments and should be the entry point for the user
- preprocessing.ipynb contains code we used for preprocessing the raw corpora
- DataAnalysis.ipynb contains some code we used to analyse the data
- Contains all the .py files used in the main notebook
- data.py contains the CipherDataset() object
- data_utils.py contains util functions for loading and preparing input data
- models.py defines the Deciphormer() transformer model
- preprocess.py contains code used for preprocessing the corpora
- train_test.py contains the main train and validation functions
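For orientation, a map-style dataset wrapper like CipherDataset() typically only needs `__len__` and `__getitem__` to work with a PyTorch DataLoader. The sketch below is a hypothetical illustration of that interface, not the repository's actual implementation (the vocabulary and encoding scheme here are assumptions):

```python
class CipherDataset:
    """Minimal map-style dataset of (ciphertext, plaintext) pairs.

    A torch.utils.data.DataLoader only requires __len__ and __getitem__
    on a map-style dataset, so this plain-Python class is compatible.
    """

    def __init__(self, pairs, char2idx):
        self.pairs = pairs          # list of (ciphertext, plaintext) strings
        self.char2idx = char2idx    # vocabulary: character -> integer id

    def encode(self, text):
        """Map a string to a list of integer token ids."""
        return [self.char2idx[ch] for ch in text]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        cipher, plain = self.pairs[idx]
        return self.encode(cipher), self.encode(plain)
```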
- Contains the 12 cached data files (2 train/test splits × 2 cipher lengths × 3 space-encoding schemes = 12)
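The file count is just the Cartesian product of the three experiment settings. A quick sketch with itertools.product (the split names, lengths, and scheme names below are illustrative placeholders, not the actual cache file names):

```python
from itertools import product

splits = ["split_a", "split_b"]                    # 2 train/test splits (illustrative)
cipher_lengths = [128, 256]                        # 2 cipher lengths (illustrative)
space_schemes = ["keep", "remove", "underscore"]   # 3 space-encoding schemes (illustrative)

# One cached file per combination of settings: 2 * 2 * 3 = 12
cache_files = [f"{s}_{n}_{sp}.pkl"
               for s, n, sp in product(splits, cipher_lengths, space_schemes)]
```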