This repository provides a variety of regular languages of varying types as a benchmark for machine learning models and to help better understand which kinds of sequential patterns neural networks in particular are able to learn successfully, and under what conditions. Some motivation is provided in the Avcu et al. (2017) paper "Subregular Complexity and Deep Learning". The regular languages themselves are based on the Subregular Hierarchies of languages (see Rogers and Pullum 2011 and Rogers et al. 2013).
- Python >= 3.6
- Tensorflow >= 2.4.0
- Pynini == 2.1.2
The workflow of this codebase was tested on Linux systems using Conda, available through the Anaconda toolkit here. Miniconda will also work. The entirety of the workflow should be carried out in a Conda environment (explained below).
First, install Conda. Next, download this repository with the green 'Code' button or via git:

```
git clone https://github.com/heinz-jeffrey/subregular-learning.git
```

Create a new environment `<environment_name>` (call it `subreg`, for example) as shown below, and activate it. The environment can be closed when you're finished with `conda deactivate`.

```
conda create -n <environment_name>
conda activate <environment_name>
```
`cd` to the root of this project's directory, and install the dependencies in your Conda environment:

```
pip install -r requirements.txt
```
Finally, install `pynini` via conda-forge (only version 2.1.2 has been tested with this workflow) in your Conda environment:

```
conda install -c conda-forge pynini=2.1.2
```
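To confirm that the environment is set up correctly, a quick import check such as the following should run without errors (assuming `pynini` exposes a `__version__` attribute, as recent releases do):

```python
# Sanity check: run inside the activated Conda environment.
import pynini
import tensorflow as tf

print(pynini.__version__)  # expected: 2.1.2
print(tf.__version__)      # expected: >= 2.4.0
```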
Data for each language (sets of strings) is generated according to FSAs (in the form of `.fst` files) of the given subregular languages. The process of data generation is outlined in the three steps below.

Provide a file encoding the possible transitions of a given subregular language in the form of an `.att` file (example below). One file per language should be placed in the `/src/fstlib/att_format` directory.
```
0 0 b b
0 0 c c
0 0 d d
0 1 a a
1 0 b b
1 0 c c
1 0 d d
0
1
```
In the AT&T format used above, each four-column line encodes a transition (source state, destination state, input symbol, output symbol), and each trailing single-column line marks an accepting state. The `.att` files can be written by hand, but this is time-consuming and error-prone. We recommend writing automata with software such as `openfst`, currently maintained by researchers at Google, or `plebby`, which is included in The Language Toolkit by Dakotah Lambert. We have used `plebby` for specifying the acceptors and `openfst` for data generation through its Python wrapper Pynini. (See Adding new languages.)
The conversion script, `att2fst.sh`, has three dependencies: (1) the `.att` files, (2) `ins.txt` (input symbols), and (3) `outs.txt` (output symbols), all of which should go in `/src/fstlib/att_format`. The `ins.txt` and `outs.txt` files currently in the repo are set up for a maximum of 64 universe symbols and their UTF-8 encodings. Should you want to support more than 64 symbols, modify the script `/src/fstlib/att_format/make-ins-and-outs.py` and run it (the generated `ins.txt` and `outs.txt` should also be copied into `/src/subreglib`).
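The exact contents of `ins.txt` and `outs.txt` are produced by `make-ins-and-outs.py`, but as a rough, hypothetical sketch, an AT&T-style symbol table simply maps each symbol to an integer id, with `<epsilon>` conventionally at 0:

```python
# Hypothetical illustration of how a symbol table file might be written;
# the real universe in this repo allows up to 64 symbols.
symbols = ["a", "b", "c", "d"]
with open("ins.txt", "w", encoding="utf-8") as f:
    f.write("<epsilon>\t0\n")
    for i, sym in enumerate(symbols, start=1):
        f.write(f"{sym}\t{i}\n")
```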
Once all the desired languages (as `.att` files) and `ins.txt` & `outs.txt` are in `/src/fstlib/att_format`, run the `att2fst.sh` script from the `fstlib` directory:

```
./att2fst.sh
```
This will create the corresponding `.fst` files (output to `/src/fstlib/fst_format/`). These are binary files that can be processed by `openfst` and are what `pynini` uses to generate strings in a given language.
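As a rough sketch of how these binaries can be used from Python — the file name, token type, and acceptance check below are illustrative assumptions, not the repo's actual code:

```python
import pynini

# Hypothetical language tag; substitute any compiled .fst in fst_format/.
fst = pynini.Fst.read("src/fstlib/fst_format/SL.4.2.0.fst")

def accepts(string: str, acceptor: pynini.Fst) -> bool:
    # Compose the string with the acceptor; a surviving path means the
    # string is in the language. The repo's own symbol tables may require
    # a token type other than "utf8".
    lattice = pynini.compose(pynini.accep(string, token_type="utf8"), acceptor)
    return pynini.shortestpath(lattice).num_states() > 0
```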
After the `.fst` files are compiled, run `data-gen.py`, which is in `/src`:

```
python /src/data-gen.py
```
This will generate Training, Dev, Test 1, Test 2, and Test 3 sets for the languages listed in `/tags.txt` and store them in `/data_gen`.
Check whether the data was generated successfully using `check.py`. If any of the files are missing strings, a "missing" or "incomplete" message will be printed to the terminal.
In `/data_gen`, three subsets are generated: `1k`, `10k`, and `100k`. Each one contains `_Training.txt`, `_Dev.txt`, `_Test1.txt`, `_Test2.txt`, and `_Test3.txt` files for each language.
- Training data contains equal numbers of positive and negative strings between the lengths of 10 and 19.
- Dev data contains equal numbers of positive and negative strings between the lengths of 10 and 19 which are disjoint from Training.
- Test 1 data contains equal numbers of positive and negative strings between the lengths of 10 and 19 which are disjoint from both Training and Dev.
- Test 2 data contains equal numbers of positive and negative strings between the lengths of 31 and 50.
- Test 3 data contains equal numbers of positive and negative strings between the lengths of 31 and 50; in particular, each positive string x is paired with a negative string y such that the string edit distance of (x,y) is 1.
Within each dataset, equal numbers of strings of each length in the selected range are included; subject to these constraints, the strings themselves are sampled uniformly at random.
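The Test 3 pairing criterion is the standard Levenshtein (string edit) distance; for reference, a plain-Python version is sketched below.

```python
def edit_distance(x: str, y: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cy)))   # substitution
        prev = cur
    return prev[-1]

# Each positive string x in Test 3 is paired with a negative string y
# such that edit_distance(x, y) == 1.
```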
The script `check.py` in `/src` checks the `data_gen` directory for the generated data. For each file-size subset (1k, 10k, 100k), the five datasets for each language are checked to determine whether they exist and contain enough strings. For each dataset, `missing` is printed if it does not exist, and `incomplete` is printed if the dataset has not reached its designated size (along with the count it stopped at).
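A rough, hypothetical sketch of this kind of completeness check, assuming one string per line and that each subset name gives the intended number of strings per file (the actual logic lives in `check.py`):

```python
import os

EXPECTED = {"1k": 1_000, "10k": 10_000, "100k": 100_000}  # assumed targets

def check_file(path: str, expected: int) -> str:
    if not os.path.exists(path):
        return "missing"
    with open(path, encoding="utf-8") as f:
        n = sum(1 for _ in f)
    return "ok" if n >= expected else f"incomplete (stopped at {n})"
```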
The currently supported RNN types are s-RNN, GRU and LSTM.
To train a single model, run `main.py` in `/src/neural_net/tensorflow`. Most of its arguments are self-explanatory, but note that the `--bidi` flag denotes whether the model's RNN is bidirectional. Example:

```
python src/neural_net/tensorflow/main.py --batch-size 64 --epochs 30 --embed-dim 100 --rnn-type gru --dropout 0 --bidi False --train-data "/src/data_gen/data/100k/SL.4.2.0_Training.txt" --val-data "/src/data_gen/data/100k/SL.4.2.0_Dev.txt" --output-dir "/src/models"
```
Possible arguments for `main.py` are listed below (along with whether they are required):

- `--train-data`, type=str, required
- `--val-data`, type=str, required
- `--output-dir`, type=str, required
- `--batch-size`, type=int, default=64
- `--epochs`, type=int, default=10
- `--embed-dim`, type=int, default=100
- `--dropout`, type=float, default=0.2
- `--rnn-type`, type=str, default="simple" (valid values: "simple", "gru", or "lstm")
- `--bidi`, type=bool, default=False
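To make these options concrete, here is a minimal sketch of the kind of binary-classification model they configure; the hidden size and other details below are illustrative assumptions, and the actual architecture is defined in `main.py`.

```python
import tensorflow as tf

def build_model(vocab_size, embed_dim=100, rnn_type="simple",
                dropout=0.2, bidi=False, hidden_units=64):
    # hidden_units is a hypothetical choice; main.py sets its own value.
    cells = {"simple": tf.keras.layers.SimpleRNN,
             "gru": tf.keras.layers.GRU,
             "lstm": tf.keras.layers.LSTM}
    rnn = cells[rnn_type](hidden_units, dropout=dropout)
    if bidi:
        rnn = tf.keras.layers.Bidirectional(rnn)
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        rnn,
        tf.keras.layers.Dense(1, activation="sigmoid"),  # accept / reject
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```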
To produce a set of predictions from a data file (test set) and a trained model, run `predict.py` in `/src/neural_net/tensorflow/` with the model directory and test-set file as arguments. This will write the model's predictions to a file named after the test data file with `_pred.txt` appended, in the model's directory. Example:

```
python src/neural_net/tensorflow/predict.py --model-dir "models/Bi_lstm_NoDrop_SL.4.2.1_100k" --data-file "data_gen/10k/SL.4.2.0_Test1.txt"
```
To evaluate a model's predictions, run the script `eval.py` in `/src/neural_net/tensorflow`. This program takes a prediction file produced by `predict.py` and writes an equivalently-named `_eval.txt` file, also in the model's directory, reporting a number of statistics about the model's predictions. Example:

```
python src/neural_net/tensorflow/eval.py --predict-file "models/BiGRU_NoDrop_SL.4.2.1_100k/Test1_pred.txt"
```
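For reference, the two metrics later collected from the `_eval.txt` files (accuracy and F-score) can be computed from gold/predicted label pairs as below; the pairing format assumed here is an illustration, not the repo's actual prediction-file layout.

```python
def accuracy_and_f1(pairs):
    """pairs: iterable of (gold, pred) binary labels, 1 = accept, 0 = reject."""
    tp = fp = fn = correct = total = 0
    for gold, pred in pairs:
        total += 1
        correct += int(gold == pred)
        tp += int(gold == 1 and pred == 1)
        fp += int(gold == 0 and pred == 1)
        fn += int(gold == 1 and pred == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return correct / total, f1
```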
The script `batch_train.sh` will train multiple models according to the parameters at the top of the file under the heading 'PARAMETERS TO EDIT' (shown below). This script depends on the languages listed in `/tags.txt` (one language per line), each of which must also have a corresponding `.fst` file in `/src/fstlib/lib_fst`.

```
UNIVERSAL_ARGS=( --batch-size 64 --epochs 30 --embed-dim 100 )
rnn_type="simple" # simple / gru / lstm
```
These default values can be tweaked as needed. The `UNIVERSAL_ARGS` list can be edited to vary the batch size, number of epochs, and embedding dimension. The `rnn_type` value can be changed to any of the three supported RNN types. With the configuration above, the script will train s-RNN models for all of the languages listed in `/tags.txt`, one for each data size.
For each model, the script will also run `predict.py` on each test set to generate three `_pred.txt` files, and then run `eval.py` on each prediction file. These files are all generated in the corresponding model's directory.
After models have been trained and evaluated (i.e. `_eval.txt` files exist in `/models`), run the script `collect_evals.sh`. This will collect all of the `_eval.txt` files present in `/models/` and write their evaluation metrics (currently F-score and accuracy) to `/all_evals.txt`.
To organize these results into a `.csv` file, run `/evals2csv.py`. This will generate the file `/all_evals.csv`.
`/src/subreglib` contains `.plebby` files for use with `plebby`, which is included in The Language Toolkit by Dakotah Lambert. Each file specifies the acceptors for various languages and, when run via `plebby`, outputs `.att` files for each language. A usage guide for `plebby` is here. For more information about the organization of the `.plebby` files used in this library, refer to the README in `/src/subreglib`.
This repository is the latest installment of work by several individuals since 2017 and is a continuation of the work reported by Avcu et al. (2017), with Heinz, Fodor, and Shibata overseeing its development. Apart from Shibata, the researchers are based at Stony Brook University.
- Derek Andersen (CompLing MA 2021)
- Joanne Chau (CompLing MA 2020)
- Paul Fodor (Professor of Instruction, CS)
- Tiantian Gao (CS, PhD 2019)
- Jeffrey Heinz (Professor, Linguistics & IACS)
- Dakotah Lambert (Ling, current PhD)
- Kalina Kostyszyn (Ling, current PhD)
- Emily Peterson (Ling, CompLing MA 2020)
- Chihiro Shibata (Professor, Hosei University)
- Cody St. Clair (Ling MA 2020)
- Sam van der Poel (current AMS MA)
- Rahul Verma (CS, MS 2018)
Most recently, this repository is a continuation by Derek Andersen of the work done by Peterson, St. Clair, and Chau here. A summary of the results from their work is in `/docs/2020_report.pdf`. They themselves forked Kostyszyn's repo. Kostyszyn inherited the code from Gao, and Verma originated the code base for the project.