This is complete code for a sequence to sequence transformer, the goal is to train it (supervisedly) to translate short sequences of integers into short sequences of integers.
One way to run it out of the box is to run (on an environment where you have pytorch and numpy installed):
python train.py --dump_path /some_path_on_your_computer/ --exp_name my_first_experiment --exp_id 1 --operation "gcd"
This should train a transformer to compute the GCD of integers, encoded in base 1000, generated on the fly (between 1 and a million).
You can try other operations with the parameter --operation : modular_add, modular_mul, fraction_compare, fraction_add, fraction_simplify, matrix_rank...
Modular operations use modulo 67 by default, you can change it with --modulo
Integers are encoded in base 1000 by default, you can change it with --base
These problem specific parameters are to be found in src/envs/arithmetic.py
If you don't have a NVIDIA GPU, you can train on CPU by stating --cpu true (it will be slow, you may want to set --epoch_size to something small, like 10000...)
If everything goes fine, the log should appear on the screen. It will also be saved in
/some_path_on_your_computer/my_first_experiment/1/train.log
(if you don't provide an exp_id, the program will create one for you)
After indicating the hyperparameters you use, and the model you train (and its number of parameters), the logs will describe the training, which is a series of training epochs (300,000 examples by default, can be changed via --epoch_size), separated by evaluations over a test set (generated on the fly in the case of the GCD). During epochs, the model will report the training loss (this should go down, but it can be bumpy, if it bumps too much your learning rate is probably too large).
INFO - 09/18/24 20:41:05 - 0:00:16 - 400 - 977.02 equations/s - 9768.07 words/s - ARITHMETIC: 0.5479 - LR: 1.0000e-04
INFO - 09/18/24 20:41:11 - 0:00:23 - 600 - 940.88 equations/s - 9408.46 words/s - ARITHMETIC: 0.4181 - LR: 1.0000e-04
INFO - 09/18/24 20:41:18 - 0:00:30 - 800 - 955.94 equations/s - 9558.79 words/s - ARITHMETIC: 0.3715 - LR: 1.0000e-04
Training losses are logged every 200 optimization steps (this is configurable, in trainer.py), here, you see it going down 0.55 to 0.42 to 0.37. Life is good! The eq/s and words/s give an idea of the learning speed, eqs are examples: here with 950 examples / s you expect to complete a 300k example epoch in a little more than 6 minutes (we are on a GPU).
At the end of each epoch, the model runs test on a sample of size --eval_size (10k by default, you can make this smaller), examples are evaluated in batches of --batch_size_eval. During evaluation, the lines
INFO - 09/18/24 20:57:28 - 0:16:39 - (7168/10000) Found 102/128 valid top-1 predictions. Generating solutions ...
INFO - 09/18/24 20:57:28 - 0:16:39 - Found 102/128 solutions in beam hypotheses.
indicate how many solutions were correct in each eval batch (here 102 our of 128). At the end, you should have a small report saying :
INFO - 09/18/24 20:57:29 - 0:16:40 - 8459/10000 (84.59%) equations were evaluated correctly.
INFO - 09/18/24 20:57:29 - 0:16:40 - 1: 6104 / 6107 (99.95%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 2: 1581 / 1581 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 4: 356 / 356 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 5: 241 / 245 (98.37%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 8: 86 / 86 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 10: 56 / 57 (98.25%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 20: 22 / 22 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 25: 8 / 8 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 40: 2 / 2 (100.00%)
INFO - 09/18/24 20:57:29 - 0:16:40 - 50: 3 / 3 (100.00%)
84.6% of the test GCD were correctly calculated. Correct model predictions were GCD 1, 2, 4, 5, 8 .... products of powers of divisors of the base...
At the end of the peoch, the model exports a python dictionary containing detailed results. This is what you want to load (in a notebook) to draw learning curves, etc.
Training and test files can be provided with the parameters: --train_data
and --eval_data
(setting --eval_size
to -1
will cause the model to evaluate on all the eval data).
Training and test examples are written, one per line, as sequence of tokens, separated by whitespaces, the input and output being separated by a tab.
One specify the data type of the input and output, e.g.: --operation "data" --data_types '"int[5]:int"
The supported data types at the moment are:
-
int
-- an integerencoded as
p ad ... a0
wherep
in{+, -}
andai
are the digits ofa
in base1000
(by default), e.g.,-3500
is represented as- 3 500
-
int[n]
-- an integer array of lengthrepresented as
Vn z1 ... zn
wherezi
are encoded as above -
range(a, b)
-- an integer in the range{a,...,b-1}
encoded as a string in base 10, e.g., via 1101
.
For example, here are some python functions to encode the above data types, respectively:
def encode_integer(val, base=1000, digit_sep=" "):
if val == 0:
return '+ 0'
sgn = '+' if val >= 0 else '-'
val = abs(val)
r = []
while val > 0:
r.append(str(val % base))
val = val//base
r.append(sgn)
r.reverse()
return digit_sep.join(r)
def encode_integer_array(x, base=1000):
return f'V{len(x)} ' + " ".join(encode_integer(int(z), base) for z in x)
def encode_range(x):
return str(int(x))
For example, for GCD
we would use int[2]:int
where
V2 + 1 24 + 16\t+ 16\n
represents GCD (1024, 16) = 16
, in base 1000
. Note that here V2
, 1
, 24
, 16
are words/tokens.
For an elliptic curve and if it has nontrivial rank, I would have something like int[5]:range(2)
V5 + 0 - 1 + 0 - 84 375 258 - 298 283 918 238\t1
The code is organised as follows:
train.py is the main file, you run python train.py with some parameters, you can train from generated data (using envs/generators), generate and export data (setting --export_data to true), or train and test from external data (using train_data and test_data)
src/envs contain the math-specific code