
A library for generalized sequence to sequence models

Primary language: Python. License: Apache License 2.0.

T2T: Tensor2Tensor Transformers

T2T is a modular and extensible library, with accompanying binaries, for supervised learning with TensorFlow, with a focus on sequence tasks. It is actively used and maintained by researchers and engineers within Google Brain, and it strives to maximize idea bandwidth and minimize execution latency.

T2T is particularly well-suited to researchers working on sequence tasks. We're eager to collaborate with you on extending T2T's powers, so please feel free to open an issue on GitHub to kick off a discussion and to send along pull requests. See our contribution doc for details, and browse our open issues.

T2T overview

pip install tensor2tensor

PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

mv $TMP_DIR/tokens.vocab.32768 $DATA_DIR

# Train
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

# Decode

DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE

BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --beam_size=$BEAM_SIZE \
  --alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE

cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes

T2T modularizes training into several components, each of which can be seen in use in the above commands.

See the models, problems, and hyperparameter sets that are available:

t2t-trainer --registry_help
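
For scripted experiments, the walkthrough commands can also be assembled from Python. The following is a minimal sketch, not part of T2T: the helper names `trainer_args` and `decodes_path` are hypothetical, and the only flags and path conventions used are the ones that appear in the shell commands above.

```python
import os


def trainer_args(problem, model, hparams_set, data_dir, train_root, extra=()):
    """Build the t2t-trainer argument list used in the walkthrough.

    Hypothetical helper: it only mirrors the flags shown in the shell
    commands and does not call into T2T itself.
    """
    # Same layout as TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
    train_dir = os.path.join(train_root, problem, f"{model}-{hparams_set}")
    return [
        "t2t-trainer",
        f"--data_dir={data_dir}",
        f"--problems={problem}",
        f"--model={model}",
        f"--hparams_set={hparams_set}",
        f"--output_dir={train_dir}",
        *extra,
    ]


def decodes_path(decode_file, model, hparams_set, beam_size, alpha):
    """Name of the file that the walkthrough reads with `cat` after decoding."""
    return f"{decode_file}.{model}.{hparams_set}.beam{beam_size}.alpha{alpha}.decodes"


train_cmd = trainer_args(
    problem="wmt_ende_tokens_32k",
    model="transformer",
    hparams_set="transformer_base",
    data_dir="t2t_data",
    train_root="t2t_train",
)
decode_out = decodes_path("decode_this.txt", "transformer", "transformer_base", 4, 0.6)
```

The resulting list can be passed to `subprocess.run(train_cmd, check=True)` once `t2t-trainer` is on the `PATH`.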

Datasets

Datasets are all standardized on TFRecord files containing tensorflow.Example protocol buffers. All datasets are registered and generated with the data generator, and many common sequence datasets are already available for generation and use.

Problems and Modalities

Problems define training-time hyperparameters for the dataset and task, mainly by setting input and output modalities (e.g. symbol, image, audio, label) and vocabularies, if applicable. All problems are defined in problem_hparams.py. Modalities, defined in modality.py, abstract away the input and output data types so that models may deal with modality-independent tensors.

Models

T2TModels define the core tensor-to-tensor transformation, independent of input/output modality or task. Models take dense tensors in and produce dense tensors that may then be transformed in a final step by a modality depending on the task (e.g. fed through a final linear transform to produce logits for a softmax over classes). All models are imported in models.py and inherit from T2TModel, which is defined in t2t_model.py.
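
The division of labor between modalities and models can be illustrated with a toy example. This is a conceptual sketch in plain Python, not T2T code: `ToySymbolModality`, `to_dense`, `to_logits`, and `toy_model_body` are hypothetical names, and reusing the embedding table as the output projection is a simplification chosen to keep the example short.

```python
import math
import random


class ToySymbolModality:
    """Hypothetical modality: maps token ids to dense vectors and back to logits."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # One embedding row per vocabulary entry.
        self.embedding = [
            [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(vocab_size)
        ]

    def to_dense(self, token_ids):
        """Input side: token ids -> one dense vector per token."""
        return [list(self.embedding[i]) for i in token_ids]

    def to_logits(self, dense):
        """Output side: linear transform of each dense vector to vocab-sized logits."""
        return [
            [sum(h * w for h, w in zip(vec, row)) for row in self.embedding]
            for vec in dense
        ]


def toy_model_body(dense):
    """Stand-in for a model: dense in, dense out, with no knowledge of token ids."""
    return [[math.tanh(x) for x in vec] for vec in dense]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


modality = ToySymbolModality(vocab_size=5, hidden_size=3)
dense_in = modality.to_dense([0, 3, 4])      # modality: symbols -> dense
dense_out = toy_model_body(dense_in)         # model: dense -> dense
probs = [softmax(row) for row in modality.to_logits(dense_out)]  # modality: dense -> classes
```

Because `toy_model_body` only ever sees dense vectors, swapping in a different input or output modality would not require changing it, which is the property the paragraph above describes.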

Hyperparameter Sets

Hyperparameter sets are defined and registered in code with @registry.register_hparams and are encoded in tf.contrib.training.HParams objects. The HParams are available to both the problem specification and the model. A basic set of hyperparameters is defined in common_hparams.py, and hyperparameter set functions can compose other hyperparameter set functions.

Trainer

The trainer binary is the main entrypoint for training, evaluation, and inference. Users can easily switch between problems, models, and hyperparameter sets by using the --model, --problems, and --hparams_set flags. Specific hyperparameters can be overridden with the --hparams flag. The --schedule flag and related flags control local and distributed training/evaluation (see the distributed training documentation).
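
To make the override semantics concrete, here is a minimal sketch of applying a string of overrides on top of a chosen hyperparameter set. It assumes a comma-separated `name=value` format for the override string and uses a plain dict instead of an HParams object; `apply_overrides` is a hypothetical helper, not the trainer's actual parsing code.

```python
def apply_overrides(hparams, overrides):
    """Return a copy of `hparams` with 'name=value,name=value' overrides applied.

    Assumed format: comma-separated name=value pairs. Each value is cast to
    the type of the existing default, and unknown names are rejected.
    """
    updated = dict(hparams)
    for item in filter(None, (part.strip() for part in overrides.split(","))):
        name, sep, raw = item.partition("=")
        name, raw = name.strip(), raw.strip()
        if not sep or name not in updated:
            raise ValueError(f"unknown or malformed override: {item!r}")
        default = updated[name]
        if isinstance(default, bool):
            # bool must be checked first, because bool("false") would be True.
            updated[name] = raw.lower() in ("true", "1")
        else:
            updated[name] = type(default)(raw)
    return updated


base = {"batch_size": 4096, "learning_rate": 0.1, "shared_embedding": False}
tuned = apply_overrides(base, "batch_size=1024,learning_rate=0.05")
```

Casting to the type of the existing default keeps a value such as `batch_size` an integer after it has been overridden from the command line.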

Adding a dataset

See the data generators README.


Note: This is not an official Google product.