As part of Microsoft’s AI at Scale initiative, we developed a family of large AI models called the Microsoft Turing models. The Turing models are trained with billions of pages of publicly available text, absorbing the nuances of language, grammar, knowledge, concepts and context. The models excel at multiple language tasks, such as completing text predictions, reading comprehension, common-sense reasoning, paraphrasing a lengthy speech, finding relevant passages across thousands of documents, and word-sense disambiguation.
We created the Microsoft Turing Academic Program (MS-TAP) as part of our commitment to democratizing AI, and the responsible development of Microsoft’s AI models. The program provides access to Turing models in order to support high-impact scholarly research, including efforts aimed at advancing principles of learning and reasoning, exploring novel applications, and pursuing better understanding of challenges and opportunities with regard to the ethical and responsible use of large-scale neural models.
These models have already been used to improve language understanding tasks across Bing, Office, Dynamics, Edge and other productivity products. In Bing they are used for caption generation, question answering and summarization across 100+ languages. Across Office, Turing models power features including Text Predictions in Word, Outlook and Teams, helping users type faster and with fewer mistakes; Suggested Replies in Outlook, which automatically recommends responses to an email; and Smart Find in Word, which enables a much broader set of search queries beyond “exact match.” In Dynamics, Turing has been adapted with business-domain data using Azure technologies, in full compliance with user privacy and enterprise contractual obligations. Within Dynamics 365 Sales Insights, it suggests the next best action with a customer based on previous interactions.
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model
- Microsoft Turing Universal Language Representation model, T-ULRv5, tops XTREME leaderboard and trains 100x faster
- Turing Bletchley: A Universal Image Language Representation model by Microsoft - Microsoft Research
- Microsoft Makes It Easier To Build Popular Language Representation Model At Large Scale
- How Azure Machine Learning Powers SuggestedReplies In Outlook
- Better Document Previews using the Microsoft Turing Model for Natural Language Representations - Microsoft Research
- How Azure Machine Learning Enables Powerpoint Designer
- Accelerate Your NLP Pipelines Using Hugging Face Transformers & ONNX Runtime
- OpenAI - Azure Supercomputer
- Microsoft details how it improved Bing's autosuggest recommendations with AI
We are excited to release a private preview of the Turing Natural Language Representation v5 (T-NLRv5) model to our MS-TAP partners as part of our commitment to responsible AI development. MS-TAP partners will have access to the base (12-layer, 768-hidden, 12 attention heads, 184M parameters) and large (24-layer, 1024-hidden, 16 attention heads, 434M parameters) T-NLRv5 models.
T-NLRv5 integrates some of the best modeling techniques developed by Microsoft Research, Azure AI, and Microsoft Turing. The models are pretrained at large scale using an efficient training framework based on FastPT and DeepSpeed. T-NLRv5 is the state of the art at the top of the SuperGLUE and GLUE leaderboards, further surpassing human performance and other models. Notably, T-NLRv5 was the first to achieve human parity on MNLI and RTE, the last two GLUE tasks on which human parity had not yet been reached. In addition, T-NLRv5 is more efficient than recent pretraining models, achieving comparable effectiveness with 50% fewer parameters and lower pretraining compute costs.
T-NLRv5 is largely based on our recent work, COCO-LM, a natural evolution of the pretraining paradigm that converges the benefits of ELECTRA-style models and corrective language model pretraining. Read more about T-NLRv5 in our blog post.
This model employs an auxiliary transformer language model to corrupt an input text sequence; the main transformer model is then pretrained with the corrective language model task, which is to detect and correct the tokens replaced by the auxiliary model. This augments the ELECTRA model family with language modeling capacity, combining the benefits of pretraining against adversarial signals generated by the auxiliary model with a language modeling capability that is handy for prompt-based learning.
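As a toy illustration of the idea (plain Python with hypothetical token lists, not the actual T-NLRv5 training code), the corrective language model targets can be constructed from an original sequence and its auxiliary-model corruption:

```python
def corrective_lm_targets(original, corrupted):
    """Build corrective-LM training targets.

    Returns (replaced_flags, recover_targets):
      - replaced_flags[i] is 1 if position i was replaced by the
        auxiliary model, else 0 (the detection signal)
      - recover_targets[i] is always the original token, so the main
        model learns to copy unchanged tokens and correct replaced ones
    """
    assert len(original) == len(corrupted)
    replaced = [int(o != c) for o, c in zip(original, corrupted)]
    targets = list(original)  # correction target is the original token
    return replaced, targets

# Suppose the auxiliary LM replaced "sat" with "ran" and "mat" with "hat".
orig = ["the", "cat", "sat", "on", "the", "mat"]
corr = ["the", "cat", "ran", "on", "the", "hat"]
flags, targets = corrective_lm_targets(orig, corr)
# flags   -> [0, 0, 1, 0, 0, 1]
# targets -> ["the", "cat", "sat", "on", "the", "mat"]
```

In the real model both signals are predicted over the vocabulary by the main transformer; the sketch only shows how the supervision pairs are derived.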
Like other publicly available language models, the Microsoft Turing models are trained with billions of pages of publicly available text, and hence may have picked up biases around gender, race and more from these public documents. Mitigating negative effects from these biases is a hard, industry-wide issue, and Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and have taken extensive precautionary measures to prevent these implicit biases from being exhibited when the models are used in our products. We strongly encourage developers to do the same by putting appropriate guardrails and mitigations in place before taking these models to production. Learn more about the Microsoft Turing language models’ limitations and risks.
1. configs
- tnlrv5-base-cased.json: configuration for base model.
- tnlrv5-large-cased.json: configuration for large model.
2. models
- tnlrv5_base.pt: checkpoint for base model.
- tnlrv5_large.pt: checkpoint for large model.
3. src
- tnlrv5: utility functions for the TNLRv5 model.
- GLUE
- download_glue_data.py: download GLUE data.
- run_classifier.py: main body of GLUE finetuning task.
- utils_for_glue.py: utility functions of glue datasets.
- SQuAD
- run_squad.py: main body of SQuAD finetuning task.
- utils_for_squad.py: utility functions of squad dataset.
- utils_squad_evaluate.py: utility functions of evaluating squad dataset.
4. vocab
- dict.txt: vocabulary dictionary.
- sp.model: SentencePiece model.
- Install git lfs:
  apt-get update && apt-get install git-lfs
- Clone the repository using a personal access token:
  - Create a personal access token following this link.
  - Clone the repository:
    git lfs clone https://username@github.com/username/mstap-TNLR-harvard-cai-lu
    [Password: personal access token]
- PyTorch >= 1.6 (CUDA version)
  - Install PyTorch following this link.
- Apex (same CUDA version as PyTorch)
  - Install Apex following the instructions here.
  - Apex is installed successfully if the CUDA runtime version (nvcc -V) and the version PyTorch was built with (import torch; print(torch.__version__)) are the same.
- Transformers
  - Install transformers using:
    pip install transformers==2.10.0
The GLUE dataset can be downloaded by running the following script:
# Set path for this repository
export HOME_DIR=~/mstap-TNLR-harvard-cai-lu
cd ${HOME_DIR}
python src/download_glue_data.py
# Set path to read training/dev dataset that was downloaded in the previous step
export DATASET_PATH=${HOME_DIR}/glue_data/MNLI
# Set path to save the finetuned model and result score
export OUTPUT_PATH=${HOME_DIR}/mnli_base_ft/
export TASK_NAME=mnli
# Set model name (or checkpoint path) for finetuning
export CKPT_PATH=tnlrv5-base-cased
# Set max sequence length
export MAX_LEN=512
# Set config file
export CONFIG_FILE=${HOME_DIR}/configs/tnlrv5-base-cased.json
# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_base_cased.$MAX_LEN.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_base_cased.$MAX_LEN.dev.cache
# Setting the hyperparameters for the run
export BSZ=32
export LR=1e-5
export EPOCH=5
export WD=0.1
export WM=0.0625
CUDA_VISIBLE_DEVICES=0 python src/run_classifier.py \
--model_type tnlrv5 --model_name_or_path $CKPT_PATH --task_name $TASK_NAME \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--config_name $CONFIG_FILE --tokenizer_name tnlrv5-cased \
--do_train --evaluate_during_training --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length $MAX_LEN --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16_init_loss_scale 128.0 --adam_epsilon 1e-6 --adam_betas "0.9,0.98" \
--dropout_prob 0.1 --cls_dropout_prob 0.1 \
--fp16 --fp16_opt_level O2 --seed 1
Reference dev results (seed 1):
MNLI-m: 90.219
MNLI-mm: 90.155
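The warmup ratio of 0.0625 exported above means the learning rate ramps linearly from zero to the peak over the first 6.25% of updates; schedules of this kind then typically decay linearly back to zero. A minimal sketch of such a schedule (a hypothetical helper; the exact schedule implemented in run_classifier.py may differ):

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr, warmup_ratio=0.0625):
    """Learning rate at a given step: linear warmup to peak_lr over
    warmup_ratio * total_steps updates, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With total_steps=1600 and warmup_ratio=0.0625, the peak is reached at step 100.
```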
# Set path to read training/dev dataset that was downloaded in the previous step
export DATASET_PATH=${HOME_DIR}/glue_data/MNLI
# Set path to save the finetuned model and result score
export OUTPUT_PATH=${HOME_DIR}/mnli_large_ft/
export TASK_NAME=mnli
# Set model name (or checkpoint path) for finetuning
export CKPT_PATH=tnlrv5-large-cased
# Set max sequence length
export MAX_LEN=512
# Set config file
export CONFIG_FILE=${HOME_DIR}/configs/tnlrv5-large-cased.json
# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_large_cased.$MAX_LEN.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_large_cased.$MAX_LEN.dev.cache
# Setting the hyperparameters for the run.
export BSZ=16 # CUDA out of memory if batch size is too large
export LR=3e-6
export EPOCH=2
export WD=0.1
export WM=0.0625
CUDA_VISIBLE_DEVICES=0 python src/run_classifier.py \
--model_type tnlrv5 --model_name_or_path $CKPT_PATH --task_name $TASK_NAME \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--config_name $CONFIG_FILE --tokenizer_name tnlrv5-cased \
--do_train --evaluate_during_training --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length $MAX_LEN --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16_init_loss_scale 128.0 --adam_epsilon 1e-6 --adam_betas "0.9,0.98" \
--dropout_prob 0.1 --cls_dropout_prob 0.1 \
--fp16 --fp16_opt_level O2 --seed 1
Reference dev results (seed 1):
MNLI-m: 91.544
MNLI-mm: 91.589
# Set path to read training/dev dataset that was downloaded in the previous step
export DATASET_PATH=${HOME_DIR}/glue_data/MNLI
# Set path to save the finetuned model and result score
export OUTPUT_PATH=${HOME_DIR}/mnli_large_ft/
export TASK_NAME=mnli
# Set model name (or checkpoint path) for finetuning
export CKPT_PATH=tnlrv5-large-cased
# Set max sequence length
export MAX_LEN=512
# Set config file
export CONFIG_FILE=${HOME_DIR}/configs/tnlrv5-large-cased.json
# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_large_cased.$MAX_LEN.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.tnlrv5_large_cased.$MAX_LEN.dev.cache
# Setting the hyperparameters for the run.
# per_gpu_train_batch_size = train_batch_size / num_gpus = 32 / 8 = 4
export BSZ=4
export LR=3e-6
export EPOCH=2
export WD=0.1
export WM=0.0625
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 src/run_classifier.py \
--model_type tnlrv5 --model_name_or_path $CKPT_PATH --task_name $TASK_NAME \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--config_name $CONFIG_FILE --tokenizer_name tnlrv5-cased \
--do_train --evaluate_during_training --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length $MAX_LEN --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16_init_loss_scale 128.0 --adam_epsilon 1e-6 --adam_betas "0.9,0.98" \
--dropout_prob 0.1 --cls_dropout_prob 0.1 \
--fp16 --fp16_opt_level O2 --seed 1
Reference dev results (seed 1):
MNLI-m: 91.544
MNLI-mm: 91.507
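The per-GPU batch-size comment in the run above follows a general rule: the effective global batch size equals the per-GPU batch, times the number of GPUs, times any gradient-accumulation steps. A one-line helper (illustrative only) makes this explicit:

```python
def effective_batch_size(per_gpu_batch, num_gpus, grad_accum_steps=1):
    """Global batch size seen by the optimizer per update."""
    return per_gpu_batch * num_gpus * grad_accum_steps

# 8 GPUs at 4 examples each reproduce the single-GPU batch size of 32.
assert effective_batch_size(4, 8) == 32
```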
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
# Set path for this repository
export HOME_DIR=~/mstap-TNLR-harvard-cai-lu
cd ${HOME_DIR}
# Set path to the location where the data will be downloaded
export DATASET_PATH=${HOME_DIR}/squad_data/
# Download the train & dev datasets
mkdir -p ${DATASET_PATH}
# Train dataset
export TRAIN_FILE=${DATASET_PATH}/train-v2.0.json
wget -O $TRAIN_FILE https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
# Dev dataset
export DEV_FILE=${DATASET_PATH}/dev-v2.0.json
wget -O $DEV_FILE https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
# Set path to save the finetuned model and result score
export OUTPUT_PATH=${HOME_DIR}/squad_ft/
# Set path to the model checkpoint you need to test
export CKPT_PATH=tnlrv5-base-cased
# Set config file
export CONFIG_FILE=${HOME_DIR}/configs/tnlrv5-base-cased.json
# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${TRAIN_FILE}_tnlrv5_base_cased.384doc.cache
export DEV_CACHE=${DEV_FILE}_tnlrv5_base_cased.384doc.cache
# Setting the hyperparameters for the run.
export BSZ=32
export LR=3e-5
export EPOCH=3
CUDA_VISIBLE_DEVICES=0 python src/run_squad.py \
--model_type tnlrv5 --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name tnlrv5-cased \
--train_file $TRAIN_FILE --predict_file $DEV_FILE \
--cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval \
--per_gpu_train_batch_size $BSZ --learning_rate $LR --num_train_epochs $EPOCH --gradient_accumulation_steps 1 \
--max_seq_length 384 --doc_stride 128 --output_dir $OUTPUT_PATH \
--version_2_with_negative --seed 1 --max_grad_norm 0 \
--weight_decay 0.1 --warmup_ratio 0.0625 \
--fp16_init_loss_scale 128.0 --adam_epsilon 1e-6 --adam_betas "0.9,0.98" \
--fp16_opt_level O2 --fp16
Reference dev results (seed 1):
F1 score: 88.207
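The F1 score reported above is SQuAD's standard token-overlap metric between predicted and gold answer spans (implemented in utils_squad_evaluate.py). A simplified sketch of the computation, ignoring SQuAD's answer normalization (lowercasing, punctuation and article stripping):

```python
from collections import Counter

def span_f1(prediction, gold):
    """Token-overlap F1 between two answer strings (simplified:
    whitespace tokenization, no SQuAD answer normalization)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap: precision = recall = 2/3, so F1 = 2/3.
assert abs(span_f1("a cat sat", "the cat sat") - 2 / 3) < 1e-9
```

The benchmark score averages this per-question F1 (taking the maximum over the gold answers for each question) across the dev set.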
# Set path for this repository
export HOME_DIR=~/mstap-TNLR-harvard-cai-lu
cd ${HOME_DIR}
# Set path to the location where the data will be downloaded
export DATASET_PATH=${HOME_DIR}/squad_data/
# Download the train & dev datasets
mkdir -p ${DATASET_PATH}
# Train dataset
export TRAIN_FILE=${DATASET_PATH}/train-v2.0.json
wget -O $TRAIN_FILE https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
# Dev dataset
export DEV_FILE=${DATASET_PATH}/dev-v2.0.json
wget -O $DEV_FILE https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
# Set path to save the finetuned model and result score
export OUTPUT_PATH=${HOME_DIR}/squad_ft/
# Set path to the model checkpoint you need to test
export CKPT_PATH=tnlrv5-base-cased
# Set config file
export CONFIG_FILE=${HOME_DIR}/configs/tnlrv5-base-cased.json
# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${TRAIN_FILE}_tnlrv5_base_cased.384doc.cache
export DEV_CACHE=${DEV_FILE}_tnlrv5_base_cased.384doc.cache
# Setting the hyperparameters for the run.
# per_gpu_train_batch_size = train_batch_size / num_gpus = 32 / 8 = 4
export BSZ=4
export LR=3e-5
export EPOCH=3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 src/run_squad.py \
--model_type tnlrv5 --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name tnlrv5-cased \
--train_file $TRAIN_FILE --predict_file $DEV_FILE \
--cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval \
--per_gpu_train_batch_size $BSZ --learning_rate $LR --num_train_epochs $EPOCH --gradient_accumulation_steps 1 \
--max_seq_length 384 --doc_stride 128 --output_dir $OUTPUT_PATH \
--version_2_with_negative --seed 1 --max_grad_norm 0 \
--weight_decay 0.1 --warmup_ratio 0.0625 \
--fp16_init_loss_scale 128.0 --adam_epsilon 1e-6 --adam_betas "0.9,0.98" \
--fp16_opt_level O2 --fp16
Reference dev results (seed 1):
F1 score: 88.407
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
- Unified Language Model Pre-training for Natural Language Understanding and Generation
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
See CONTRIBUTING.md.
See LICENSE.txt.
See SECURITY.md.
Please email us at turing-academic@microsoft.com for troubleshooting, or file an issue through the repo.