We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-trained models, please send your resume to fuwei@microsoft.com.
Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
BEiT (NEW): generative self-supervised pre-training for image / BERT Pre-Training of Image Transformers
WavLM (NEW): speech pre-training for full stack tasks
LayoutLM: multimodal (text + layout/format + image) pre-training for Document AI (e.g. scanned documents, PDF, etc.)
LayoutXLM: multimodal (text + layout/format + image) pre-training for multilingual document understanding
MarkupLM (NEW): markup language model pre-training for visually-rich document understanding
UniSpeech: unified pre-training for self-supervised learning and supervised learning for ASR
UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training
SpeechT5 (NEW): encoder-decoder pre-training for spoken language processing
VLMo (NEW): Unified vision-language pre-training - evolution of BEiT to multimodal
s2s-ft: sequence-to-sequence fine-tuning toolkit
TrOCR (NEW): transformer-based OCR w/ pre-trained models
LayoutReader: pre-training of text and layout for reading order detection
XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
- [Model Release] December 16th, 2021: TrOCR small models for handwritten and printed texts, with 3x inference speedup.
- November 24th, 2021: VLMo as the new SOTA on the VQA Challenge
- November, 2021: Multilingual translation at scale: 10000 language pairs and beyond
- [Model Release] November, 2021: MarkupLM
- [Model Release] November, 2021: VLMo - Unified vision-language pre-training w/ BEiT
- October, 2021: WavLM Large achieves state-of-the-art performance on the SUPERB benchmark
- [Model Release] October, 2021: WavLM - Large-scale self-supervised pre-trained models for speech.
- [Model Release] October 2021: TrOCR is on HuggingFace
- September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
- [Model Release] September, 2021: LayoutLM-cased are on HuggingFace
- [Model Release] September, 2021: TrOCR - Transformer-based OCR w/ pre-trained BEiT and RoBERTa models.
- August 2021: LayoutLMv2 and LayoutXLM are on HuggingFace
- [Model Release] August, 2021: LayoutReader - Built with LayoutLM to improve general reading order detection.
- [Model Release] August, 2021: DeltaLM - Encoder-decoder pre-training for language generation and translation.
- August 2021: BEiT is on HuggingFace
- [Model Release] July, 2021: BEiT - Towards BERT moment for CV
- [Model Release] June, 2021: LayoutLMv2, LayoutXLM, MiniLMv2, and AdaLM.
- May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
- April, 2021: LayoutXLM extends LayoutLM with multilingual support. A multilingual form understanding benchmark, XFUND, is also introduced; it includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
- March, 2021: InfoXLM was accepted by NAACL 2021.
- December 29th, 2020: LayoutLMv2 arrives with new SOTA results on a wide variety of Document AI tasks, including the DocVQA and SROIE leaderboards.
- October 8th, 2020: T-ULRv2 (aka InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
- September, 2020: MiniLM was accepted by NeurIPS 2020.
- July 16, 2020: InfoXLM (Multilingual UniLM) arXiv
- June, 2020: UniLMv2 was accepted by ICML 2020; LayoutLM was accepted by KDD 2020.
- April 5, 2020: Multilingual MiniLM released!
- September, 2019: UniLMv1 was accepted by NeurIPS 2019.
***** New October, 2021: WavLM release *****
- WavLM (October 27, 2021): a new pre-trained speech model for full-stack downstream speech tasks. WavLM integrates the gated relative position embedding structure and the utterance mixing method to model both spoken content and speaker identity. WavLM is trained on 94k hours of public audio data, more than was used for other released checkpoints for English speech modeling. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks. "WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing"
***** New October, 2021: MarkupLM release *****
- MarkupLM (October 19, 2021): MarkupLM, a simple yet effective pre-training approach for text and markup language. With the Transformer architecture, MarkupLM integrates different input embeddings including text embeddings, position embeddings, and XPath embeddings. Furthermore, we also propose new pre-training objectives that are specially designed for understanding the markup language. We evaluate the pre-trained MarkupLM model on the WebSRC and SWDE datasets. Experiments show that MarkupLM significantly outperforms several SOTA baselines in these tasks. "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding"
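To make the XPath embedding input more concrete, here is a minimal sketch of how an XPath expression could be broken into per-level units before being embedded. It assumes each level is represented as a (tag name, subscript) pair and that levels without a subscript get 0; the function name `split_xpath` and the regular expression are hypothetical and are not taken from the MarkupLM code.

```python
import re
from typing import List, Tuple

# Hypothetical helper: one XPath step such as "li[1]" or "div".
_STEP = re.compile(r"^([A-Za-z][\w\-]*)(?:\[(\d+)\])?$")


def split_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split a simple absolute XPath into (tag, subscript) units.

    Assumption: steps without an explicit subscript are given 0.
    Only plain steps like "tag" or "tag[n]" are supported.
    """
    units: List[Tuple[str, int]] = []
    for step in xpath.strip("/").split("/"):
        if not step:
            continue
        match = _STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported XPath step: {step!r}")
        tag, subscript = match.group(1), match.group(2)
        units.append((tag, int(subscript) if subscript else 0))
    return units


print(split_xpath("/html/body/div/li[1]/div/span[2]"))
# [('html', 0), ('body', 0), ('div', 0), ('li', 1), ('div', 0), ('span', 2)]
```

In a model, each tag name and each subscript would then be looked up in its own embedding table; that step is omitted here.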
***** September, 2021: TrOCR release *****
- TrOCR (September 22, 2021): Transformer-based OCR with pre-trained models, which leverages the Transformer architecture for both image understanding and bpe-level text generation. The TrOCR model is simple but effective (convolution free), and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models"
***** August, 2021: LayoutReader release *****
- LayoutReader (August 26, 2021): pre-training of text and layout for reading order detection. The pre-trained LayoutReader significantly improves both open-source and commercial OCR engines in ordering text lines. Meanwhile, we also created a reading order benchmark dataset ReadingBank to further empower the research in this area. "LayoutReader: Pre-training of Text and Layout for Reading Order Detection" (EMNLP 2021)
***** August, 2021: DeltaLM release *****
- DeltaLM (August, 2021): encoder-decoder pre-training for language generation and translation. DeltaLM ranks first on the WMT21 multilingual translation task. The task requires a model to translate between 102 languages. "DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders."
***** July, 2021: BEiT release *****
- BEiT (June 15, 2021): BERT Pre-Training of Image Transformers. BEiT-large achieves state-of-the-art results on ADE20K (a big jump to 57.0 mIoU) for semantic segmentation. BEiT-large achieves state-of-the-art ImageNet top-1 accuracy (88.6%) under the setting without extra data other than ImageNet-22k. "BEiT: BERT Pre-Training of Image Transformers"
***** June, 2021: LayoutXLM | AdaLM | MiniLMv2 release *****
- LayoutXLM (April 17, 2021): multimodal pre-training for multilingual visually-rich document understanding. The pre-trained LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the FUNSD and multilingual XFUND dataset including 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding"
- AdaLM (June 2021): a simple yet effective approach for domain adaptation of pre-trained models. Biomedical-specific pre-trained models are released. "Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains" (ACL 2021)
- MiniLMv2 (December, 2020): a simple yet effective task-agnostic knowledge distillation method, namely multi-head self-attention relation distillation, for compressing large pre-trained Transformers into small and fast pre-trained models. MiniLMv2 significantly outperforms MiniLMv1. Both English and multilingual MiniLM models are released. "MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers" (ACL 2021)
***** May, 2021: LayoutLMv2 | LayoutXLM release *****
- LayoutLM 2.0 (December 29, 2020): multimodal pre-training for visually-rich document understanding by leveraging text, layout and image information in a single framework. It achieves new SOTA results on a wide range of document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding" (ACL 2021)
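The before/after scores quoted for LayoutLM 2.0 are easier to compare as absolute gains. The short sketch below only redoes that arithmetic on the numbers given in the text; the dictionary layout and the helper name `absolute_gains` are illustrative choices, and the underlying metrics differ per dataset, so the gains are not directly comparable across rows.

```python
from typing import Dict, Tuple

# (previous SOTA, LayoutLMv2) pairs copied from the release note above.
scores: Dict[str, Tuple[float, float]] = {
    "FUNSD": (0.7895, 0.8420),
    "CORD": (0.9493, 0.9601),
    "SROIE": (0.9524, 0.9781),
    "Kleister-NDA": (0.834, 0.852),
    "RVL-CDIP": (0.9443, 0.9564),
    "DocVQA": (0.7295, 0.8672),
}


def absolute_gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return new minus old for each task, rounded to 4 decimal places."""
    return {task: round(new - old, 4) for task, (old, new) in table.items()}


gains = absolute_gains(scores)
# Print tasks ordered from the largest gain to the smallest.
for task, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:<13} +{gain * 100:.2f} points")
```

On these numbers, DocVQA shows the largest absolute gain and CORD the smallest.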
***** February, 2020: UniLM v2 | MiniLM v1 | LayoutLM v1 | s2s-ft v1 release *****
- LayoutLM 1.0 (February 18, 2020): pre-trained models for document (image) understanding (e.g. receipts, forms, etc.). It achieves new SOTA results in several downstream tasks, including form understanding (the FUNSD dataset from 70.72 to 79.27), receipt understanding (the ICDAR 2019 SROIE leaderboard from 94.02 to 95.24) and document image classification (the RVL-CDIP dataset from 93.07 to 94.42). "LayoutLM: Pre-training of Text and Layout for Document Image Understanding" (KDD 2020)
- s2s-ft 1.0 (February 26, 2020): A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation. "s2s-ft: Fine-Tuning Pre-Trained Transformers for Sequence-to-Sequence Learning"
- MiniLM 1.0 (February 26, 2020): deep self-attention distillation is all you need (for task-agnostic knowledge distillation of pre-trained Transformers). MiniLM (12-layer, 384-hidden) achieves 2.7x speedup and comparable results over BERT-base (12-layer, 768-hidden) on NLU tasks as well as strong results on NLG tasks. The even smaller MiniLM (6-layer, 384-hidden) obtains 5.3x speedup and produces very competitive results. "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers" (NeurIPS 2020)
- UniLM 2.0 (February 28, 2020): unified pre-training of bi-directional LM (via autoencoding) and sequence-to-sequence LM (via partially autoregressive) w/ Pseudo-Masked Language Model for language understanding and generation. UniLM v2 achieves new SOTA in a wide range of natural language understanding and generation tasks. "UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training" (ICML 2020)
***** October 1st, 2019: UniLM v1 release *****
- UniLM v1 (September 30, 2019): the code and pre-trained models for the NeurIPS 2019 paper entitled "Unified Language Model Pre-training for Natural Language Understanding and Generation". UniLM (v1) achieves the new SOTA results in NLG (especially sequence-to-sequence generation) tasks, including abstractive summarization (the Gigaword and CNN/DM datasets), question generation (the SQuAD QG dataset), etc.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.
Microsoft Open Source Code of Conduct
For help or issues using the pre-trained models, please submit a GitHub issue.
For other communications, please contact Furu Wei (fuwei@microsoft.com).