We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and AGI, NLP, MT, Speech, Document AI and Multimodal AI, please send your resume to fuwei@microsoft.com.
TorchScale - Transformers at (any) Scale (repo)
Fundamental research to improve modeling generality and capability, as well as training stability and efficiency for Transformers at any scale.
Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
Capability - A Length-Extrapolatable Transformer
Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)
MetaLM: Language Models are General-Purpose Interfaces
Kosmos-1: A Multimodal Large Language Model (MLLM)
The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
EdgeLM (NEW): small pre-trained models on edge/client devices
SimLM (NEW): large-scale pre-training for similarity matching
E5 (NEW): text embeddings
BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers
DiT (NEW): self-supervised pre-training for Document Image Transformers
WavLM: speech pre-training for full stack tasks
VALL-E: a neural codec language model for TTS
LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Model for Document AI (e.g. scanned documents, PDF, etc.)
LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI
MarkupLM: markup language model pre-training for visually-rich document understanding
XDoc: unified pre-training for cross-format document understanding
UniSpeech: unified pre-training for self-supervised learning and supervised learning for ASR
UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training
SpeechT5: encoder-decoder pre-training for spoken language processing
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
VLMo: Unified vision-language pre-training
VL-BEiT (NEW): Generative Vision-Language Pre-training - evolution of BEiT to multimodal
BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.
s2s-ft: sequence-to-sequence fine-tuning toolkit
Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm
TrOCR: transformer-based OCR w/ pre-trained models
LayoutReader: pre-training of text and layout for reading order detection
XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
LLMOps - General technology for enabling AI capabilities w/ LLMs and MLLMs (repo)
- [Model Release] March, 2023: BEiT-3 pretrained models and code.
- March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
- January, 2023: VALL-E - a language modeling approach for text-to-speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
- [Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
- November, 2022: TorchScale 0.1.1 was released!
- November, 2022: TrOCR was accepted by AAAI 2023.
- [Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
- [Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
- [Model Release] September, 2022: BEiT v2 code and pretrained models.
- August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks.
- July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
- June, 2022: DiT and LayoutLMv3 were accepted by ACM Multimedia 2022.
- June, 2022: MetaLM - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
- June, 2022: VL-BEiT - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
- [Model Release] June, 2022: LayoutLMv3 Chinese - Chinese version of LayoutLMv3
- [Code Release] May, 2022: Aggressive Decoding - Lossless Speedup for Seq2seq Generation
- April, 2022: Transformers at Scale = DeepNet + X-MoE
- [Model Release] April, 2022: LayoutLMv3 - Pre-training for Document AI with Unified Text and Image Masking
- [Model Release] March, 2022: EdgeFormer - Parameter-efficient Transformer for On-device Seq2seq Generation
- [Model Release] March, 2022: DiT - Self-supervised Document Image Transformer. Demos: Document Layout Analysis, Document Image Classification
- January, 2022: BEiT was accepted by ICLR 2022 as an Oral presentation (54 out of 3391).
- [Model Release] December 16th, 2021: TrOCR small models for handwritten and printed texts, with 3x inference speedup.
- November 24th, 2021: VLMo as the new SOTA on the VQA Challenge
- November, 2021: Multilingual translation at scale: 10000 language pairs and beyond
- [Model Release] November, 2021: MarkupLM - Pre-training for text and markup language (e.g. HTML/XML)
- [Model Release] November, 2021: VLMo - Unified vision-language pre-training w/ BEiT
- October, 2021: WavLM Large achieves state-of-the-art performance on the SUPERB benchmark
- [Model Release] October, 2021: WavLM - Large-scale self-supervised pre-trained models for speech.
- [Model Release] October 2021: TrOCR is on HuggingFace
- September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
- [Model Release] September, 2021: LayoutLM-cased models are on HuggingFace
- [Model Release] September, 2021: TrOCR - Transformer-based OCR w/ pre-trained BEiT and RoBERTa models.
- August 2021: LayoutLMv2 and LayoutXLM are on HuggingFace
- [Model Release] August, 2021: LayoutReader - Built with LayoutLM to improve general reading order detection.
- [Model Release] August, 2021: DeltaLM - Encoder-decoder pre-training for language generation and translation.
- August 2021: BEiT is on HuggingFace
- [Model Release] July, 2021: BEiT - Towards BERT moment for CV
- [Model Release] June, 2021: LayoutLMv2, LayoutXLM, MiniLMv2, and AdaLM.
- May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
- April, 2021: LayoutXLM is coming, extending LayoutLM with multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced, which includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
- March, 2021: InfoXLM was accepted by NAACL 2021.
- December 29th, 2020: LayoutLMv2 is coming with the new SOTA on a wide variety of document AI tasks, including the DocVQA and SROIE leaderboards.
- October 8th, 2020: T-ULRv2 (aka InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
- September, 2020: MiniLM was accepted by NeurIPS 2020.
- July 16, 2020: InfoXLM (Multilingual UniLM) arXiv
- June, 2020: UniLMv2 was accepted by ICML 2020; LayoutLM was accepted by KDD 2020.
- April 5, 2020: Multilingual MiniLM released!
- September, 2019: UniLMv1 was accepted by NeurIPS 2019.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.
Microsoft Open Source Code of Conduct
For help or issues using the pre-trained models, please submit a GitHub issue.
For other communications, please contact Furu Wei (fuwei@microsoft.com).