Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

Code Release Here

Code is rehosted as part of the i-code project

Open Source Checklist:

  • Release Model (Encoder + Text decoder)
  • Release Most Scripts
  • Vision Decoder / Weights (Because of ethical concerns about fake document generation, we plan to release this functionality as an Azure API)
  • Demos

Introduction

UDOP unifies vision, text, and layout through a vision-text-layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).
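To make the prompt/target framing concrete, here is a minimal sketch of how a layout-modeling-style example could be turned into a text prompt and a text target. It is illustrative only and is not taken from the released code: the `Word` class, the `box_to_tokens` and `layout_modeling_example` helpers, the `<layout_0>` and `<loc_N>` token spellings, the `"Layout Modeling."` prefix, and the choice of 500 quantization bins are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its pixel bounding box (x0, y0, x1, y1). Hypothetical type."""
    text: str
    box: tuple[int, int, int, int]


def box_to_tokens(box, page_w, page_h, bins=500):
    """Quantize a pixel box into discrete location tokens (assumed token format)."""
    x0, y0, x1, y1 = box
    coords = [(x0, page_w), (y0, page_h), (x1, page_w), (y1, page_h)]
    # Integer arithmetic keeps the quantization deterministic.
    ids = [min(v * bins // extent, bins - 1) for v, extent in coords]
    return "".join(f"<loc_{i}>" for i in ids)


def layout_modeling_example(words, start, end, page_w, page_h,
                            task_prefix="Layout Modeling."):
    """Build a (prompt, target) pair asking for the box of words[start:end]."""
    span = words[start:end]
    if not span:
        raise ValueError("the selected span must contain at least one word")
    # The target box is the union of the boxes of the selected words.
    union = (
        min(w.box[0] for w in span),
        min(w.box[1] for w in span),
        max(w.box[2] for w in span),
        max(w.box[3] for w in span),
    )
    # Wrap the selected span in sentinel tokens inside the prompt text.
    pieces = (
        [w.text for w in words[:start]]
        + ["<layout_0>"] + [w.text for w in span] + ["</layout_0>"]
        + [w.text for w in words[end:]]
    )
    prompt = f"{task_prefix} " + " ".join(pieces)
    target = "<layout_0>" + box_to_tokens(union, page_w, page_h)
    return prompt, target


if __name__ == "__main__":
    words = [
        Word("Ship", (100, 200, 200, 250)),
        Word("Date", (220, 200, 300, 250)),
        Word("1994", (500, 200, 600, 250)),
    ]
    prompt, target = layout_modeling_example(words, 0, 2, page_w=1000, page_h=1000)
    print(prompt)
    print(target)
```

The point of the sketch is that both the input and the output are plain token sequences, so one sequence-to-sequence model can handle layout prediction the same way it handles text generation. The other objectives would differ only in the task prefix and in what the target sequence contains.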