Optimizing Deeper Transformers on Small Datasets https://arxiv.org/abs/2012.15355
Primary Language: Python