
Pretrained Language Models for Source code



CodeTrans



CodeTrans provides state-of-the-art pre-trained models for source code. The models were trained on several Nvidia RTX 8000 GPUs and a couple of Google TPUs using various state-of-the-art Transformer models.

See our paper, "CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing", for more information about our work.


Figure: CodeTrans attention visualization.


This repository will be updated regularly with new pre-trained models for source code, to support the software engineering community in general and source code for Covid-19 research specifically.


⌛️  Models Availability

All original CodeTrans TensorFlow checkpoints can be downloaded from this Dropbox folder, and the PyTorch checkpoints are available in the Hugging Face model hub.

You can download all the datasets used in this research from this Dropbox folder.

🚀  Usage

How to use CodeTrans:

  • 🤖  Feature Extraction (FE):
    coming soon.

  • 💥  Fine Tuning (FT):
    coming soon.

  • ⚗️  Code Sequences Generation:
    coming soon.

  • 🧐  Visualization:
    coming soon.

  • 📈  Benchmark:
    coming soon.
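
Until the official examples for the items above are available, code sequence generation with the PyTorch checkpoints might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's reference workflow: the checkpoint name in `MODEL_ID` is an assumed example identifier that should be verified on the Hugging Face model hub, and `load_model` and `generate_description` are hypothetical helper names introduced here for illustration.

```python
# Assumed example checkpoint name on the Hugging Face model hub; verify it there.
MODEL_ID = "SEBIS/code_trans_t5_small_code_documentation_generation_python"


def load_model(model_id=MODEL_ID):
    """Download the tokenizer and seq2seq model (needs network access)."""
    # Imported lazily so generate_description can be used without this import.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_description(code, tokenizer, model, max_new_tokens=64):
    """Encode a code snippet, run generation, and decode the first output."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first call):
#   tokenizer, model = load_model()
#   print(generate_description("def add ( a , b ) : return a + b", tokenizer, model))
```

The tokenizer and model are passed in explicitly so that the generation step can be reused with any of the task-specific checkpoints. The individual model cards may expect a particular input pre-processing (for example, space-separated code tokens), so it is worth checking them before comparing outputs.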

📊  Expected Results

  • 💻  Function Documentation Generation (BLEU):
Language / Model Python Java Go PHP Ruby JavaScript
CodeTrans-ST-Small 17.31 16.65 16.89 23.05 9.19 13.7
CodeTrans-ST-Base 16.86 17.17 17.16 22.98 8.23 13.17
CodeTrans-TF-Small 19.93 19.48 18.88 25.35 13.15 17.23
CodeTrans-TF-Base 20.26 20.19 19.50 25.84 14.07 18.25
CodeTrans-TF-Large 20.35 20.06 19.54 26.18 14.94 18.98
CodeTrans-MT-Small 19.64 19.00 19.15 24.68 14.91 15.26
CodeTrans-MT-Base 20.39 21.22 19.43 26.23 15.26 16.11
CodeTrans-MT-Large 20.18 21.87 19.38 26.08 15.00 16.23
CodeTrans-MT-TF-Small 19.77 20.04 19.36 25.55 13.70 17.24
CodeTrans-MT-TF-Base 19.77 21.12 18.86 25.79 14.24 18.62
CodeTrans-MT-TF-Large 18.94 21.42 18.77 26.20 14.19 18.83
State of the art 19.06 17.65 18.07 25.16 12.16 14.90

  • 💻  Source Code Summarization (BLEU):
Language / Model Python SQL C#
CodeTrans-ST-Small 8.45 17.55 19.74
CodeTrans-ST-Base 9.12 15.00 18.65
CodeTrans-TF-Small 10.06 17.71 20.40
CodeTrans-TF-Base 10.94 17.66 21.12
CodeTrans-TF-Large 12.41 18.40 21.43
CodeTrans-MT-Small 13.11 19.15 22.39
CodeTrans-MT-Base 13.37 19.24 23.20
CodeTrans-MT-Large 13.24 19.40 23.57
CodeTrans-MT-TF-Small 12.10 18.25 22.03
CodeTrans-MT-TF-Base 10.64 16.91 21.40
CodeTrans-MT-TF-Large 12.14 19.98 21.10
State of the art -- 18.40 20.50

  • 💻  Code Comment Generation (BLEU):
Language / Model Java
CodeTrans-ST-Small 37.98
CodeTrans-ST-Base 38.07
CodeTrans-TF-Small 38.56
CodeTrans-TF-Base 39.06
CodeTrans-TF-Large 39.50
CodeTrans-MT-Small 20.15
CodeTrans-MT-Base 27.44
CodeTrans-MT-Large 34.69
CodeTrans-MT-TF-Small 38.37
CodeTrans-MT-TF-Base 38.90
CodeTrans-MT-TF-Large 39.25
State of the art 38.17

  • 💻  Commit Message Generation (BLEU):
Language / Model Java
CodeTrans-ST-Small 39.61
CodeTrans-ST-Base 38.67
CodeTrans-TF-Small 44.22
CodeTrans-TF-Base 44.17
CodeTrans-TF-Large 44.41
CodeTrans-MT-Small 36.17
CodeTrans-MT-Base 39.25
CodeTrans-MT-Large 41.18
CodeTrans-MT-TF-Small 43.96
CodeTrans-MT-TF-Base 44.19
CodeTrans-MT-TF-Large 44.34
State of the art 32.81

  • 💻  API Sequence Recommendation (BLEU):
Language / Model Java
CodeTrans-ST-Small 68.71
CodeTrans-ST-Base 70.45
CodeTrans-TF-Small 68.90
CodeTrans-TF-Base 72.11
CodeTrans-TF-Large 73.26
CodeTrans-MT-Small 58.43
CodeTrans-MT-Base 67.97
CodeTrans-MT-Large 72.29
CodeTrans-MT-TF-Small 69.29
CodeTrans-MT-TF-Base 72.89
CodeTrans-MT-TF-Large 73.39
State of the art 54.42

  • 💻  Programming Language and Synthesis (Accuracy):
Language / Model LISP
CodeTrans-ST-Small 89.43
CodeTrans-ST-Base 89.65
CodeTrans-TF-Small 90.30
CodeTrans-TF-Base 90.24
CodeTrans-TF-Large 90.21
CodeTrans-MT-Small 82.88
CodeTrans-MT-Base 86.99
CodeTrans-MT-Large 90.27
CodeTrans-MT-TF-Small 90.31
CodeTrans-MT-TF-Base 90.30
CodeTrans-MT-TF-Large 90.17
State of the art 85.80
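
To make the tables above easier to compare, the following sketch parses the rows of the Function Documentation Generation table (copied verbatim from above) and reports, per language, the best CodeTrans model and its margin over the "State of the art" row. The helper names `parse_rows` and `best_per_language` are hypothetical and only serve this illustration.

```python
# Rows copied from the Function Documentation Generation (BLEU) table above.
TABLE = """\
CodeTrans-ST-Small 17.31 16.65 16.89 23.05 9.19 13.7
CodeTrans-ST-Base 16.86 17.17 17.16 22.98 8.23 13.17
CodeTrans-TF-Small 19.93 19.48 18.88 25.35 13.15 17.23
CodeTrans-TF-Base 20.26 20.19 19.50 25.84 14.07 18.25
CodeTrans-TF-Large 20.35 20.06 19.54 26.18 14.94 18.98
CodeTrans-MT-Small 19.64 19.00 19.15 24.68 14.91 15.26
CodeTrans-MT-Base 20.39 21.22 19.43 26.23 15.26 16.11
CodeTrans-MT-Large 20.18 21.87 19.38 26.08 15.00 16.23
CodeTrans-MT-TF-Small 19.77 20.04 19.36 25.55 13.70 17.24
CodeTrans-MT-TF-Base 19.77 21.12 18.86 25.79 14.24 18.62
CodeTrans-MT-TF-Large 18.94 21.42 18.77 26.20 14.19 18.83
State of the art 19.06 17.65 18.07 25.16 12.16 14.90
"""
LANGUAGES = ["Python", "Java", "Go", "PHP", "Ruby", "JavaScript"]


def parse_rows(text, n_cols):
    """Map each row name to its scores; the last n_cols fields are numbers."""
    rows = {}
    for line in text.strip().splitlines():
        parts = line.split()
        name = " ".join(parts[:-n_cols])  # row names may contain spaces
        rows[name] = [float(value) for value in parts[-n_cols:]]
    return rows


def best_per_language(rows, languages, baseline="State of the art"):
    """Return {language: (best model, its score, margin over the baseline)}."""
    base = rows[baseline]
    result = {}
    for i, lang in enumerate(languages):
        candidates = ((m, s[i]) for m, s in rows.items() if m != baseline)
        model, score = max(candidates, key=lambda pair: pair[1])
        result[lang] = (model, score, round(score - base[i], 2))
    return result


# Example usage:
#   rows = parse_rows(TABLE, len(LANGUAGES))
#   for lang, (model, score, margin) in best_per_language(rows, LANGUAGES).items():
#       print(f"{lang}: {model} {score:.2f} (+{margin:.2f})")
```

The same two helpers work for the other tables if `TABLE` and `LANGUAGES` are swapped out, except that the Source Code Summarization table has a missing baseline value ("--") for Python, which would need separate handling.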

❤️  Community and Contributions

The CodeTrans project is an open source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.

📫  Have a question?

We are happy to hear your questions on the CodeTrans issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via our RostLab email.

🤝  Found a bug?

Feel free to file a new issue with an appropriate title and description on the CodeTrans repository. If you have already found a solution to your problem, we would love to review your pull request!

✅  Requirements

For prediction, the Text to Text library is needed. For source code feature extraction or fine-tuning our pre-trained models, PyTorch and the Transformers library from Hugging Face are needed. For model visualization, you need to install the BertViz library.
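
A quick way to see which of these requirements are already installed is sketched below. The module names are assumptions: `torch`, `transformers`, and `bertviz` are the usual import names for PyTorch, Transformers, and BertViz, and `t5` is assumed here to be the import name of the Text to Text library. The helper names `missing_modules` and `report` are hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the libraries listed above, grouped by use case.
REQUIREMENTS = {
    "prediction": ["t5"],
    "feature extraction / fine-tuning": ["torch", "transformers"],
    "visualization": ["bertviz"],
}


def missing_modules(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


def report(requirements=REQUIREMENTS):
    """Map each use case to the list of modules that still need installing."""
    return {task: missing_modules(mods) for task, mods in requirements.items()}


# Example usage:
#   for task, missing in report().items():
#       print(task, "->", "ok" if not missing else "missing: " + ", ".join(missing))
```

`find_spec` only checks whether a module can be located, so the check does not import heavy libraries such as PyTorch.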

🤵  Team

  • Technical University of Munich:
Ahmed Elnaggar, Wei Ding, Florian Matthes, Burkhard Rost
  • Google:
Llion Jones
  • Nvidia:
Tom Gibbs, Tamas Feher, Christoph Angerer

💰  Sponsors

Google, Nvidia, Software Campus

📘  License

The CodeTrans pretrained models are released under the terms of the MIT License.

✏️  Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@misc{elnaggar2021codetrans,
      title={CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing}, 
      author={Ahmed Elnaggar and Wei Ding and Llion Jones and Tom Gibbs and Tamas Feher and Christoph Angerer and Silvia Severini and Florian Matthes and Burkhard Rost},
      year={2021},
      eprint={2104.02443},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}