AutoCoder


A simple tool for code auto-completion, fine-tuned from the pre-trained PyTorch GPT-2 variants offered by the awesome 🤗 transformers library.

Demo

(demo GIF)

Features

  • Code auto-completion for both Python and Java.

Blog post about this project

Quick Start

This section provides three ways to get started quickly.

Load from 🤗transformers models

Two fine-tuned models have been uploaded to the 🤗 transformers model hub. They can be used directly as long as transformers is installed (pip install transformers):

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
model = AutoModelWithLMHead.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
# or
# tokenizer = AutoTokenizer.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")
# model = AutoModelWithLMHead.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")

use_cuda = True
context = "def factorial"
lang = "python"  # can be "java" as well

if use_cuda:
    model.to("cuda")

# Prepend the language control token the models were fine-tuned with.
prefix = "<python> " if lang == "python" else "<java> "
input_ids = tokenizer.encode(prefix + context, return_tensors="pt")

outputs = model.generate(input_ids=input_ids.to("cuda") if use_cuda else input_ids,
                         max_length=128,
                         do_sample=True,  # sampling is needed for temperature to take effect
                         temperature=0.7,
                         num_return_sequences=1)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
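
For repeated use, the snippet above can be wrapped in a small helper that returns several sampled candidates at once. This is only an illustrative sketch, not part of the repository: the function name complete is made up here, and it assumes model, tokenizer, and use_cuda are defined as in the snippet above.

def complete(context, lang="python", num_candidates=3, max_length=128):
    # Prepend the language control token the models were fine-tuned with.
    prefix = "<python> " if lang == "python" else "<java> "
    input_ids = tokenizer.encode(prefix + context, return_tensors="pt")
    if use_cuda:
        input_ids = input_ids.to("cuda")
    outputs = model.generate(input_ids=input_ids,
                             max_length=max_length,
                             do_sample=True,  # sampling yields distinct candidates
                             temperature=0.7,
                             num_return_sequences=num_candidates)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

for candidate in complete("def quick_sort(collection)"):
    print(candidate)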

Ready-to-go Interaction

git clone https://github.com/wangcongcong123/auto_coding.git
pip install -r requirements.txt
  1. Download the fine-tuned models; two versions are provided.
  2. Unzip the model and move it to ./model (create this directory first).
  3. Run the interactive script: python interact.py (see the sketch below for the general idea).
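
For a feel of what the interaction does, the loop below is a rough sketch of the idea, assuming the model and tokenizer are loaded as in the Quick Start above; the actual interact.py in this repository may differ in options and behavior.

# Illustrative interactive loop (see interact.py for the real implementation).
while True:
    context = input("Context code (blank to quit): ")
    if not context.strip():
        break
    lang = input("Language [python/java]: ").strip() or "python"
    prefix = "<python> " if lang == "python" else "<java> "
    input_ids = tokenizer.encode(prefix + context, return_tensors="pt")
    outputs = model.generate(input_ids=input_ids, max_length=128,
                             do_sample=True, temperature=0.7)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))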

Fine-tuning your own

git clone <this repository>
pip install -r requirements.txt
  1. Prepare the dataset (see the sketch after this list for the general idea).
  2. Start fine-tuning the model: python train.py --model_select distilgpt2
  3. After fine-tuning, the model is saved to ./model/distilgpt2_fine_tuned_coder/0_GPTSingleHead, which is exactly the fine-tuned version provided in Ready-to-go Interaction.

* For more training parameter options, run python train.py -h
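
Conceptually, each training example pairs a language control token (<python> or <java>, the same prefixes used at inference time) with a source snippet. The sketch below illustrates that idea only; the file layout and field names are made up here, and the actual dataset format expected by train.py is defined in this repository.

import json
from pathlib import Path

LANG_TOKENS = {"python": "<python>", "java": "<java>"}

def build_examples(source_dir, lang):
    # Prefix every source file with its language control token so the
    # model learns to condition generation on the language.
    token = LANG_TOKENS[lang]
    suffix = ".py" if lang == "python" else ".java"
    for path in Path(source_dir).rglob("*" + suffix):
        code = path.read_text(encoding="utf-8", errors="ignore")
        yield {"text": token + " " + code}

# Hypothetical output file; train.py defines the real expected format.
with open("train_examples.jsonl", "w") as f:
    for example in build_examples("./raw_python_code", "python"):
        f.write(json.dumps(example) + "\n")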

Generative examples

Good Python generation examples by the fine-tuned GPT2-medium

-------------Example 1--------------------------------------
Context code: def factorial
Generated: 
def factorial(n):
    return 1 if n == 1 else n * factorial(n - 1)

--------------Example 2-------------------------------------
Context code: def sum_of_series(num_of_terms:int, power: int)
Generated:
def sum_of_series(num_of_terms:int, power: int) -> float:
    """Calculate Sum of Series.
    >>> sum_of_series(1, 1, 10)
    55.0
    >>> sum_of_series(1, 10, 100)
    49600.0
    """
    sum = (num_of_terms / 2) * power
    return sum

--------------Example 3-------------------------------------
Context code: def quick_sort(collection)
Generated:
def quick_sort(collection):
    """
    >>> quick_sort([0, 5, 3, 2, 2])
    [0, 2, 2, 3, 5]
    >>> quick_sort([])
    []
    >>> quick_sort([-2, -5, -45])
    [-45, -5, -2]
    """
    length = len(collection)
    if length <= 1:
        return collection
    else:
        # Use the last element as the first pivot
        pivot = collection.pop()
        # Put elements greater than pivot in greater list
        # Put elements lesser than pivot in lesser list
        greater, lesser = [], []
        for element in collection:
            if element > pivot:
                ...

Good Java generation examples by the fine-tuned GPT2-medium

--------------Example 1-------------------------------------
Context code: Scanner input= new Scanner(System.in)
Generated:
Scanner input= new Scanner(System.in)
System.out.println("Enter the Octal number: ");
// Take octal number as input from user in a string
String oct = input.next();

--------------Example 2-------------------------------------
Context code: private static int CountCharacters(String str)
Generated:
private static int CountCharacters(String str) {
        return str.replaceAll("\\s", "").length();
}

* Although some generated examples look good, take them with a grain of salt when judging the model's actual performance. The model may simply have memorized existing code from the training set.

TODO list

  • Expand the dataset (and construct it more carefully) and increase the context window. Try larger generative models like GPT-2 large or even the recently proposed GPT-3 variants, if computational resources allow.
  • Remove overlap between training examples and dev examples for contamination studies, that is, to measure to what extent the model memorizes examples rigidly or via surface heuristics during training (a first sketch of such an overlap check follows this list).
  • Try some adversarial examples (more complicated ones, to probe the model's reasoning capability) to test the robustness of the model.
  • Integrate this into real-life use cases such as a code editor (e.g., Sublime Text), where a threshold on joint probability may need to be studied for code snippet recommendations (a second sketch of such scoring follows this list).
  • Try some ideas of location-aware code generation. For example, if a human coder is writing a comment, the auto-coder should be aware of the coder's context (left and right, if available) to help complete the corresponding content.
  • Model size and inference efficiency are a problem in real-life use cases.
  • Survey this problem domain to get a general idea of what work has been done in the literature for this particular problem.
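
On the overlap point above, a simple first pass is to measure n-gram overlap between each dev example and the training set, and drop dev examples that overlap too heavily. A minimal sketch; the n-gram size and threshold are illustrative and would need tuning:

def ngrams(text, n=8):
    # Whitespace-token n-grams of a code snippet.
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(dev_example, train_example, n=8):
    dev_grams = ngrams(dev_example, n)
    if not dev_grams:
        return 0.0
    return len(dev_grams & ngrams(train_example, n)) / len(dev_grams)

THRESHOLD = 0.5  # illustrative cutoff

def filter_dev(dev_set, train_set):
    # Quadratic scan; fine for small sets, index the n-grams for large ones.
    return [d for d in dev_set
            if all(overlap_ratio(d, t) < THRESHOLD for t in train_set)]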
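
For the editor-integration point, one way to study a recommendation threshold is to score each candidate completion by its average token log-probability under the model and only surface candidates above a tuned cutoff. A rough sketch, assuming the model and tokenizer from the Quick Start; the threshold value is purely illustrative:

import torch

def mean_log_prob(model, tokenizer, text):
    # Average per-token log-probability of `text` under the language model.
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(ids, labels=ids)
    return -outputs[0].item()  # outputs[0] is the mean cross-entropy loss

THRESHOLD = -2.0  # illustrative; would have to be studied empirically

def should_recommend(model, tokenizer, completion):
    return mean_log_prob(model, tokenizer, completion) > THRESHOLD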

Extra notes

  • Multi-GPU training only works with torch==1.4.0; it does not work with torch==1.5.0. No fix for this issue is known so far.