Pythia-AI-code-completion

Project reproducing the paper "Pythia: AI-assisted Code Completion System". The project was done for the course "Machine Learning for Software Engineering" at TU Delft.


Group 3: Code Completion

Paper replicated: Pythia: AI-assisted Code Completion System
Dataset: 150k Python Dataset

Dataset statistics

The Python150 dataset marks 100k files as training files and 50k files as evaluation/test files. After running a deduplication tool over the original files, 84728 training files and 42372 test files remained.

In the preprocessing phase, files that couldn't be parsed to an AST (e.g. because they were written for a Python version that was too old) were removed, reducing the training set to 75183 ASTs. From these ASTs a token vocabulary is built using a frequency threshold of 20: tokens that occur more than 20 times in the training set are added to the vocabulary. This resulted in a vocabulary of size 43853.
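The parse-and-filter step plus the threshold-based vocabulary can be sketched with Python's standard ast module. Note this is a simplification: counting node-type names and identifiers stands in for the project's actual token extraction, which may differ.

```python
import ast
from collections import Counter

def build_vocab(sources, threshold=20):
    """Parse each source file, dropping files that fail to parse, and keep
    tokens that occur more than `threshold` times in the training set.
    Counting node-type names and identifiers is a simplification of the
    project's actual token extraction."""
    counts = Counter()
    parsed = 0
    for src in sources:
        try:
            tree = ast.parse(src)  # unparseable files are removed
        except SyntaxError:
            continue
        parsed += 1
        for node in ast.walk(tree):
            counts[type(node).__name__] += 1
            if isinstance(node, ast.Name):
                counts[node.id] += 1
    vocab = {tok for tok, n in counts.items() if n > threshold}
    return vocab, parsed

# 25 identical parseable files plus one broken file
vocab, parsed = build_vocab(["x = 1\n"] * 25 + ["def ("])
print(parsed)        # 25 of the 26 sources parse
print("x" in vocab)  # True: "x" occurs 25 > 20 times
```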

Experiments

Shared parameters:

batch size: 64
embed dimension: 150
hidden dimension: 500
num LSTM layers: 2
lookback tokens: 100
norm clipping: 10
initial learning rate: 2e-3
learning rate schedule: decay of 0.97 every epoch
epochs: 15
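The shared setup can be collected in one place, with the decay schedule written out; the key names below are illustrative, not the project's actual configuration format.

```python
# Shared hyperparameters of all experiments; key names are illustrative.
CONFIG = {
    "batch_size": 64,
    "embed_dim": 150,
    "hidden_dim": 500,
    "num_lstm_layers": 2,
    "lookback_tokens": 100,
    "norm_clip": 10.0,
    "initial_lr": 2e-3,
    "lr_decay": 0.97,  # applied once per epoch
    "epochs": 15,
}

def learning_rate(epoch):
    """Learning rate after `epoch` decay steps: initial_lr * decay^epoch."""
    return CONFIG["initial_lr"] * CONFIG["lr_decay"] ** epoch

print(learning_rate(0))  # 0.002
print(learning_rate(1))  # 0.002 * 0.97
```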

Deployed code: experiments release

Data used:
Training set (1970000 items): download here
Validation set (227255 items): download here
Evaluation set (911213 items): download here
Vocabulary (size: 43853): download here

Experiment 1 | Regularization - L2 regularizer

L2 regularization with a weight of 1e-6 (also done here).

                Top-1 accuracy  Top-5 accuracy
Validation set  46.61%          71.67%
Evaluation set  47.89%          69.76%

plot

Resulting model: final_model_experiment_1

Experiment 2 | Regularization - Dropout

Dropout parameter of 0.8 (based on Pythia).

                Top-1 accuracy  Top-5 accuracy
Validation set  38.53%          63.31%
Evaluation set  39.37%          61.15%

plot

Resulting model: final_model_experiment_2

Experiment 3 | No regularization

No L2, dropout, or weighted loss.

                Top-1 accuracy  Top-5 accuracy
Validation set  43.03%          67.37%
Evaluation set  46.63%          68.67%

plot

Resulting model: final_model_experiment_3

Experiment 4 | Regularization - Weighted loss + L2

Includes a weighted loss plus L2 regularization (1e-6).
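The two pieces of this experiment's loss can be sketched in plain Python: a per-class weighted negative log-likelihood (the project's exact class-weighting scheme is not specified here, so the weights below are a placeholder) plus an L2 penalty with weight 1e-6.

```python
import math

def weighted_nll(probs, target, class_weights):
    """Negative log-likelihood of the target token, scaled by a per-class
    weight (e.g. to up-weight rare tokens; the exact weighting scheme used
    by the project is an assumption)."""
    return -class_weights[target] * math.log(probs[target])

def l2_penalty(weights, lam=1e-6):
    """L2 regularization term added to the loss."""
    return lam * sum(w * w for w in weights)

# toy example: 2-token vocabulary, target token 0, rare class up-weighted
loss = weighted_nll([0.25, 0.75], 0, [1.0, 2.0]) + l2_penalty([3.0, 4.0])
```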

                Top-1 accuracy  Top-5 accuracy
Validation set  40.53%          63.62%
Evaluation set  41.31%          60.84%

plot

Resulting model: final_model_experiment_4

Experiment 5 | Bi-directional LSTM

Uses a bi-directional LSTM instead of a uni-directional one. Also includes an L2 regularizer (1e-6).
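The change is the direction of the recurrence: each position in the 100-token lookback window gets both a forward and a backward hidden state. A toy sketch with a stand-in step function (a real implementation would use an LSTM cell and concatenate the two state vectors):

```python
def bidirectional(seq, step, h0):
    """Run a recurrent `step` over the sequence in both directions and
    pair the hidden states per position. `step` stands in for an LSTM
    cell; any (state, input) -> state function works for illustration."""
    forward, h = [], h0
    for x in seq:
        h = step(h, x)
        forward.append(h)
    backward, h = [], h0
    for x in reversed(seq):
        h = step(h, x)
        backward.append(h)
    backward.reverse()
    # a real model would concatenate the two states per position
    return list(zip(forward, backward))

states = bidirectional([1, 2, 3], step=lambda h, x: h + x, h0=0)
print(states)  # [(1, 6), (3, 5), (6, 3)]
```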

                Top-1 accuracy  Top-5 accuracy
Validation set  48.60%          71.49%
Evaluation set  49.87%          70.11%

plot

Resulting model: final_model_experiment_5

Experiment 6 | Attention

Uses an attention mechanism. Also includes an L2 regularizer.
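A minimal sketch of dot-product attention over the LSTM's per-token hidden states, using the last state as the query. The project's exact attention variant (scoring function, learned projections) may differ.

```python
import math

def attend(hidden_states, query):
    """Score each hidden state against the query, softmax the scores,
    and return the attention-weighted context vector."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    peak = max(scores)  # subtract the max to stabilize the softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(len(query))]
    return context, weights

# toy example: two 2-dim hidden states, query aligned with the first
context, weights = attend([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
```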

                Top-1 accuracy  Top-5 accuracy
Validation set  51.10%          73.95%
Evaluation set  52.90%          72.86%

plot

Resulting model: final_model_experiment_6

Experiment 7 | Attention

Uses an attention mechanism. Also includes an L2 regularizer (1e-6) and a lower dropout (0.4).

                Top-1 accuracy  Top-5 accuracy
Validation set  51.51%          74.17%
Evaluation set  53.70%          73.22%

plot

Resulting model: final_model_experiment_7

Experiment 8 | Regularization - Low dropout

Includes an L2 regularizer (1e-6) and a lower dropout (0.4).

                Top-1 accuracy  Top-5 accuracy
Validation set  47.31%          71.89%
Evaluation set  48.56%          70.71%

plot

Resulting model: final_model_experiment_8

Experiment 9 | Alternative attention

Includes an L2 regularizer (1e-6) and a lower dropout (0.4).

                Top-1 accuracy  Top-5 accuracy
Validation set  50.67%          73.38%
Evaluation set  52.69%          72.36%

plot

Resulting model: final_model_experiment_9

Experiment 10 | Final attention model

Includes an L2 regularizer (1e-6) and a lower dropout (0.4). Runs for 30 epochs instead of the shared 15.

                Top-1 accuracy  Top-5 accuracy
Validation set  52.20%          75.12%
Evaluation set  54.80%          74.51%

plot

Resulting model: final_model_experiment_10

Experiment 11 | Best regularizer

Includes an L2 regularizer (1e-6) and a lower dropout (0.4).

                Top-1 accuracy  Top-5 accuracy
Validation set  47.26%          71.96%
Evaluation set  48.51%          70.60%

plot

Resulting model: final_model_experiment_11