Paper replicating: Pythia: AI-assisted Code Completion System
Dataset: 150k Python Dataset
Python150 marks 100k files as training files and 50k files as evaluation/test files. The original files were deduplicated with a deduplication tool, leaving 84728 training files and 42372 test files.
In the preprocessing phase, files that could not be parsed into an AST (e.g. because the Python version was too old) were removed, reducing the training set to 75183 ASTs. From these ASTs a token vocabulary was built using a threshold of 20: only tokens that occur more than 20 times in the training set are added to the vocabulary. This resulted in a vocabulary of size 43853.
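A minimal sketch of how such a frequency-thresholded vocabulary could be built (the special tokens and the `token_lists` input are assumptions for illustration, not taken from the replication code):

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"   # special tokens (assumed, not from the replication)
THRESHOLD = 20                # minimum occurrence count in the training set

def build_vocab(token_lists, threshold=THRESHOLD):
    """token_lists: iterable of token sequences, one per training AST."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    vocab = {PAD: 0, UNK: 1}
    for token, count in counts.most_common():
        if count > threshold:          # "more than 20 times", per the description above
            vocab[token] = len(vocab)
    return vocab
```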
Shared parameters:
batch size: 64
embed dimension: 150
hidden dimension: 500
num LSTM layers: 2
lookback tokens: 100
norm clipping: 10
initial learning rate: 2e-3
learning rate schedule: decay of 0.97 every epoch
epochs: 15
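The replication's own code is linked below; purely as an illustration, this is roughly how the shared parameters could be wired up in PyTorch (the framework choice and the class name `NextTokenLSTM` are assumptions):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 43853   # vocabulary size from preprocessing
EMBED_DIM  = 150
HIDDEN_DIM = 500
NUM_LAYERS = 2
LOOKBACK   = 100     # number of preceding tokens fed to the model
BATCH_SIZE = 64

class NextTokenLSTM(nn.Module):      # placeholder name
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM,
                            num_layers=NUM_LAYERS, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tokens):                  # tokens: (batch, LOOKBACK)
        states, _ = self.lstm(self.embed(tokens))
        return self.out(states[:, -1, :])       # logits for the next token

model = NextTokenLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)  # 0.97 decay per epoch

# during training, gradients are clipped to the norm-clipping value:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
```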
Deployed code: experiments release
Data used:
Training set (1970000 items): download here
Validation set (227255 items): download here
Evaluation set (911213 items): download here
Vocabulary (size: 43853): download here
L2 parameter of 1e-6 (also done here); a sketch of how this penalty could be applied follows the results below.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 46.61% | 71.67% |
| Evaluation set | 47.89% | 69.76% |
Resulting model: final_model_experiment_1
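Continuing the PyTorch sketch above (where `model`, `logits`, and `targets` come from the model and training loop), the 1e-6 L2 penalty could be applied either via the optimizer's weight decay or as an explicit term in the loss; which of the two the replication uses is not stated here.

```python
import torch
import torch.nn.functional as F

# Option A: weight decay in the optimizer (torch.optim.Adam adds weight_decay * p
# to each parameter's gradient, which is exactly an L2 penalty gradient)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3, weight_decay=1e-6)

# Option B: an explicit penalty term added to the cross-entropy loss
l2_term = sum(p.pow(2).sum() for p in model.parameters())
loss = F.cross_entropy(logits, targets) + 1e-6 * l2_term
```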
Dropout parameter of 0.8 (based on Pythia); a sketch follows the results below.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 38.53% | 63.31% |
| Evaluation set | 39.37% | 61.15% |
Resulting model: final_model_experiment_2
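Purely as an illustration against the PyTorch sketch above, dropout of 0.8 could be applied between the stacked LSTM layers and to the final state before the output projection (PyTorch's `dropout`/`p` arguments are drop probabilities; how Pythia interprets 0.8 is not restated here):

```python
import torch.nn as nn

EMBED_DIM, HIDDEN_DIM, NUM_LAYERS = 150, 500, 2   # shared parameters from above

# dropout of 0.8 applied between the two stacked LSTM layers ...
lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=NUM_LAYERS,
               batch_first=True, dropout=0.8)
# ... and to the final LSTM state before the output projection
drop = nn.Dropout(p=0.8)
```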
No L2, dropout or weighted loss.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 43.03% | 67.37% |
| Evaluation set | 46.63% | 68.67% |
Resulting model: final_model_experiment_3
Includes a weighted loss + L2 (1e-6); a sketch of one possible weighting scheme follows the results below.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 40.53% | 63.62% |
| Evaluation set | 41.31% | 60.84% |
Resulting model: final_model_experiment_4
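The weighting scheme is not described in this summary; one common choice (an assumption here, not the replication's documented scheme) is to weight each vocabulary entry inversely to its training-set frequency:

```python
import torch
import torch.nn as nn

# token_counts: (VOCAB_SIZE,) tensor of training-set frequencies per vocabulary entry
# (assumed to be available from the vocabulary-building step above)
weights = 1.0 / token_counts.clamp(min=1).float()
weights = weights * (len(weights) / weights.sum())   # normalise so the average weight is 1
criterion = nn.CrossEntropyLoss(weight=weights)

# in the training loop: loss = criterion(logits, targets)
```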
Using a bi-directional LSTM instead of a uni-directional one (a sketch follows the results below). Also includes an L2 regularizer (1e-6).
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 48.60% | 71.49% |
| Evaluation set | 49.87% | 70.11% |
Resulting model: final_model_experiment_5
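In the PyTorch sketch, switching to a bi-directional encoder over the 100-token lookback window is a one-argument change, but the output projection has to grow to take both directions (illustrative only, not the replication code):

```python
import torch.nn as nn

EMBED_DIM, HIDDEN_DIM, NUM_LAYERS, VOCAB_SIZE = 150, 500, 2, 43853

# bidirectional encoder over the lookback tokens
lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=NUM_LAYERS,
               batch_first=True, bidirectional=True)
# each time step now carries forward + backward states, so the projection doubles
out = nn.Linear(2 * HIDDEN_DIM, VOCAB_SIZE)
```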
Using an attention mechanism (a sketch follows the results below). Also includes an L2 regularizer.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 51.10% | 73.95% |
| Evaluation set | 52.90% | 72.86% |
Resulting model: final_model_experiment_6
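The attention mechanism itself is not detailed in this summary; below is a minimal sketch of dot-product attention over the LSTM outputs of the lookback window, using the final state as the query (layer and variable names are placeholders, not the replication's):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores every lookback position against the final LSTM state and
    returns an attention-weighted context vector (illustrative sketch)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, states):                    # states: (batch, seq_len, hidden_dim)
        query = states[:, -1, :]                  # final time step as the query
        keys = self.proj(states)                  # (batch, seq_len, hidden_dim)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)      # (batch, seq_len)
        alpha = torch.softmax(scores, dim=1)                         # attention weights
        context = torch.bmm(alpha.unsqueeze(1), states).squeeze(1)   # (batch, hidden_dim)
        return context

# in the model, `context` (optionally concatenated with the final state)
# replaces the last hidden state as input to the output projection.
```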
Using an attention mechanism. Also includes an L2 regularizer (1e-6) and a (lower) dropout (0.4).
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 51.51% | 74.17% |
| Evaluation set | 53.70% | 73.22% |
Resulting model: final_model_experiment_7
Includes an L2 regularizer (1e-6) and a (lower) dropout (0.4).
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 47.31% | 71.89% |
| Evaluation set | 48.56% | 70.71% |
Resulting model: final_model_experiment_8
Includes an L2 regularizer (1e-6) and a (lower) dropout (0.4).
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 50.67% | 73.38% |
| Evaluation set | 52.69% | 72.36% |
Resulting model: final_model_experiment_9
Includes an L2 regularizer (1e-6) and a (lower) dropout (0.4). Runs for 30 epochs.
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 52.20% | 75.12% |
| Evaluation set | 54.80% | 74.51% |
Resulting model: final_model_experiment_10
Includes an L2 regularizer (1e-6) and a (lower) dropout (0.4).
| | Top-1 accuracy | Top-5 accuracy |
|---|---|---|
| Validation set | 47.26% | 71.96% |
| Evaluation set | 48.51% | 70.60% |
Resulting model: final_model_experiment_11