icse-2020

Meta-repository for our submission "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code". It contains links to the related repositories and artifacts, the code for building the artifact-demonstration Docker image, and the poster.


DOIs of the Artifacts

| Artifact | DOI |
| --- | --- |
| Java corpus | https://doi.org/10.7488/ds/1690 |
| C corpus | https://doi.org/10.5281/zenodo.3628775 |
| Python corpus | https://doi.org/10.5281/zenodo.3628784 |
| Java, pre-processed | https://doi.org/10.5281/zenodo.3628665 |
| C, pre-processed | https://doi.org/10.5281/zenodo.3628638 |
| Python, pre-processed | https://doi.org/10.5281/zenodo.3628636 |
| Trained models | https://doi.org/10.5281/zenodo.3628628 |
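Most of the artifacts above are hosted on Zenodo (the Java corpus is on Edinburgh DataShare and is not covered here). A small sketch of how the Zenodo DOIs can be mapped to record API URLs for scripted downloads; the `10.5281/zenodo.<id>` → `https://zenodo.org/api/records/<id>` mapping follows Zenodo's usual scheme but is an assumption, so verify it before relying on it:

```python
def zenodo_record_url(doi: str) -> str:
    """Map a Zenodo DOI such as '10.5281/zenodo.3628775' to its record API URL.

    Assumption: Zenodo-minted DOIs use the prefix '10.5281/zenodo.' followed
    by the numeric record id, which also identifies the record in the REST API.
    """
    prefix = "10.5281/zenodo."
    if not doi.startswith(prefix):
        raise ValueError(f"not a Zenodo DOI: {doi}")
    record_id = doi[len(prefix):]
    return f"https://zenodo.org/api/records/{record_id}"

# Example: the C corpus artifact from the table above.
print(zenodo_record_url("10.5281/zenodo.3628775"))
```

The returned URL serves JSON metadata for the record, including a file listing with direct download links.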

Code used to run experiments

Codeprep library (for vocabulary study): https://github.com/giganticode/codeprep

Open-vocabulary Neural LM: https://github.com/mast-group/OpenVocabCodeNLM

Paper

If you use the artifacts, please cite the paper:

@article{karampatsis2020big,
 title={Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code},
 author={Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea},
 journal={arXiv preprint arXiv:2003.07914},
 year={2020}
}