A semantic code search implementation built with the TensorFlow framework and the source code data from the CodeSearchNet project. The model training pipeline is based on the implementation in the CodeSearchNet repository. The Python, Java, Go, PHP, JavaScript, and Ruby programming languages are supported.
A BPE tokenizer is used to encode both code strings and query strings (docstrings are used as queries in training). Code strings are encoded and padded to a length of 200 tokens; query strings are encoded and padded to a length of 30 tokens. The code embedding size and the query embedding size are both 256. Token embeddings are masked, and an unweighted mean is then taken to get a 256-dimensional vector for each code string and each query string. Cosine similarity is calculated between the code representations and the query representations. Further details can be found on the WANDB run.
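The encoding steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's TensorFlow code: the vocabulary size, the padding id, the random embedding tables, the helper names, and the truncation of over-long inputs are all assumptions made for the example, and the mask is assumed to mark padding positions. Token ids stand in for the output of the BPE tokenizer.

```python
import numpy as np

CODE_LEN, QUERY_LEN, EMBED_DIM = 200, 30, 256  # sizes stated in the description
VOCAB_SIZE, PAD_ID = 1000, 0  # hypothetical values, for illustration only


def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return fixed-length ids and a mask (1 = real token, 0 = padding).

    Truncating over-long inputs is an assumption of this sketch.
    """
    ids = list(token_ids)[:max_len]
    mask = [1.0] * len(ids) + [0.0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return np.array(ids), np.array(mask)


def masked_mean(token_embeddings, mask):
    """Unweighted mean over unmasked tokens.

    token_embeddings: (batch, seq_len, dim); mask: (batch, seq_len).
    """
    mask = mask[..., None]
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid division by zero
    return summed / counts


def encode(batch_token_ids, embedding_table, max_len):
    """Look up token embeddings and pool them into one vector per string."""
    pairs = [pad_or_truncate(ids, max_len) for ids in batch_token_ids]
    ids = np.stack([p[0] for p in pairs])
    mask = np.stack([p[1] for p in pairs])
    return masked_mean(embedding_table[ids], mask)


def cosine_similarity_matrix(queries, codes, eps=1e-8):
    """Entry (i, j) is the cosine similarity of query i and code j."""
    q = queries / np.maximum(np.linalg.norm(queries, axis=1, keepdims=True), eps)
    c = codes / np.maximum(np.linalg.norm(codes, axis=1, keepdims=True), eps)
    return q @ c.T


rng = np.random.default_rng(0)
code_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))   # stand-in for learned weights
query_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))  # stand-in for learned weights

code_vecs = encode([[5, 17, 42, 8], [7, 7, 3]], code_table, CODE_LEN)
query_vecs = encode([[11, 4], [9, 2, 6]], query_table, QUERY_LEN)
sims = cosine_similarity_matrix(query_vecs, code_vecs)
print(code_vecs.shape, query_vecs.shape, sims.shape)
```

Because the mean divides by the number of unmasked tokens rather than the padded length, short snippets and long snippets produce vectors on a comparable scale.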
- Deep Structured Semantic Model
- Wide & Deep Learning
Python package with scripts to prepare the data, train/test the model and predict.
We use the data from the CodeSearchNet project. The downloaded data is around 20GB. For more details, please follow this link.
To install the required dependencies:
pip3 install -r requirements.txt
The data preparation step is separated from the training step because of its computing time and memory consumption.
Start the training
python3 -m train --model neuralbow_v1
The model is trained separately for each language. The evaluation metric for the validation and test sets is MRR (mean reciprocal rank); the prediction output, however, is evaluated by GitHub using nDCG.
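As a rough illustration of the MRR metric, the sketch below computes it from a matrix of similarity scores. It assumes that the correct code snippet for query `i` sits at column `i` of the score matrix and that ties count against the correct item; the function name is hypothetical and is not taken from this package.

```python
def mean_reciprocal_rank(scores):
    """Compute MRR for a square score matrix.

    scores[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be at index i.
    """
    reciprocal_ranks = []
    for i, row in enumerate(scores):
        correct = row[i]
        # Rank = number of candidates scoring at least as high as the
        # correct one (the correct candidate counts itself, so rank >= 1).
        rank = sum(1 for s in row if s >= correct)
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


example_scores = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.5, 0.3],  # correct candidate ranked 2nd
    [0.1, 0.7, 0.6],  # correct candidate ranked 2nd
]
mrr = mean_reciprocal_rank(example_scores)
print(round(mrr, 4))  # (1 + 1/2 + 1/2) / 3
```

An MRR of 1.0 means every query ranks its own snippet first; values closer to 0 mean the correct snippet tends to appear far down the ranking.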
Predict
python3 predict.py -r wuchen/SemanticCodeSearch/1fpfl6dq
- Requirements: Flask
- Import source code file
- Running the dev server
python3 -m server.main
cd react-code-search
npm install
npm start
Please cite as:
@article{clement2021distilling,
title={Distilling Transformers for Neural Cross-Domain Search},
author={Clement, Colin B and Wu, Chen and Drain, Dawn and Sundaresan, Neel},
journal={arXiv preprint arXiv:2108.03322},
year={2021}
}
@article{wu2022learning,
title={Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus},
author={Wu, Chen and Yan, Ming},
journal={arXiv preprint arXiv:2201.11313},
year={2022}
}