A compact sentence-transformer + LightGBM model that scores the relevance match between a code snippet and its docstring.
A Python corpus was used for training with a 70/30 train/validation split; a held-out test set of 88k examples was kept hidden.
F1-score was the primary evaluation metric, and AUC along with false positives/false negatives was monitored to guard against overfitting.
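The evaluation described above can be sketched with scikit-learn's standard metric functions; the labels and scores below are toy values for illustration only:

```python
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix

# Toy ground-truth labels and predicted relevance probabilities.
y_true = [1, 0, 1, 1, 0, 1]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.6, 0.8]

# Threshold probabilities at 0.5 to get hard predictions for F1.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

f1 = f1_score(y_true, y_pred)          # primary metric
auc = roc_auc_score(y_true, y_prob)    # monitored alongside F1

# Track false positives / false negatives from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

Monitoring AUC and the FP/FN counts together gives a fuller picture than F1 alone, since F1 depends on the chosen threshold.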
- Explored different embeddings but had to balance the trade-off between inference time and model performance.
- Used sentence-transformers to obtain better representations of code and docstrings.
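A minimal sketch of turning the two embeddings into a relevance feature. The real vectors would come from a sentence-transformers model (the model name in the comment is an assumption, not stated above); dummy vectors are used here so the sketch is self-contained:

```python
import numpy as np

# In the real pipeline the vectors would come from sentence-transformers, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
#   code_vec, doc_vec = model.encode([code, docstring])
# Dummy 384-dim vectors stand in for the embeddings here.
rng = np.random.default_rng(0)
code_vec, doc_vec = rng.normal(size=(2, 384))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One simple relevance feature: cosine similarity of code and docstring embeddings.
sim = cosine_similarity(code_vec, doc_vec)
```

Features like this similarity score (possibly alongside the raw embedding dimensions) can then be fed to the downstream classifier.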
Reasons for choosing LightGBM:
- Lightweight, with low memory consumption
- Feature interpretability
- Faster training time