bayesgroup/code_transformers

OOV handle

yingweima2022 opened this issue · 1 comments

Greetings!
Congrats on the great work!
Does the anonymization apply only to user-defined variable names, or are API names anonymized as well? If they are, does that change the semantics of the code?

Hi! In our implementation, code is represented as an AST in which every node stores a type (a syntactic unit of the programming language) and some nodes also store a value (user-defined variable names, function names, API names, etc.). We anonymize all out-of-vocabulary values, so APIs that fall outside the vocabulary are anonymized too. Learning good embeddings, and thus semantics, for rare APIs is problematic in any case. Compared with the baseline that replaces every rare API with an UNK placeholder, our approach lets repeated uses of the same API (or variable name) within one code snippet be recognized as the same entity. Another possible approach is byte-pair encoding, which helps preserve semantics but has other drawbacks, discussed in the Related Work section of the paper (https://arxiv.org/abs/2010.12663).
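
To make the contrast with the UNK baseline concrete, here is a minimal sketch in Python. It is not the repository's code: the function names, the `ANON` placeholder prefix, and the flat list of node values are assumptions made purely for illustration. In-vocabulary values are kept, while each distinct out-of-vocabulary value gets its own placeholder that is reused within the snippet.

```python
from typing import Dict, List, Optional, Set


def anonymize_oov(values: List[Optional[str]], vocab: Set[str],
                  prefix: str = "ANON") -> List[Optional[str]]:
    """Replace each distinct OOV value with a per-snippet placeholder.

    Hypothetical helper: `prefix` and the numbering scheme are assumptions.
    `None` stands for an AST node that stores a type but no value.
    """
    mapping: Dict[str, str] = {}  # OOV value -> placeholder, local to one snippet
    result: List[Optional[str]] = []
    for value in values:
        if value is None or value in vocab:
            result.append(value)  # keep in-vocabulary values unchanged
            continue
        if value not in mapping:
            mapping[value] = f"{prefix}{len(mapping) + 1}"
        result.append(mapping[value])  # same value -> same placeholder
    return result


def unk_baseline(values: List[Optional[str]], vocab: Set[str],
                 unk: str = "UNK") -> List[Optional[str]]:
    """Baseline: collapse every OOV value into one shared placeholder."""
    return [v if v is None or v in vocab else unk for v in values]


# Values of AST nodes in traversal order for one made-up snippet.
node_values = ["np", "my_rare_api", "x", None, "my_rare_api", "x"]
vocabulary = {"np"}

anonymized = anonymize_oov(node_values, vocabulary)
baseline = unk_baseline(node_values, vocabulary)
```

With the baseline, `my_rare_api` and `x` both become `UNK` and can no longer be told apart. With anonymization, the two occurrences of `my_rare_api` share one placeholder and the two occurrences of `x` share another, so the model can still see which positions refer to the same name even though the name itself is dropped.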