tech-srl/code2vec

Request help for exporting the vectors for the method code

Sichengluis opened this issue · 4 comments

Hi Sir,
Thank you for your great work on code2vec, it really helped me a lot!

I am trying to build a model that predicts whether a method has a code smell, which is a binary classification problem. Since all my data is written in Java, I have used code2vec to represent each method as a vector. I have tried your model in two different ways and have the following questions.

  1. The first way is to export the embedding of each token directly from your trained model. I exported the token embeddings (tokens.txt) with the following command (python3 code2vec.py --load models/java14_model/saved_model_iter8.release --save_w2v models/java14_model/tokens.txt), then used that file as a vocabulary to represent each token of my own methods (tokens that do not appear in it are treated as PAD_OR_OOV).

  2. The second way is to use your model to export one vector per method. I ran the following command (code2vec.py --load models/java14_model/saved_model_iter8.release --export_code_vectors --test methods_oneline.txt). The file methods_oneline.txt contains thousands of lines, one method per line, but I always get the following error.
    [screenshot of the error]

I know it would be better to train a model from scratch on my own data, but I'd like to use your model directly. Do you have any suggestions for me? I'm new to AI, so how can I get the best results when using your model as-is?

Thank you and your team in advance, and sorry if my question was not clearly expressed or too naive.

Hi @Sichengluis ,
Thank you for your interest in our work!

I think the only issue is that methods_oneline.txt needs to be a file that was preprocessed with our preprocess.sh pipeline. It seems that it might have passed through the JavaExtractor step here: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L42 ,

but the output might not have passed through this later step: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L60 ?

Let me know if I misunderstood your question.
Best,
Uri
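For readers following along, here is a rough sketch of the kind of check that could tell the two stages apart. It rests on an assumption that is not stated in this thread: that the extractor writes one line per method as a target name followed by a variable number of space-separated contexts, and that the later preprocessing step fixes every line to the same number of context fields by truncating or padding with empty fields. The helper names and the `max_contexts` parameter are hypothetical, and the real pipeline also builds vocabularies and samples contexts, which this sketch does not do.

```python
def is_fixed_width(line, max_contexts):
    """Return True if the line has a target name plus exactly max_contexts fields.

    Assumption: a preprocessed line is padded with empty fields, so splitting
    on a single space yields max_contexts + 1 parts.
    """
    return len(line.rstrip("\n").split(" ")) == max_contexts + 1


def pad_or_truncate(line, max_contexts):
    """Illustrative only: force a line to max_contexts context fields.

    This keeps the first max_contexts contexts and pads with empty fields.
    It is a simplification, not the project's actual preprocessing logic.
    """
    parts = line.rstrip("\n").split(" ")
    target = parts[0]
    contexts = [c for c in parts[1:] if c][:max_contexts]
    padding = [""] * (max_contexts - len(contexts))
    return " ".join([target] + contexts + padding)
```

If most lines of a file fail `is_fixed_width` for the value of `max_contexts` the model was trained with, that would be consistent with the file having skipped the second step. Running the repository's own preprocess.sh remains the reliable fix.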

Hi @urialon ,

Thank you for your reply!

The second way is a little hard for me to use. I wonder whether the first way I mentioned above is feasible. Can I use the vocabulary to look up the embedding of each token directly, as with word2vec?

Best,
Luis

Yes. But you don't need to export anything; you can just download the vocabularies. See: https://github.com/tech-srl/code2vec#exporting-the-trained-token-vectors-and-target-vectors

Best,
Uri
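A minimal sketch of the lookup described above, assuming tokens.txt is in the plain-text word2vec format (a header line with the vocabulary size and dimension, then one token and its vector per line). The function names are hypothetical, and mapping unknown tokens to a zero vector is just one possible stand-in for PAD_OR_OOV. If gensim is installed, `KeyedVectors.load_word2vec_format` can read the same format instead of the hand-written parser.

```python
def load_w2v_text(path):
    """Parse a word2vec-style text file into a {token: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <dimension>"
        _, dim = (int(x) for x in f.readline().split()[:2])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def embed_tokens(tokens, vectors, dim):
    """Look up each token; tokens missing from the vocabulary get a zero vector."""
    oov = [0.0] * dim
    return [vectors.get(tok, oov) for tok in tokens]
```

The resulting list of per-token vectors could then be fed to a downstream classifier, for example after averaging or through a sequence model. Note that the tokens of your own methods would need to be normalized the same way as the tokens in the vocabulary file for the lookups to hit.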

Great! Thank you again!