Request help for exporting the vectors for the method code

Question

Sichengluis opened this issue 3 years ago · 4 comments

Hi Sir,
Thank you for your great work on code2vec, it really helped me a lot!

I am trying to build an AI model that predicts whether a method has a code smell, which is a binary classification problem. Since all my data is written in Java, I have used code2vec to represent the method code as a vector. I have used your model directly in two different ways, and I have the following questions.

The first way is to export the embedding of each token. Using your trained model, I exported the token embeddings to tokens.txt (python3 code2vec.py --load models/java14_model/saved_model_iter8.release --save_w2v models/java14_model/tokens.txt) and used that file as a vocabulary to represent each token of my own methods (tokens that are not in the file are treated as PAD_OR_OOV).
The second way is to use your model to export one vector per method. I ran the following command: code2vec.py --load models/java14_model/saved_model_iter8.release --export_code_vectors --test methods_oneline.txt. The file methods_oneline.txt contains thousands of lines, one method per line, but I always get the following error.

I know it would be better to train a model from scratch on my own data, but I'd like to use your model directly. Do you have any suggestions for me? I'm new to AI, so how can I get better results while still using your model directly?

Thank you and your team in advance, and sorry if my question was not clearly expressed or is too naive.

Answer 1 · 2022-01-22T04:26:56.000Z

Hi @Sichengluis ,
Thank you for your interest in our work!

I think that the only issue is that the file methods_oneline.txt needs to be a file that was preprocessed using our preprocess.sh pipeline. It seems that it might have passed through the JavaExtractor step here: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L42, but the output might not have passed through this step: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L60?

Let me know if I misunderstood your question.
Best,
Uri
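
One way to check which kind of file methods_oneline.txt is before passing it to --test is a small format check like the sketch below. It is not part of code2vec. It assumes that an extracted line is a method label followed by space-separated contexts of the form left_token,path,right_token, and that tokens contain no commas or spaces; summarize_line is a hypothetical helper name. A line of raw Java source would show up with most of its fields counted as malformed.

```python
from typing import Tuple


def summarize_line(line: str) -> Tuple[str, int, int]:
    """Return (label, number of contexts, number of malformed contexts).

    Assumption (not taken from the thread): each line is
    "<label> <ctx> <ctx> ...", where every ctx is "left,path,right".
    """
    fields = line.rstrip("\n").split(" ")
    label = fields[0]
    # Empty fields can appear if the line is padded with trailing spaces.
    contexts = [field for field in fields[1:] if field]
    malformed = [ctx for ctx in contexts if len(ctx.split(",")) != 3]
    return label, len(contexts), len(malformed)


if __name__ == "__main__":
    examples = [
        "get|name name,123,this this,-45,name  ",    # looks like extracted contexts
        "public String getName() { return name; }",  # looks like raw Java source
    ]
    for example in examples:
        print(summarize_line(example))
```

If most lines report malformed contexts, the file has probably not been through the extraction and preprocessing steps linked above.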

Answer 2 · 2022-03-09T02:55:52.000Z

Hi @urialon ,

Thank you for your reply!

The second way is a little hard for me to use. Is the first way I mentioned above feasible? Can I use the vocabulary to look up the embedding of each token directly, as I would with word2vec?

Best,
Luis

Answer 3 · 2022-03-09T03:09:58.000Z

Yes. But you don't need to export anything, you can just download the vocabularies. See: https://github.com/tech-srl/code2vec#exporting-the-trained-token-vectors-and-target-vectors

Best,
Uri
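
For readers who want to try the first approach, here is a minimal sketch of a word2vec-style lookup. It assumes the token file uses the word2vec text layout (a header line with the vocabulary size and dimension, then one token and its values per line), and it maps tokens missing from the file to a zero vector, which is only one possible stand-in for PAD_OR_OOV. The names load_w2v_text and average_vectors are hypothetical helpers, not part of code2vec, and the tokens you look up must be normalized the same way as the entries in the file.

```python
from typing import Dict, Iterable, List, TextIO


def load_w2v_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse a word2vec-style text file into a token -> vector dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def average_vectors(tokens: Iterable[str],
                    vectors: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the token vectors; unknown tokens count as zero vectors."""
    zero = [0.0] * dim
    rows = [vectors.get(token, zero) for token in tokens]
    if not rows:
        return zero
    return [sum(column) / len(rows) for column in zip(*rows)]


if __name__ == "__main__":
    # The path is the one used in the question; adjust it to your setup.
    with open("models/java14_model/tokens.txt", encoding="utf-8") as f:
        token_vectors = load_w2v_text(f)
    dim = len(next(iter(token_vectors.values())))
    print(average_vectors(["name", "value"], token_vectors, dim)[:5])
```

Averaging is just one simple way to turn token vectors into a single method vector for a classifier. Unlike the exported code vectors, it ignores the path and attention information that the model learns.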

Answer 4 · 2022-03-09T03:24:19.000Z

Great! Thank you again!