Integrating astminer with code2vec for C source codes
RichardZapanta opened this issue · 13 comments
Hello!
I was able to extract a path_contexts.c2s file using astminer. However, my goal is to extract code vectors from the given source code. I don't know how to integrate path_contexts.c2s with code2vec. What are the next steps, and which files do I need to modify?
Thank you!
Hi @RichardZapanta ,
Thank you for your interest in our work.
Did you see these sections of the README?
https://github.com/tech-srl/code2vec#extending-to-other-languages
and
https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples
Did you see also this: #60 ?
Best,
Uri
Hi @urialon!
I went over these sections of the README and also checked some of the earlier issues. Based on what I understand, I need to train the model on C source code first in order to export its code vectors. Is this correct?
If so, is there a way to export the code vectors of C source code without training the model on our dataset?
Thank you!
Hi @RichardZapanta ,
Yes, you will need a model that was trained on C data first.
The code vectors are meaningless without training.
Best,
Uri
Hi @urialon !
I was able to successfully preprocess my data using astminer and a modified preprocess.sh. It printed this in the terminal:
The data folder then contained these files:
Then I proceeded to train the model from scratch, after changing some values in train.sh. However, I encountered this error:
IndexError: list index out of range
This is what my current train.sh file looks like:
Below is the screenshot from the terminal:
The data folder also produced 2 more files:
I only edited preprocess.sh and train.sh and left the rest untouched. What are the possible workarounds to fix this issue?
Thank you!
Hi,
let's try to skip the "filter impossible names", by just replacing this line:
https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L460
with:
prediction = top_words[0]
Let me know how it goes.
Best,
Uri
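One plausible reading of the IndexError is that the name filter left no legal candidates, so indexing the first element of an empty list failed. The sketch below illustrates that failure mode and a fallback to the unfiltered top prediction. It is not the code in tensorflow_model.py: `pick_prediction` and the `is_legal` check are hypothetical names chosen for illustration.

```python
def pick_prediction(top_words, is_legal=lambda word: word.isidentifier()):
    """Return the first legal candidate, falling back to the raw top-1 word.

    `is_legal` is a hypothetical stand-in for whatever name filter the
    model applies; here it simply accepts valid identifiers.
    """
    legal = [word for word in top_words if is_legal(word)]
    if legal:
        return legal[0]
    # The filter rejected everything: indexing `legal[0]` here would raise
    # IndexError, so fall back to the unfiltered top prediction instead.
    return top_words[0] if top_words else None
```

With this shape, a list of candidates that are all rejected by the filter still yields a prediction, which is the effect of using `top_words[0]` directly.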
Hi @urialon,
I encountered a new error after following that instruction:
ZeroDivisionError: float division by zero
on line 493 of tensorflow_model.py. See the screenshot below.
Thank you!
Hi @RichardZapanta , I just fixed that, can you please pull again?
Thank you for reporting this!
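A division by zero in metric code typically appears when no prediction counts as correct, so a denominator ends up at zero. The following is a minimal sketch of a guarded precision/recall/F1 computation. It is not the fix that was pushed to the repository; the function name and its count-based arguments are assumptions for illustration.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall, and F1, returning 0.0 for empty denominators."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted > 0 else 0.0
    recall = true_positive / actual if actual > 0 else 0.0
    total = precision + recall
    # When both precision and recall are zero, F1 is defined as 0.0 here
    # instead of dividing by zero.
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```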
Hi @urialon,
Thank you very much for this. I was able to train the model; however, I have some concerns.
- Will we be able to export code vectors correctly if we get very low validation results (precision, recall, and F1)?
- Is there a way to change the input file from Input.java to another path?
- Is there a way to export code vectors into a text file (.txt)?
- Which file will be our final model after training? (See the available files below.)
Thank you very much once again!
Hi @RichardZapanta ,
Here are some answers to your questions:
Will we be able to export code vectors correctly if we get very low validation results (precision, recall, and F1)?
Technically yes, but these vectors might not be that "good" for downstream tasks.
Is there a way to change the input file from Input.java to another path?
Yes, here: https://github.com/tech-srl/code2vec/blob/master/interactive_predict.py#L29
Is there a way to export code vectors into a text file (.txt)?
Yes, see https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples
Which file will be our final model after training?
It is your choice, but the common practice is to take the checkpoint with the best validation accuracy according to the training logs.
Best,
Uri
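Picking the best checkpoint from the training logs can be sketched as below. The sketch assumes the log contains evaluation lines in the "After N epochs -- ... F1: x" shape quoted later in this thread, and it uses F1 as the selection metric, which is an assumption. `best_epoch` is a hypothetical helper, not part of code2vec.

```python
import re

# Matches lines such as:
# "After 2 epochs -- top10_acc: [...], precision: 0.09, recall: 0.09, F1: 0.09"
LOG_PATTERN = re.compile(
    r"After (\d+) epochs .*?F1: (\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def best_epoch(log_lines):
    """Return (epoch, f1) for the evaluation line with the highest F1, or None."""
    best = None
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not an evaluation line
        epoch, f1 = int(match.group(1)), float(match.group(2))
        if best is None or f1 > best[1]:
            best = (epoch, f1)
    return best
```

The returned epoch number would then point at the matching saved_model_iterN checkpoint in the model directory.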
Hi @urialon ,
Thank you for the quick response. We encountered some problems when trying to extract code vectors from C source code. After training the model from scratch using source train.sh, we got very low (almost 0) validation results, and also got zero when evaluating the trained model. This is because the data we used to train the model consists of multiple C programs that solve the same problem, since our study focuses on identifying source code similarity.
We did the following steps:
- Preprocessed our data using source preprocess.sh with the help of astminer.
- Trained the model using source train.sh, and obtained these files:
- Released the model of the 2nd iteration, which had this result:
After 2 epochs -- top10_acc: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], precision: 0.09090909090909091, recall: 0.09090909090909091, F1: 0.09090909090909091
and obtained these three files:
- Changed the input file to use these lines of code as input:
- Ran
python3 code2vec.py --load models/AcerTrial/saved_model_iter2.release --predict --export_code_vectors
and obtained this error:
With that being said, are there any steps that we did wrong or skipped, and what can we do to fix this issue? Are there any data files that we need so that the model can interpret the input file as C source code and not as Java?
Thank you!
P.S. accidentally closed the issue. Apologies for this.
Hi @RichardZapanta ,
It seems that the model did not learn anything useful.
You can either train longer (why only 2 epochs?), or try code2seq.
Additionally, the --predict option will not work on C code, because it expects to parse Java.
Best,
Uri
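Since the stated goal is source code similarity, one way to use exported code vectors is to compare them with cosine similarity. The sketch below assumes the export is a text file with one whitespace-separated vector per line, which should be checked against the README section linked earlier in this thread. `parse_vectors` and `cosine_similarity` are hypothetical helpers, not part of code2vec.

```python
import math


def parse_vectors(lines):
    """Parse an iterable of text lines into a list of float vectors.

    Assumes one whitespace-separated vector per line; blank lines are skipped.
    """
    vectors = []
    for line in lines:
        parts = line.split()
        if parts:
            vectors.append([float(value) for value in parts])
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A typical use would be to open the exported file, pass the file object to `parse_vectors`, and compare pairs of rows. Vectors from a model with near-zero validation scores are unlikely to give meaningful similarities.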
Hi @RichardZapanta ,
We just released a model that works better on C than OpenAI's Codex:
https://arxiv.org/pdf/2202.13169.pdf
https://github.com/VHellendoorn/Code-LMs
Best,
Uri