tech-srl/code2vec

Help for using code2vec for C/C++ language

Closed this issue · 11 comments

I know you describe extending code2vec to other languages in https://github.com/tech-srl/code2vec#extending-to-other-languages, but could you describe the step-by-step process for doing this for C/C++?

That is, should we use PathMiner directly and feed its output to code2vec? What is PathMiner's input? It would be really helpful if you could tell me how to substitute PathMiner for JavaExtractor. https://github.com/vovak/astminer

It would also be helpful if you know of any repository that has already extended code2vec to C/C++.

Hi @shubhamrsangle ,
Thank you for your interest in code2vec.

I have not tried PathMiner myself, but I know that many people have used it with code2vec for C/C++/Python.

I am guessing that PathMiner does only the step that is equivalent to JavaExtractor (i.e., only up to this line), but then you need to run the next preprocessing steps that begin here.

Let me know how it goes.
Best,
Uri

Hello @urialon,

The output of ASTMiner is four files: paths.csv, nodes.csv, tokens.csv, and path_contexts.csv.

But according to your earlier comment, the output was supposed to be a single raw.txt file.

Since I am getting a different output, I can't proceed with the further steps mentioned earlier.

Can you please give at least some details so we can proceed? It is not mentioned anywhere how this can be used with C/C++, and I didn't find a repository that does so.

I'm sorry, I don't know.
According to their README it seems that path_contexts.csv is the file that you need.
Did you try to ask the authors of PathMiner or look in their README?

Hi @shubhamrsangle , I am sorry to ask again in a closed issue. Could you find any further information on how to apply code2vec for C/C++ source code?

Dear @urialon ,
I have also been following the code2vec paper for a while and really admire your presentation. I am working on a project that needs to embed C/C++ functions as vectors for machine learning models. So far, code2vec is the closest method I have found that I could try out on my project. I would really appreciate it if you could give more detailed examples or instructions for extending your work to C/C++.

Thank you for reading my comment.

Hi,
Please see https://github.com/tech-srl/code2vec#extending-to-other-languages and let me know if you have any specific questions.

You can simply look at the generated data (by running preprocess.sh before it calls the preprocess.py script) and generate data in the same format instead.
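
To make "the same format" concrete, here is a minimal Python sketch of writing one example line. It relies on the format described in the code2vec README: each line is a space-separated list of fields, the first field is the target label with subtokens joined by `|`, and each remaining field is a context of three comma-separated components that may not contain spaces or commas. The helper names `clean` and `format_example` are hypothetical and are not part of code2vec.

```python
import re

# A space separates fields and a comma separates the three parts of a
# context, so neither character may appear inside a component.
_RESERVED = re.compile(r"[ ,]")


def clean(part):
    """Remove reserved delimiter characters from one component (hypothetical helper)."""
    return _RESERVED.sub("", part)


def format_example(label_subtokens, contexts):
    """Build one line: '<label> <src,path,dst> <src,path,dst> ...'."""
    label = "|".join(clean(s) for s in label_subtokens)
    fields = [",".join(clean(p) for p in ctx) for ctx in contexts]
    return " ".join([label] + fields)


line = format_example(
    ["get", "name"],
    [("this", "path 1", "name"), ("name", "path2", "return")],
)
print(line)  # get|name this,path1,name name,path2,return
```

An extractor for another language would emit one such line per function, and the rest of the preprocessing can then stay unchanged.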

Dear @urialon
Thank you very much for your prompt reply.
I understand everything up to the step where I need to use ASTMiner to get path_contexts.csv. My task does not require predicting C/C++ function names; it is a binary classification task. I think code2vec can help with the feature extraction step, since it can potentially provide a more informative function representation vector.
Could you explain a bit further how to get a fixed-length vector (a representation of a C/C++ function) from path_contexts.csv using the code2vec method?
Thank you for your time.

Hi,
I don't know what the file path_contexts.csv looks like; it is a file produced by ASTMiner. I did not create ASTMiner, have not used it, and cannot support it.

If I understand correctly, the path_contexts.csv file is equivalent to my *.raw.txt files. Try to follow preprocess.sh and pass the path_contexts.csv files to preprocess.py in place of the files that were produced by our JavaExtractor.
Then, you will probably need to train the model again.
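
As a rough illustration of what happens between the extractor output and preprocess.py, the sketch below counts targets, tokens, and paths over lines in the raw format. It assumes that preprocess.sh builds histograms of this kind before calling preprocess.py, which should be checked against the script itself. The function names and the `key count` output layout are assumptions for illustration, not code2vec's API.

```python
from collections import Counter


def build_histograms(raw_lines):
    """Count targets, tokens, and paths over lines in the raw format."""
    targets, tokens, paths = Counter(), Counter(), Counter()
    for line in raw_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        targets[fields[0]] += 1  # first field is the target label
        for ctx in fields[1:]:
            parts = ctx.split(",")
            if len(parts) != 3:
                continue  # ignore malformed contexts
            src, path, dst = parts
            tokens[src] += 1
            tokens[dst] += 1
            paths[path] += 1
    return targets, tokens, paths


def write_histogram(counter, filename):
    """Write one 'key count' pair per line (assumed layout)."""
    with open(filename, "w", encoding="utf-8") as f:
        for key, count in counter.items():
            f.write(f"{key} {count}\n")


targets, tokens, paths = build_histograms(["f a,1,b b,1,c", "g a,2,a"])
print(tokens["a"], paths["1"])  # 3 2
```

If the ASTMiner output already follows the three-component context layout, the same counting should apply to it unchanged.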

See this thread: #72

Best,
Uri

Thank you very much for your quick reply,
I will check that and let you know the results soon.
Best regards
Hai

Hello @anhhaibkhn

I followed these steps:
1: Downloaded the ASTMiner CLI jar file and named it cli.jar.
2: Then changed the preprocess.sh file in code2vec-master. I am attaching my new preprocess.sh; you can add your paths to it and then check.

https://gist.github.com/shubhamrsangle/0f7ecc04a04d3371d22b19321b1ec547

I hope this helps.