tech-srl/code2vec

Pretrained model for Python

Avv22 opened this issue · 3 comments

Avv22 commented

Hello,

I have a bunch of Python ASTs and Java ASTs in the following format:

[{"id": 0, "type": "Module", "children": [1, 7, 19, 22, 38]}, {"id": 1, "type": "Assign", "children": [2, 3]}, {"id": 2, "type": "NameStore", "value": "S"}, {"id": 3, "type": "Call", "children": [4, 5]}, {"id": 4, "type": "NameLoad", "value": "list"}, {"id": 5, "type": "Call", "children": [6]}, {"id": 6, "type": "NameLoad", "value": "input"}, {"id": 7, "type": "Assign", "children": [8, 9]}, {"id": 8, "type": "NameStore", "value": "a"}, {"id": 9, "type": "Call", "children": [10, 11]}, {"id": 10, "type": "NameLoad", "value": "list"}, {"id": 11, "type": "Call", "children": [12, 13, 14]}, {"id": 12, "type": "NameLoad", "value": "map"}, {"id": 13, "type": "NameLoad", "value": "int"}, {"id": 14, "type": "Call", "children": [15]}, {"id": 15, "type": "AttributeLoad", "children": [16, 18]}, {"id": 16, "type": "Call", "children": [17]}, {"id": 17, "type": "NameLoad", "value": "input"}, {"id": 18, "type": "attr", "value": "split"}, {"id": 19, "type": "Assign", "children": [20, 21]}, {"id": 20, "type": "NameStore", "value": "factor"}, {"id": 21, "type": "Num", "value": "0"}, {"id": 22, "type": "For", "children": [23, 24, 25]}, {"id": 23, "type": "NameStore", "value": "tmp"}, {"id": 24, "type": "NameLoad", "value": "a"}, {"id": 25, "type": "body", "children": [26, 35]}, {"id": 26, "type": "Expr", "children": [27]}, {"id": 27, "type": "Call", "children": [28, 31, 34]}, {"id": 28, "type": "AttributeLoad", "children": [29, 30]}, {"id": 29, "type": "NameLoad", "value": "S"}, {"id": 30, "type": "attr", "value": "insert"}, {"id": 31, "type": "BinOpAdd", "children": [32, 33]}, {"id": 32, "type": "NameLoad", "value": "tmp"}, {"id": 33, "type": "NameLoad", "value": "factor"}, {"id": 34, "type": "Str", "value": "\\""}, {"id": 35, "type": "AugAssignAdd", "children": [36, 37]}, {"id": 36, "type": "NameStore", "value": "factor"}, {"id": 37, "type": "Num", "value": "1"}, {"id": 38, "type": "Expr", "children": [39]}, {"id": 39, "type": "Call", "children": [40, 41]}, {"id": 40, "type": 
"NameLoad", "value": "print"}, {"id": 41, "type": "Call", "children": [42, 45]}, {"id": 42, "type": "AttributeLoad", "children": [43, 44]}, {"id": 43, "type": "Str", "value": ""}, {"id": 44, "type": "attr", "value": "join"}, {"id": 45, "type": "NameLoad", "value": "S"}]

How can I get embeddings for these ASTs with your model? Is there an already-trained model for Python that I can use directly to output embeddings, like your trained model for Java, or do I need to train the model from scratch for Python? If so, could you please show me how to get started?
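
For reference, the flat node-list format in the sample above can be loaded into a lookup table keyed by `id`, which makes the terminals (the nodes that carry a `value`) easy to list. This is a minimal sketch on a five-node fragment written in the same format; `index_nodes` and `leaf_values` are hypothetical helper names and are not part of code2vec.

```python
import json

# A small fragment in the same flat format as the sample (assumed to be valid JSON).
SAMPLE = '''[
  {"id": 0, "type": "Module", "children": [1]},
  {"id": 1, "type": "Assign", "children": [2, 3]},
  {"id": 2, "type": "NameStore", "value": "S"},
  {"id": 3, "type": "Call", "children": [4]},
  {"id": 4, "type": "NameLoad", "value": "list"}
]'''


def index_nodes(ast_json):
    """Parse the JSON list and return a dict mapping node id -> node."""
    return {node["id"]: node for node in json.loads(ast_json)}


def leaf_values(nodes, root_id=0):
    """Return (type, value) for each leaf in depth-first, left-to-right order."""
    leaves = []
    stack = [root_id]
    while stack:
        node = nodes[stack.pop()]
        children = node.get("children", [])
        if not children:
            leaves.append((node["type"], node.get("value")))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(children))
    return leaves


nodes = index_nodes(SAMPLE)
print(leaf_values(nodes))
```

The leaves are the tokens that a path-based model would treat as the endpoints of its paths; the internal nodes only contribute their `type`.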

Hi Avra,
Thank you for your interest in this work! Sorry again for the delayed response.

Yes, you will need to train the model from scratch for Python. See: https://github.com/tech-srl/code2vec#extending-to-other-languages

As for Java, you will need to either extract paths from your ASTs in the same format as our data,
or de-serialize your ASTs (convert them back to code) and run our JavaExtractor on the resulting code.
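
To illustrate what "extracting paths" from such an AST could look like, here is a rough sketch that enumerates leaf-to-leaf paths over a flat node list in the format from the question. It is not the code2vec extractor: the function name `path_contexts`, the `^`/`_` separators, and the absence of any path hashing or length/width limits are assumptions made for illustration, so the exact input format should be checked against the README section linked above.

```python
from itertools import combinations

# Hypothetical fragment in the flat node-list format from the question.
AST = [
    {"id": 0, "type": "Assign", "children": [1, 2]},
    {"id": 1, "type": "NameStore", "value": "S"},
    {"id": 2, "type": "Call", "children": [3, 4]},
    {"id": 3, "type": "NameLoad", "value": "list"},
    {"id": 4, "type": "NameLoad", "value": "input"},
]


def path_contexts(ast):
    """Return (source_value, path, target_value) for every pair of leaves."""
    nodes = {n["id"]: n for n in ast}
    parent = {c: n["id"] for n in ast for c in n.get("children", [])}

    def ancestors(node_id):
        """Node ids from node_id up to the root, inclusive."""
        chain = [node_id]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    leaves = [n["id"] for n in ast if not n.get("children")]
    contexts = []
    for a, b in combinations(leaves, 2):
        up, down = ancestors(a), ancestors(b)
        # Lowest common ancestor: first node on a's chain that is also on b's.
        on_b = set(down)
        lca = next(i for i in up if i in on_b)
        up_part = up[: up.index(lca) + 1]          # a ... lca (going up)
        down_part = down[: down.index(lca)][::-1]  # below lca ... b (going down)
        path = "^".join(nodes[i]["type"] for i in up_part)
        path += "_" + "_".join(nodes[i]["type"] for i in down_part)
        contexts.append((nodes[a].get("value"), path, nodes[b].get("value")))
    return contexts


for source, path, target in path_contexts(AST):
    print(source, path, target)
```

Each triple pairs two terminals with the sequence of node types connecting them through their lowest common ancestor, which is the general idea behind a path-context.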

Best,
Uri

Avv22 commented

@urialon Okay! So the AST sample above, for a single Python program, will not work as is? Do I have to run your JavaExtractor and astminer on my code samples in order to train code2vec on them?

Correct.