[Question] Making use of TreeSitterLangProcessor
ziwenyd opened this issue · 2 comments
Hi,
I am trying to use TransCoder to translate between JavaScript and Python, so I am trying to build a JavaScript processor for the processing.py pipeline (as mentioned in #42 (comment)).
To build the processor I need a tokenizer (mentioned in #48 (comment)), and I want to use ANTLR4, which is a parser generator that includes a lexer.
However, I don't understand how to make use of the TreeSitterLangProcessor class. I looked at the Java and C++ processors for reference and found that I need to provide three language-specific init params when inheriting from TreeSitterLangProcessor (a rough skeleton of what I mean follows this list):
- JAVA_TOKEN2CHAR
- JAVA_CHAR2TOKEN (built from JAVA_TOKEN2CHAR)
- ast_nodes_type_string
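For reference, this is roughly the pattern I am trying to imitate, based on my reading of the Java processor. The import path and keyword argument names here are copied from memory, so they may not match the repository exactly, and the JavaScript-specific values are placeholders, which is exactly what I am unsure about:

```python
from codegen_sources.preprocessing.lang_processors.tree_sitter_processor import (
    TreeSitterLangProcessor,
)

# Placeholder mappings: which character sequences need protecting for
# JavaScript is part of my question below.
JS_TOKEN2CHAR = {
    "STOKEN00": "//",
    "STOKEN01": "/*",
    "STOKEN02": "*/",
}
# The CHAR2TOKEN dict is derived from TOKEN2CHAR by inverting it and padding
# the placeholder with spaces, as the Java processor seems to do.
JS_CHAR2TOKEN = {char: f" {token} " for token, char in JS_TOKEN2CHAR.items()}


class JavascriptProcessor(TreeSitterLangProcessor):
    def __init__(self, root_folder):
        super().__init__(
            # Node types whose text should be treated as strings/comments;
            # the names would have to come from the tree-sitter-javascript grammar.
            ast_nodes_type_string=["comment", "string"],
            stokens_to_chars=JS_TOKEN2CHAR,
            chars_to_stokens=JS_CHAR2TOKEN,
            root_folder=root_folder,
        )
```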
Question 1 - what do these init params represent?
I don't understand how to find out what values should be stored in TOKEN2CHAR and ast_nodes_type_string for a new language (JavaScript). For example:
- Why does "STOKEN00" refer to "//" in JAVA_TOKEN2CHAR? Where does the mapping come from?
- Why does ast_nodes_type_string in the Java processor have 'character_literal' while the C++ processor seems to call the same thing 'char_literal'? How can I find out what to save in ast_nodes_type_string for a JavaScript processor?
Question 2 - when should I use TreeSitterLangProcessor and when not?
Why doesn't the Python processor make use of the TreeSitterLangProcessor class? In which cases is it better to use TreeSitterLangProcessor, and in which is it better not to?
Question 3 - why do I need a tokenizer given TreeSitterLangProcessor?
As mentioned in Question 1, it seems that if I inherit my JavaScript processor from TreeSitterLangProcessor, those three init params are the only things I need to provide myself, and the rest (tokenization and detokenization) is handled by TreeSitterLangProcessor.
Then why would I need a JavaScript tokenizer (mentioned in #48 (comment)) such as ANTLR4?
I hope I have described my questions clearly, and sorry that I am still confused about this after two issues regarding adding a new language.
Thanks for the awesome paper and well-structured repository, and thanks in advance for anyone's help!
Hi,
Sorry for the late response.
While you could use ANTLR to tokenize your JavaScript code, I believe you could get to the same result faster using tree-sitter (since most of the code would already be written).
Question 1
- why "STOKEN00" refers to "//" in JAVA_TOKEN2CHAR, where does the mapping come from?
These are just special tokens that we replace before processing strings. It ensures that these tokens won't be modified by the string processing (we replace the special tokens back with the original tokens afterwards).
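As a toy illustration of that round trip (this is not the actual code in the repo, just the idea):

```python
# Illustrative only: placeholder tokens stand in for character sequences
# that the string processing must not touch.
TOKEN2CHAR = {"STOKEN00": "//", "STOKEN01": "/*", "STOKEN02": "*/"}
CHAR2TOKEN = {char: f" {token} " for token, char in TOKEN2CHAR.items()}

def protect(text):
    # Hide the sensitive character sequences behind placeholder tokens
    # before any string processing runs.
    for char, padded_token in CHAR2TOKEN.items():
        text = text.replace(char, padded_token)
    return text

def restore(text):
    # Swap the placeholders back for the original characters afterwards.
    for char, padded_token in CHAR2TOKEN.items():
        text = text.replace(padded_token, char)
    return text

assert restore(protect("x = 1 // comment")) == "x = 1 // comment"
```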
- Why does ast_nodes_type_string in the Java processor have 'character_literal' while the C++ processor seems to call the same thing 'char_literal'? How can I find out what to save in ast_nodes_type_string for a JavaScript processor?
Unfortunately, the node names in TreeSitter are not the same across languages. You would need to find the corresponding node names in the grammar of tree-sitter-javascript: https://github.com/tree-sitter/tree-sitter-javascript/blob/master/grammar.js#L452
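A quick way to see what the JavaScript grammar calls things is to parse a small snippet with the tree_sitter Python bindings and print the node types. Something like this sketch should work (the paths are examples, it assumes you have cloned tree-sitter-javascript next to your script, and it uses the older bindings API where Language takes a path to a built shared library):

```python
from tree_sitter import Language, Parser

# Build the JavaScript grammar into a shared library (paths are examples).
Language.build_library("build/langs.so", ["tree-sitter-javascript"])
JS_LANGUAGE = Language("build/langs.so", "javascript")

parser = Parser()
parser.set_language(JS_LANGUAGE)

tree = parser.parse(b"const s = 'hi'; // a comment")

def print_types(node, depth=0):
    # Walk the AST and print every node type, so you can spot what the
    # grammar calls string literals, comments, template strings, etc.
    print("  " * depth + node.type)
    for child in node.children:
        print_types(child, depth + 1)

print_types(tree.root_node)
```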
Question 2: when to use tree-sitter
I believe that using tree-sitter makes the code more generalizable and faster to implement for every language that has a good tree-sitter grammar repository. For Python, ensuring that we introduce no bugs when switching to tree-sitter is not trivial, so we kept the previous tokenizer. I want to switch it to tree-sitter but have not found the time to do it.
Question 3 - why do I need a tokenizer given TreeSitterLangProcessor?
I don't think you will need a tokenizer such as ANTLR4 if you use tree-sitter. Tree-sitter should be enough to tokenize your code.
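Once your JavaScript processor inherits from TreeSitterLangProcessor, the base class should give you tokenization directly, along these lines (shown here with the existing Java processor; the import path, constructor argument, and method names are from memory, so double-check them against the repo):

```python
# Rough usage sketch: a processor built on tree-sitter tokenizes and
# detokenizes code by itself, so no external lexer such as ANTLR is needed.
from codegen_sources.preprocessing.lang_processors.java_processor import (
    JavaProcessor,
)

processor = JavaProcessor(root_folder="tree-sitter")
tokens = processor.tokenize_code("int add(int a, int b) { return a + b; }")
print(tokens)
print(processor.detokenize_code(tokens))
```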
I switched to using tree-sitter and it works great. Thanks for the detailed answer and the great work!