tech-srl/code2seq

how did you generate the AST tree structure through code

walt676 opened this issue · 16 comments

Hello, how did you generate the AST structure from source code, for example the one shown on the right side of the page https://code2seq.org/ ?
My English is not very good, I am very sorry if I offend.
Looking forward to your reply, thank you!

Hi @walt676 ,
Thank you for your interest in our work!

Are you referring to the JavaScript visualization?
It was built with React, D3, and react-d3-tree with some domain-specific logic for our visualization.

I hope it answers your question.
Best,
Uri

Hi, @urialon ,
Thank you very much for your reply.
Actually I am not interested in the visualization. I would like to know the process of converting the source code into AST.
For the part I mentioned above, do you use JavaParser to parse the code and then visualize the result?
Thank you again for taking the time to answer my questions.

Yes,
We use the JavaParser package in our JavaExtractor to parse the source code into AST.
We use the AST mainly for training the neural model, but also for visualizing.

Thank you very much for your reply, @urialon .
I want to ask you some implementation details.
I know that in your work you use AST paths as input and store those paths in the dataset.
If I want to store the AST's tree structure itself in the dataset persistently, do you have any suggestions on what format to use?

Hi @walt676 ,
Sorry for the delayed response.

I think that if you upgrade the JavaParser version in the pom.xml file of the JavaExtractor project, you'll be able to use JavaParser's built-in serializer: https://javadoc.io/doc/com.github.javaparser/javaparser-core-serialization/latest/index.html

Best,
Uri
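
To make the storage question concrete, here is a minimal sketch of one possible on-disk layout: each AST is a nested object with a node type, an optional token value, and a list of children, written as one JSON document per line. This is not the schema produced by JavaParser's serializer. The field names (`type`, `value`, `children`), the helper functions, and the hand-built example tree are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterable, List, Optional

Node = Dict[str, Any]


def make_node(node_type: str, value: Optional[str] = None,
              children: Optional[List[Node]] = None) -> Node:
    """Build one tree node; `value` is only stored for nodes that carry a token."""
    node: Node = {"type": node_type, "children": children or []}
    if value is not None:
        node["value"] = value
    return node


def dump_trees(trees: Iterable[Node]) -> str:
    """Serialize trees as JSON Lines: one complete tree per line."""
    return "".join(json.dumps(tree, sort_keys=True) + "\n" for tree in trees)


def load_trees(text: str) -> List[Node]:
    """Inverse of dump_trees; blank lines are skipped."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_nodes(node: Node) -> int:
    """Count the nodes in a tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Hand-built (not parser-generated) tree for the statement `int x = 1;`.
# The node type names are illustrative labels only.
example = make_node("VariableDeclarationExpr", children=[
    make_node("PrimitiveType", "int"),
    make_node("VariableDeclarator", children=[
        make_node("SimpleName", "x"),
        make_node("IntegerLiteralExpr", "1"),
    ]),
])

serialized = dump_trees([example])
restored = load_trees(serialized)
```

One tree per line keeps the file easy to stream and to shard, at the cost of repeating the field names in every node.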

Hi, I wonder whether you use the package "Microsoft.CodeAnalysis.CSharp.Syntax" to parse C# code into an AST. Thank you for your reply.

Hi @cynthia0118 ,
Thank you for your interest in our work.

Yes, in C#, we use this package, as you can see here:
https://github.com/tech-srl/code2seq/blob/master/CSharpExtractor/CSharpExtractor/Extractor/Extractor.cs#L4

Best,
Uri

Thank you very much for answering my question. My problem has been resolved, so I will close this issue for now.
Thank you again, Uri!

Hi @urialon ,
Sorry to bother you again,
There is a term that I don't quite understand:
In the JavaExtractor, which you use to generate AST paths, what is 'Generic Parent'?
Also, I see that you show the AST of a code snippet on code2seq.org. Could you share the code you used to generate that AST?
Thank you for checking this issue.

Hi @walt676 ,
The "Generic parent" is simply a node that represents a generic type, like ArrayList<String>.

What do you mean by "generate AST"?
In the JavaExtractor, we parse the input code and get an object that holds the AST here:
https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java#L68

Are you interested in this object, or in serializing it as JSON?
If so, did you check my previous response about JavaParser's built-in serializer? https://javadoc.io/doc/com.github.javaparser/javaparser-core-serialization/latest/index.html

Uri

Hi @urialon ,
I modified your code to produce the data format I need, but some errors occurred when I ran it. The same errors also occur when running your original, unmodified code.
When I run bash preprocess.sh to generate only the raw data (xxxx.raw.txt), these error messages appear:

[root@localhost code2seq-master]# bash preprocess.sh 
Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: /opt/project/code2seq-master/java-small/training/cassandra was not completed in time
dir: /opt/project/code2seq-master/java-small/training/intellij-community was not completed in time
dir: /opt/project/code2seq-master/java-small/training/liferay-portal was not completed in time
dir: /opt/project/code2seq-master/java-small/training/hibernate-orm was not completed in time
dir: /opt/project/code2seq-master/java-small/training/wildfly was not completed in time
dir: /opt/project/code2seq-master/java-small/training/elasticsearch was not completed in time
b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "{" "{"\n    at line 1, column 53.\n\nWas expecting one of:\n\n    "@"\n    <IDENTIFIER>\n\n\n\tat java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1541)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "{" "{"\n    at line 1, column 53.\n\nWas expecting one of:\n\n    "@"\n    <IDENTIFIER>\n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:66)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:38)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:35)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:28)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:17)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\njava.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "package" "package"\n    at line 1, column 20.\n\nWas expecting one of:\n\n    ";"\n    "<"\n    "@"\n    "abstract"\n    "boolean"\n    "byte"\n    "char"\n    "class"\n    "default"\n    
"double"\n    "enum"\n    "final"\n    "float"\n    "int"\n    "interface"\n    "long"\n    "native"\n    "private"\n    "protected"\n    "public"\n    "short"\n    "static"\n    "strictfp"\n    "synchronized"\n    "transient"\n    "void"\n    "volatile"\n    "{"\n    "}"\n    <IDENTIFIER>\n\n\n\tat java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1541)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "package" "package"\n    at line 1, column 20.\n\nWas expecting one of:\n\n    ";"\n    "<"\n    "@"\n    "abstract"\n    "boolean"\n    "byte"\n    "char"\n    "class"\n    "default"\n    "double"\n    "enum"\n    "final"\n    "float"\n    "int"\n    "interface"\n    "long"\n    "native"\n    "private"\n    "protected"\n    "public"\n    "short"\n    "static"\n    "strictfp"\n    "synchronized"\n    "transient"\n    "void"\n    "volatile"\n    "{"\n    "}"\n    <IDENTIFIER>\n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:66)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:38)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:35)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:28)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:17)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\n'
Finished extracting paths from training set

And when I run java -Xmx100g -XX:MaxNewSize=60g -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --dir /opt/project/code2seq-master/java-small/training/gradle --max_path_length 8 --max_path_width 2 --num_threads 6 directly, it reports the following errors after printing some successfully generated data:

java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "{" "{"
    at line 1, column 53.

Was expecting one of:

    "@"
    <IDENTIFIER>


	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at JavaExtractor.App.lambda$extractDir$3(App.java:59)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at JavaExtractor.App.extractDir(App.java:57)
	at JavaExtractor.App.main(App.java:32)
Caused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "{" "{"
    at line 1, column 53.

Was expecting one of:

    "@"
    <IDENTIFIER>


	at com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)
	at com.github.javaparser.JavaParser.parse(JavaParser.java:210)
	at JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:77)
	at JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:49)
	at JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:65)
	at JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:35)
	at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:28)
	at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:17)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "package" "package"
    at line 1, column 20.

Was expecting one of:

    ";"
    "<"
    "@"
    "abstract"
    "boolean"
    "byte"
    "char"
    "class"
    "default"
    "double"
    "enum"
    "final"
    "float"
    "int"
    "interface"
    "long"
    "native"
    "private"
    "protected"
    "public"
    "short"
    "static"
    "strictfp"
    "synchronized"
    "transient"
    "void"
    "volatile"
    "{"
    "}"
    <IDENTIFIER>


	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at JavaExtractor.App.lambda$extractDir$3(App.java:59)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at JavaExtractor.App.extractDir(App.java:57)
	at JavaExtractor.App.main(App.java:32)
Caused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "package" "package"
    at line 1, column 20.

Was expecting one of:

    ";"
    "<"
    "@"
    "abstract"
    "boolean"
    "byte"
    "char"
    "class"
    "default"
    "double"
    "enum"
    "final"
    "float"
    "int"
    "interface"
    "long"
    "native"
    "private"
    "protected"
    "public"
    "short"
    "static"
    "strictfp"
    "synchronized"
    "transient"
    "void"
    "volatile"
    "{"
    "}"
    <IDENTIFIER>


	at com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)
	at com.github.javaparser.JavaParser.parse(JavaParser.java:210)
	at JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:77)
	at JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:49)
	at JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:65)
	at JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:35)
	at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:28)
	at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:17)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

I increased the timeout here:
https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/extract.py#L37
But that didn't solve the problem.
Sorry to bother you. I look forward to your reply. Thank you!
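
When diagnosing output like the above, it can help to tell apart a subprocess that really ran out of time from one that exited with an error, and to decode its stderr so stack traces print as readable text rather than as an escaped `b'...'` literal. The following is a minimal sketch of such a wrapper. `run_with_timeout` and its status strings are hypothetical and are not the code in extract.py. The demo call launches the Python interpreter itself as a stand-in command.

```python
import subprocess
import sys
from typing import List, Tuple


def run_with_timeout(command: List[str], timeout_seconds: float) -> Tuple[str, str]:
    """Run `command` and return (status, stderr_text).

    status is "ok" (exit code 0), "failed" (non-zero exit code),
    or "timeout" (the process was killed after `timeout_seconds`).
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "timeout", ""
    # Decode bytes so "\n" and "\t" become real line breaks and tabs.
    stderr_text = completed.stderr.decode("utf-8", errors="replace")
    status = "ok" if completed.returncode == 0 else "failed"
    return status, stderr_text


# Stand-in command: a Python child process that exits immediately.
demo_status, demo_stderr = run_with_timeout([sys.executable, "-c", "pass"], 30)
```

With a wrapper like this, a "failed" status with a ParseProblemException in stderr points to unparsable input files, which a longer timeout would not be expected to fix.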

Hi @urialon ,
I also think this is the reason.
I used the same dataset and JavaParser version as you.
Did this error also occur when you processed this file?
I will skip these problematic Java files and process the remaining ones. Thank you!

Hi,
Yes, in my preprocessing some files raised parsing errors, which I ignored.
You can verify that the final number of lines in the file is (roughly) as expected.

Best,
Uri
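
The sanity check suggested above could be sketched like this: count the non-empty lines in an extractor output file and compare the count with the number you expect, allowing some slack for files that were skipped because of parse errors. The function names, the 5% default tolerance, and the temporary demo file are assumptions for illustration. Substitute the path of your own raw output file and your own expected count.

```python
import os
import tempfile


def count_examples(path: str) -> int:
    """Count non-empty lines; each such line is assumed to be one extracted example."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def roughly_matches(actual: int, expected: int, tolerance: float = 0.05) -> bool:
    """True if `actual` is within `tolerance` (as a fraction) of `expected`."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return abs(actual - expected) <= tolerance * expected


# Demo on a throwaway file with three examples and one blank line.
with tempfile.NamedTemporaryFile("w", suffix=".raw.txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("label_a ctx1 ctx2\nlabel_b ctx1\n\nlabel_c ctx1 ctx2 ctx3\n")
    demo_path = tmp.name

demo_count = count_examples(demo_path)
os.remove(demo_path)
```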

Thank you for your detailed reply!