Identifying the source of the token (Program vs. Copybook) within AST?

Question

hegdemk2004 opened this issue 4 years ago · 3 comments

hegdemk2004 commented 4 years ago

Hi Kris,
I am dumping the AST to XML using ToXML, and I have a couple of queries regarding the pre-processing stage:

1. I noticed that the AST (at least the one dumped to XML) has lost the original COPY, REPLACE, etc. statements completely. The pre-processor "over-wrote" the COPY statement with the copybook contents. While that is syntactically OK, knowing which tokens in the AST came from which component (the program or an expanded copybook) is crucial for code analysis and automated refactoring. As with the real COBOL pre-processor, it would be good if the pre-processor just commented out the original statement (COPY, INCLUDE, REPLACE, COPY..REPLACING, etc.) instead of overwriting it with the copybook content.
2. The very first step in code analysis is to identify and report the copybooks missing from the source code inventory. The parser currently reports missing components to Sysout. Is there any information tagged in the AST itself to identify missing copybooks, i.e. copybooks that could not be resolved and expanded by the pre-processor? In the current design, any un-expanded copybook could probably be treated as a missing copybook. Or, if point 1 (commenting out expanded COPY statements) is implemented, any un-commented COPY statement could be used to identify missing copybooks.
3. Another alternative could be to allow running the pre-processor independently of the parser. That's how the real COBOL compiler works: pre-process the program, comment out the COPY statements, expand the copybook content, and report any missing copybooks detected, all before the real parsing begins.

Thanks for your excellent framework.

Answer 1 · 2021-01-17T08:37:29.000Z

So the TL;DR is that everything already pretty much works as you expect. It's just that you don't get to see that in the XML dump.

The XML dump is something I made quite early on because it seemed like it might be useful, but I have never really used it myself. Its feature set is therefore pretty much undeveloped. To make the most of Koopa, as it currently stands, you need to use Java to inspect the AST. I'm open to submissions which make the XML more useful in practice, of course.

To the questions, in a bit more detail, then:

1. The original statements/text are not lost. You can ask any token what it replaced, if anything. (See Token.getReplaced().) This works for COPY statements, but also for other forms of replacement.
2. Any unresolved COPY statements remain in the AST. If we can replace, we replace. If we can't, we leave well alone. I would therefore expect any unresolved COPY statements to still be in the XML.
3. Koopa is not a compiler. It is a (best effort) parser. And to do the preprocessing step correctly (even if I were to limit it to identifying COPY statements) I need to parse.
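
As a rough illustration of the first two points, here is a minimal, self-contained Java sketch of how tokens could be split by origin once each token can report what it replaced. It does not use Koopa itself: SketchToken, TokenOriginSketch and the Origin enum are hypothetical stand-ins, and the sketch assumes that "replaced nothing" shows up as null and that the replaced text is a plain string. The real return type and behaviour of Token.getReplaced() may well differ, so check the Koopa sources before relying on any of this.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser token. This is NOT Koopa's Token class;
// it only mimics the idea behind Token.getReplaced().
final class SketchToken {
    private final String text;
    private final String replaced; // original source text this token replaced, or null

    SketchToken(String text, String replaced) {
        this.text = text;
        this.replaced = replaced;
    }

    String getText() {
        return text;
    }

    // Assumption: null means "this token did not replace anything".
    String getReplaced() {
        return replaced;
    }
}

final class TokenOriginSketch {
    enum Origin { PROGRAM, EXPANSION }

    // A token that replaced something came from an expansion (e.g. a copybook);
    // everything else is treated as coming straight from the program text.
    static Origin originOf(SketchToken token) {
        return token.getReplaced() == null ? Origin.PROGRAM : Origin.EXPANSION;
    }

    // Groups token texts by origin, keeping the original token order.
    static Map<Origin, List<String>> groupByOrigin(List<SketchToken> tokens) {
        Map<Origin, List<String>> groups = new EnumMap<>(Origin.class);
        for (Origin origin : Origin.values()) {
            groups.put(origin, new ArrayList<>());
        }
        for (SketchToken token : tokens) {
            groups.get(originOf(token)).add(token.getText());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = List.of(
            new SketchToken("MOVE", null),
            new SketchToken("CUSTOMER-ID", "COPY CUSTREC."), // came from an expanded COPY
            new SketchToken("COPY", null),                   // unresolved COPY stays as-is
            new SketchToken("MISSINGBOOK", null),
            new SketchToken(".", null));
        groupByOrigin(tokens).forEach(
            (origin, texts) -> System.out.println(origin + ": " + texts));
    }
}
```

In this sketch an unresolved COPY statement simply stays in the PROGRAM group with its COPY keyword intact, which matches the "if we can't replace, we leave well alone" behaviour described above; a missing-copybook report could then look for COPY statements that are still present after parsing.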

I hope that helps clear things up a bit more. If something's not clear (it's early on a Sunday for me), I'm happy to answer follow up questions. :-)

Answer 2 · 2021-02-11T08:34:20.000Z

Thanks Kris. I will definitely look into the Java way of exploring the AST.

If the XML dump of the AST contained the full information from the Java version of the AST, I think it would provide a nice way of decoupling the code analysis / transformation steps (the back end of the tool chain) from recognizing, tokenizing and parsing into an AST (the front end of the tool chain).

Answer 3 · 2021-02-11T12:31:49.000Z

I agree, but I want to leave it to someone who actually uses the XML to define the right structure for all that. Like I said, I have never really had any use for it, so that makes me a poor stakeholder for that feature.