mbj/unparser

Errors

marcandre opened this issue · 17 comments

A few files that generate errors with unparser, but have ok syntax according to ruby -c:

https://github.com/acrmp/foodcritic/blob/master/features/support/cookbook_helpers.rb
https://github.com/acrmp/foodcritic/blob/master/lib/foodcritic/notifications.rb
https://github.com/acrmp/foodcritic/blob/master/lib/foodcritic/rules/fc022.rb
https://github.com/acrmp/foodcritic/blob/master/lib/foodcritic/rules/fc025.rb
https://github.com/acrmp/foodcritic/blob/master/lib/foodcritic/rules/fc044.rb
https://github.com/acrmp/foodcritic/blob/master/lib/foodcritic/rules/fc047.rb
https://github.com/activeadmin/activeadmin/blob/master/features/step_definitions/filesystem_steps.rb
https://github.com/activemerchant/active_merchant/blob/master/lib/active_merchant/billing/credit_card.rb
https://github.com/activemerchant/active_merchant/blob/master/lib/active_merchant/billing/gateways/balanced.rb
https://github.com/activemerchant/active_merchant/blob/master/lib/active_merchant/billing/response.rb
https://github.com/activemerchant/active_merchant/blob/master/test/remote/gateways/remote_inspire_test.rb
https://github.com/activemerchant/active_merchant/blob/master/test/unit/gateways/eway_rapid_test.rb
https://github.com/activemerchant/active_merchant/blob/master/test/unit/gateways/first_pay_test.rb
https://github.com/activemerchant/active_merchant/blob/master/test/unit/gateways/payment_express_test.rb
https://github.com/ahoward/open4/blob/master/samples/jesse-caldwell.rb
https://github.com/airbrake/airbrake/blob/master/spec/integration/sinatra/sinatra_spec.rb
https://github.com/alexreisner/geocoder/blob/master/lib/geocoder/calculations.rb
https://github.com/alexreisner/geocoder/blob/master/lib/geocoder/lookups/maxmind_geoip2.rb
https://github.com/asciidoctor/asciidoctor/blob/master/lib/asciidoctor/converter/manpage.rb
https://github.com/asciidoctor/asciidoctor/blob/master/lib/asciidoctor/syntax_highlighter/highlightjs.rb
https://github.com/asciidoctor/asciidoctor/blob/master/test/blocks_test.rb
https://github.com/asciidoctor/asciidoctor/blob/master/test/manpage_test.rb
https://github.com/asciidoctor/asciidoctor/blob/master/test/syntax_highlighter_test.rb
https://github.com/asciidoctor/kramdown-asciidoc/blob/master/lib/kramdown-asciidoc/cli.rb
https://github.com/asciidoctor/kramdown-asciidoc/blob/master/spec/cli_spec.rb
https://github.com/aws/aws-sdk-ruby/blob/master/build_tools/aws-sdk-code-generator/lib/aws-sdk-code-generator/resource_batch_builder.rb
https://github.com/aws/aws-sdk-ruby/blob/master/build_tools/aws-sdk-code-generator/lib/aws-sdk-code-generator/resource_client_request_documentation.rb
https://github.com/aws/aws-sdk-ruby/blob/master/build_tools/aws-sdk-code-generator/lib/aws-sdk-code-generator/views/types_module.rb
https://github.com/aws/aws-sdk-ruby/blob/master/build_tools/aws-sdk-code-generator/spec/spec_helper.rb
https://github.com/aws/aws-sdk-ruby/blob/master/gems/aws-sdk-core/lib/aws-sdk-core/errors.rb
https://github.com/CanCanCommunity/cancancan/blob/master/spec/cancan/model_adapters/active_record_adapter_spec.rb
https://github.com/CanCanCommunity/cancancan/blob/master/spec/cancan/model_adapters/has_and_belongs_to_many_spec.rb
https://github.com/CapnKernul/minitest_reporters/blob/master/lib/minitest/reporters/mean_time_reporter.rb
https://github.com/CapnKernul/minitest_reporters/blob/master/test/fixtures/junit_filename_bug_example_test.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/command/init.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/command/repo/update.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/command/spec/create.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/command/spec/lint.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/generator/copy_resources_script.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/generator/copy_xcframework_script.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/generator/embed_frameworks_script.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/generator/umbrella_header.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/installer.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/installer/user_project_integrator/target_integrator/xcconfig_integrator.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/installer/xcode/pods_project_generator/pod_target_installer.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/resolver.rb
https://github.com/CocoaPods/CocoaPods/blob/master/lib/cocoapods/sources_manager.rb
https://github.com/CocoaPods/CocoaPods/blob/master/spec/spec_helper/webmock.rb
https://github.com/CocoaPods/CocoaPods/blob/master/spec/unit/user_interface/inspector_reporter_spec.rb
https://github.com/Compass/compass_rails/blob/master/test/helpers/rails_project.rb
https://github.com/RiotGames/berkshelf/blob/master/features/step_definitions/berksfile_steps.rb
https://github.com/TwP/logging/blob/master/lib/logging/layouts/pattern.rb
https://github.com/TwP/logging/blob/master/lib/logging/logger.rb
https://github.com/TwP/logging/blob/master/test/layouts/test_json.rb
https://github.com/TwP/logging/blob/master/test/layouts/test_yaml.rb
https://github.com/TwP/turn/blob/master/lib/turn/components/case.rb
https://github.com/TwP/turn/blob/master/lib/turn/reporters/cue_reporter.rb
https://github.com/TwP/turn/blob/master/lib/turn/reporters/outline_reporter.rb
Edit: restricted to non UTF-8 issues

mbj commented

A few files that generate errors with unparser, but have ok syntax according to ruby -c:

I've to go with ok syntax according to ruby-parser. Nevertheless there are valid cases there.

A few files that generate errors with unparser, but have ok syntax according to ruby -c:

I've to go with ok syntax according to ruby-parser. Nevertheless there are valid cases there.

Yes, indeed. Maybe ruby-parse should have a -c option? In any case, apart from encoding issues I'd be curious of any cases of ruby-parser having troubles with any of these; they are all from very popular gems.

mbj commented

@marcandre All of them (past the earlier 2 you reported that where easy to fix) seem to be issues from the string / dstring madness.

Most of them are triggered by this edge case I intentionally unparsed "this way":

Original-Source:
'a\
b'

Generated-Source:
"a\\\n" "b"
Original-Node:
(dstr
  (str "a\\\n")
  (str "b"))
Generated-Node:
(dstr
  (str "a\\\n")
  (str "b"))
Success: x.rb

The problem is that there are AST constructs that are "only" reachable via dstring segments "segment-a" "segment-b" or via changing the delimiter to '.

The workaround to "create more segments to unparse dstrings" triggers invalid round trips (but same semantics so far) for the extended test cases you found.

It appears that this workaround to avoid "detecting the 3rd class of string flavor in the AST" may not be worth the hazzle.

So I may have to change unparser to detect ' delimiter to reduce the number of segments. Meh.

It's not clear to me what the issue is, but I trust it should be possible to find an equivalent Ruby source code from the AST. Good luck 👍

mbj commented

It's not clear to me what the issue is, but I trust it should be possible to find an equivalent Ruby source code from the AST. Good luck

There is enough ruby code to match all ASTs. The challenge is supporting them all at once without going insane ;)

Let me try to explain it, as this activity will make it more clear for my mind.

Basically: I used to "cheat" on dstr nodes that do not have interpolation, and that cheat backfires with the examples you found.

This ruby file:

'a\
b'

which does NOT contain an interpolation, and does NOT represent an syntax level contatentation (as "foo" "bar" does), still generates a dstr in the AST:

(dstr
  (str "a\\\n")
  (str "b"))

I used to unparse all of "non intrpolating dynstrings" like the concatenation, including the case I cited above. And this "hack" (Which makes the code far easier) causes the issues you reported in this issue (which so far are 90% about string issues, where the string unparse would when evaluated be semantically equivalent but not at the AST level).

I basically have to undo this hack, and emit that string above in singlequotes (not syntax level concatenation) and than can set better segment brakes on literal (non interpolating) dynstrings that removes the issues.

mbj commented

@marcandre I've investigated almost all of the errors by now and they are all "dstring mangling" I talked about earlier.

I suspect it may be easier to change parser to produce "compact/minimized" dstrings instead of making unparser aware of all the quirks.

Talking over in the parser repository for possible AST improvements.whitequark/parser#765 (comment)

If the AST where only encoding semantics and not syntax leftovers the problems you reported here would disappear. I suspect its easier for me to contribute AST normalization to parser than to fix it on unparsers side.

Most of them are triggered by this edge case

Hmm, this is much harder to fix because that's what is emitted by the lexer:

bin/ruby-parse --30 -E test.rb
'a\
^ tSTRING_BEG "'"                               expr_end     [0 <= cond] [0 <= cmdarg]
'a\
 ^ tSTRING_CONTENT "a\\\n"                      expr_end     [0 <= cond] [0 <= cmdarg]
b'
^ tSTRING_CONTENT "b"                           expr_end     [0 <= cond] [0 <= cmdarg]
b'
 ^ tSTRING_END "'"                              expr_end     [0 <= cond] [0 <= cmdarg]
b'
  ^ tNL nil                                     line_begin   [0 <= cond] [0 <= cmdarg]

^ false "$eof"                                  line_begin   [0 <= cond] [0 <= cmdarg]
(dstr
  (str "a\\\n")
  (str "b"))

^ tSTRING_CONTENT is emitted twice. I believe it's a common rule of parser to emit multi-line strings line-by-line to ship source maps for line ends. Otherwise linters would have troubles in basic rules like line length validators.

mbj commented

@iliabylich I'm totally fine to use "consecutive" str nodes inside dstr as a hint for multi line / heredoc etc, even in the absence of interpolation.

But we could potentially collapse some of the odd cases (disussed above) that would flatten the mental complexity courve in unparser in the area already.

There are situations where parser right now produces different AST for the same semantics on dstrings. If we cut these cases down I'm fine.

I'm not sure if I understand the point of reducing AST for some cases when users still need all cases. Could you explain it please? Yes, it would be easier to support simple cases (because they'll have simpler AST), but to support 100% of AST patterns you need to handle complex AST anyway.

Also, (just out of curiosity) may I ask how do you perform unparsing by not looking at source maps of string literals? For example:

"foo'bar"
'foo"bar"
%q|foo'"bar|
%q{foo'"|bar}

Each string terminator is used to read string content in a very specific way (i.e. to treat chars that are not equal to terminator as a string content, " reads all chars except " etc) and at the same time there's no "universal" terminator.

There are situations where parser right now produces different AST for the same semantics on dstrings. If we cut these cases down I'm fine.

It definitely should be fixed if they are actually the same. Please, don't hesitate to create an issue 😄 I remember what you wrote about being busy, but we have a month, so it can wait.

mbj commented

I'm not sure if I understand the point of reducing AST for some cases when users still need all cases. Could you explain it please?

Simply managing the complexity budget. The core constraint of unparser is that it "can never look at source maps". And produce concrete syntax that when parsed: Returns the same AST.

If there is a group of ASTs that all represent the identical execution semantics, but can only be produced from different concrete syntax unparser has to special case, to select the concrete syntax that fits it. This is common for unparsers domain and expected.

The problem occurs if there group of becomes to-large. And the most complex group in parsers current AST format is dynamic strings.

The mental overhead to deal with the various edge cases grows exponentially. So if we can remove "one" special case, example %(); it may remove super linear complexity in unparser.

Each string terminator is used to read string content in a very specific way (i.e. to treat chars that are not equal to terminator as a string content, " reads all chars except " etc) and at the same time there's no "universal" terminator.

I had to remove the 2nd one as it had an unterminated string.

bundle exec unparser --verbose x.rb
x.rb
Original-Source:
"foo'bar"
%q|foo'"bar|
%q{foo'"|bar}

Generated-Source:
"foo'bar"
"foo'\"bar"
"foo'\"|bar"

Original-Node:
(begin
  (str "foo'bar")
  (str "foo'\"bar")
  (str "foo'\"|bar"))
Generated-Node:
(begin
  (str "foo'bar")
  (str "foo'\"bar")
  (str "foo'\"|bar"))
Success: x.rb

Unparser basically choses " as delimiter. And than re-escapes the payload. But it does not do this by itself it relies on rubies String#inspect to do this job.

I may emit strings without interpolation needs within ' in the future.

Edit: Basically for str unparsing is a simple as calling .inspect on that string ;) my problems are with dstr variance.

mbj commented

@iliabylich You have have misunderstood unparsers propose: Its NOT meant to reproduce your concrete syntax. Its only meant to reproduce concrete syntax that when parsed again produces the same AST.

The delimiters you gave me in an example above disappear in the AST, rightfully. As they have no execution semantics attached.

The issue I'm trying to put in writing is: There are too many ASTs in dynamic strings that have the same execution semantics but different concrete syntax. If we could remove just 10% of these complexity in unparser on dynamic strings drops significantly.

The ideal AST is: 1:1 execution semantics.

But dstr have a (too high for my pleasure) N:1 execution semantics. Reaching 1:1 is impossible (just an ideal) but reducing N is the key here.

Else the dstr is "to much" a source map by itself, and not an AST ;)

mbj commented

And just to be clear: I do not blame anyone for the current situation I describe as dstr variance. Making a free standing ruby parser was a big archievement for the ruby community. And I take it over any alternative. I'm just happy to see a way to reduce it in parser and ease my job with unparser.

I totally get it, no offense taken. I need exact code samples to talk about, and I agree that we should simplify them if possible

mbj commented

I'll get you a good list. I can source it from unparsers specs and simply commenting out special cases ;)

mbj commented

Also I'll not include multi line ones, as there is a good reason to have these for source maps.