jgm/peg-markdown

Bold/Emphasis regression

fletcher opened this issue · 15 comments

I apologize if this is a repeat --- I thought I emailed you about this, but now cannot find that email. So perhaps I never sent it.

A while back you fixed a problem with the strong/emph parsing that could cause the program to either fail to parse or to take a long time to parse. That worked great, however it caused a regression where the following no longer parses correctly (example sent by one of my users):

The ***quick*** brown ***fox*** jumped

The problem appears to be introduced in the 2a19a38 commit, and remains in the latest commit of peg-markdown.

Thanks!

Fletcher

jgm commented

It's very unlikely that the problem is from 2a19a38, which doesn't
have anything to do with emphasis and simply removes a clause
that should be redundant.

Have you confirmed that the version immediately prior to that commit
lacks the problem, and the version with that commit has it?

John

+++ Fletcher T. Penney [Jun 12 13 05:01 ]:

I apologize if this is a repeat --- I thought I emailed you about this,
but now cannot find that email. So perhaps I never sent it.

A while back you fixed a problem with the strong/emph parsing that
could cause the program to either fail to parse or to take a long time
to parse. That worked great, however it caused a regression where the
following no longer parses correctly (example sent by one of my users):
The _quick_ brown _fox_ jumped

The problem appears to be introduced in the [1]2a19a38 commit, and
remains in the latest commit of peg-markdown.

Thanks!

Fletcher


Reply to this email directly or [2]view it on GitHub.
[xJAuenYDiIoVt3LF3y68493iILMdls1w80D0BaSbtW8iiW09MXg2qoOe6rS_8Q_z.gif]

References

  1. 2a19a38
  2. #25

Yes. That's why I wrote in and included the builds. :)

The change you made to strong/emphasis fixed the problem you wanted to fix,
but broke this.

F

Sent from my iPhone

On Jun 12, 2013, at 12:57 PM, John MacFarlane notifications@github.com
wrote:

It's very unlikely that the problem is from 2a19a38, which doesn't
have anything to do with emphasis and simply removes a clause
that should be redundant.

Have you confirmed that the version immediately prior to that commit
lacks the problem, and the version with that commit has it?

John

+++ Fletcher T. Penney [Jun 12 13 05:01 ]:

I apologize if this is a repeat --- I thought I emailed you about this,
but now cannot find that email. So perhaps I never sent it.

A while back you fixed a problem with the strong/emph parsing that
could cause the program to either fail to parse or to take a long time
to parse. That worked great, however it caused a regression where the
following no longer parses correctly (example sent by one of my users):
The _quick_ brown _fox_ jumped

The problem appears to be introduced in the [1]2a19a38 commit, and
remains in the latest commit of peg-markdown.

Thanks!

Fletcher


Reply to this email directly or [2]view it on GitHub.
[xJAuenYDiIoVt3LF3y68493iILMdls1w80D0BaSbtW8iiW09MXg2qoOe6rS_8Q_z.gif]

References

2a19a38
2. #25


Reply to this email directly or view it on
GitHubhttps://github.com//issues/25#issuecomment-19339446
.

jgm commented

Did you look at the commit you referenced? It has to do with
links, not strong/emph.

+++ Fletcher T. Penney [Jun 12 13 10:00 ]:

Yes. That's why I wrote in and included the builds. :)
The change you made to strong/emphasis fixed the problem you wanted to
fix,
but broke this.
F
Sent from my iPhone
On Jun 12, 2013, at 12:57 PM, John MacFarlane
notifications@github.com
wrote:
It's very unlikely that the problem is from 2a19a38, which doesn't
have anything to do with emphasis and simply removes a clause
that should be redundant.
Have you confirmed that the version immediately prior to that commit
lacks the problem, and the version with that commit has it?
John
+++ Fletcher T. Penney [Jun 12 13 05:01 ]:

I apologize if this is a repeat --- I thought I emailed you about
this,
but now cannot find that email. So perhaps I never sent it.

A while back you fixed a problem with the strong/emph parsing that
could cause the program to either fail to parse or to take a long
time
to parse. That worked great, however it caused a regression where the
following no longer parses correctly (example sent by one of my
users):
The _quick_ brown _fox_ jumped

The problem appears to be introduced in the [1]2a19a38 commit, and
remains in the latest commit of peg-markdown.

Thanks!

Fletcher


Reply to this email directly or [2]view it on GitHub.

[xJAuenYDiIoVt3LF3y68493iILMdls1w80D0BaSbtW8iiW09MXg2qoOe6rS_8Q_z.gif]

References

  1. 2a19a38d075356650c59206069fa
    925028182522
  2. #25

    Reply to this email directly or view it on
    GitHub<#25 (comment)
    446>
    .


Reply to this email directly or [1]view it on GitHub.
[xJAuenYDiIoVt3LF3y68493iILMdls1w80D0BaSbtW8iiW09MXg2qoOe6rS_8Q_z.gif]

References

  1. #25 (comment)

I'll rephrase.

There is a bug in peg-markdown with the following being parsed incorrectly:

The ***quick*** brown ***fox*** jumped

In trying to be helpful and save you some trouble, I wanted to point out that this bug is actually a regression. Older versions handled it properly. Newer ones don't.

In trying to be more helpful, I wanted to point out that this regression appears to have occurred when you last "fixed" strong/emph parsing. Your fix did patch the problem you wanted to solve (in the examples I had noticed anyways), but caused this regression.

In trying to be even more helpful, I pasted the commit number related to your previous fix (though you certainly are familiar with your own code and probably would have found it on your own when you looked). I inadvertently selected one row higher than intended when copy/pasting. So you had to look one commit earlier. I apologize for being off by one row.

So that you don't have any further difficulty:

  • The commit that introduced the regression: 39dfc1c
  • One commit earlier, where things still work: 482a0ba
  • One commit later, where things are still broken: 2a19a38

Hopefully that is more clear.

Thanks!

Fletcher

jgm commented

I don't have much time to work on peg-markdown now, and I
don't use it for anything myself. So you have to understand that I
have very little motivation to look into this. I'm glad you're
being as helpful as possible by identifying the exact commit to look
at. There's no need to apologize, I was just pointing out that
you had the wrong commit.

jgm commented

I've had a look. I see why it's happening, but I don't see an easy fix, other than reverting the change, which would reintroduce #19 (this would be very bad for people using peg-markdown in web apps).

I could fix this if I had more time, but I don't right now.

Perhaps for now, you should create a local fork and revert this change in your version; #19 shouldn't be a big issue for your users.

So does that mean peg-markdown is heading into "end-of-life"?

I'll throw my opinion out for whatever it's worth --- I think the PEG parser you've created is fantastic. The performance is great, and it allows for a lot of flexibility. Obviously, there are some challenges when trying to match a grammar like Markdown that is slightly more "intuitive", "ambiguous", or choose your adjective here. But I'm very impressed with it. (Which should be obvious since the last two versions of MultiMarkdown have been built around this.)

It's sad to hear that it might not be maintained moving forward, as you've been more successful at optimizing performance in some of these parts of the parser than I might have been (e.g. strong/emph, quotes, etc.). But I also understand that if you don't use it any more, you don't use it any more.

In any event, I do thank you for all the effort you've put into peg-markdown!

F-

Actually, the problem in #19 was a big problem for a few of my users. (It never ceases to amaze me that people are surprised when programs don't work well on strange/invalid input; that said, I do agree that my app shouldn't break because someone types the wrong thing into it... ;)

I saw your comments in the thread about #19. I'll try to dig in and see if I can understand your thought process at the time and see if I can come up with anything.

What is needed is a way to avoid parsing Emph when you're inside an Emph environment (and similarly for Strong). But (unless I'm mistaken) the code parts of the PEG are not executed unless the parser matches. That means that we can't set a variable that says "I'm inside an Emph inline" that affects the Inline parsers inside the Emph.

At first glance, I thought this seemed trivial, but now I think I understand the difficulty you mentioned. I'll have to experiment, but I think the "recipe" matches "left to right", but the included c code then effectively runs "right to left" or "inside to outside" depending on your perspective. I'm not doing a good job describing this, but perhaps it makes sense to you since you lived in this PEG for so long. So we would be setting the flag indicating that we're inside an emph after the parser processes the inside of the emph and looks for the flag. Is that roughly correct?

jgm commented

I think the best approach is to do something like what I do here:
https://github.com/jgm/Markdown/blob/master/Markdown.hs#L1168

I think this could be done in PEG. It parses nested emph / strong with no backtracking at all, and thereby avoids exponential blowup. This has been pretty well tested on edge cases.

I'll take a look. Thanks.

F-

On Jun 12, 2013, at 4:10 PM, John MacFarlane notifications@github.com wrote:

I think the best approach is to do something like what I do here:
https://github.com/jgm/Markdown/blob/master/Markdown.hs#L1168

I think this could be done in PEG. It parses nested emph / strong with no backtracking at all, and thereby avoids exponential blowup. This has been pretty well tested on edge cases.


Reply to this email directly or view it on GitHub.

Fletcher T. Penney
fletcher@fletcherpenney.net

I tried digging back into this today. Haskell is fairly inscrutable if you don't speak it. So I'm still struggling with understanding what you did there, so I just went back to the original code and the issue that was discovered with this:

The ***quick*** brown ***fox*** jumped 

A few basic tests suggests that the current peg-markdown approach does well, except when strong and emph are paired like the example. Could this be solved by simply treating "***" and "___" as a single unit when they occur together and test for this before Strong or Emph? Am I missing something obvious?

I added this, and it seems to do fine in my tests. But I'm probably missing something...

F-

PS> Hope all is well!

On Jun 12, 2013, at 4:10 PM, John MacFarlane notifications@github.com wrote:

I think the best approach is to do something like what I do here:
https://github.com/jgm/Markdown/blob/master/Markdown.hs#L1168

I think this could be done in PEG. It parses nested emph / strong with no backtracking at all, and thereby avoids exponential blowup. This has been pretty well tested on edge cases.


Reply to this email directly or view it on GitHub.

Fletcher T. Penney
fletcher@fletcherpenney.net

jgm commented

+++ Fletcher T. Penney [Jul 10 13 14:46 ]:

I tried digging back into this today. Haskell is fairly inscrutable if
you don't speak it. So I'm still struggling with understanding what you
did there, so I just went back to the original code and the issue that
was discovered with this:
The _quick_ brown _fox_ jumped
A few basic tests suggests that the current peg-markdown approach does
well, except when strong and emph are paired like the example. Could
this be solved by simply treating "***" and "___" as a single unit when
they occur together and test for this before Strong or Emph?

Well, yes, this is what I do in the Haskell code. But it's complex.
After "_" you could have a number of different things, and which
you have changes the meaning of "
". For example, if it continues "hello**"
or "hello_ there**", then "**" means "".
But if it continues "hello
* there_", then "" means "".
And if it continues "hello (end of paragraph)", then "
" means "_**".

John

The simple approach seems to fix the original problem, but in adding some more edge cases I found another one that fails:

The _quick_ brown _fox_ jumped (this one fails in peg-markdown but not my test version)

The _quick_ brown fox jumped

The **quick* brown fox* jumped

The quick brown fox jumped

The *quick brown fox* jumped

The **quick* brown fox* jumped (this one fails in peg-markdown and my test version)

I haven't found a counter example that is worse with my approach, but I guess it would be best to go ahead and try to solve the next issue as well..... It's always something.... ;)

F-

On Jul 10, 2013, at 6:04 PM, John MacFarlane notifications@github.com wrote:

Well, yes, this is what I do in the Haskell code. But it's complex.
After "_" you could have a number of different things, and which
you have changes the meaning of "
". For example, if it continues "hello**"
or "hello_ there**", then "**" means "".
But if it continues "hello
* there_", then "" means "".
And if it continues "hello (end of paragraph)", then "
" means "_**".

John

Fletcher T. Penney
fletcher@fletcherpenney.net

John,

I had some time a week or so ago, and implemented a fix for this in MMD-4. ( ad69c0d1bc )

It's a separate project, so can't be pulled into peg-markdown, and I know you are no longer supporting that project. But should you decide you're interested, the MMD-4 strong/emph parser now seems to pass all the tests I have come up with so far.

Hope all is well!

F-

Fletcher T. Penney
fletcher@fletcherpenney.net

On Jul 10, 2013, at 6:31 PM, Fletcher Penney fletcher@fletcherpenney.net wrote:

The simple approach seems to fix the original problem, but in adding some more edge cases I found another one that fails:

The _quick_ brown _fox_ jumped (this one fails in peg-markdown but not my test version)

The _quick_ brown fox jumped

The **quick* brown fox* jumped

The quick brown fox jumped

The *quick brown fox* jumped

The **quick* brown fox* jumped (this one fails in peg-markdown and my test version)

I haven't found a counter example that is worse with my approach, but I guess it would be best to go ahead and try to solve the next issue as well..... It's always something.... ;)

F-

On Jul 10, 2013, at 6:04 PM, John MacFarlane notifications@github.com wrote:

Well, yes, this is what I do in the Haskell code. But it's complex.
After "_" you could have a number of different things, and which
you have changes the meaning of "
". For example, if it continues "hello**"
or "hello_ there**", then "**" means "".
But if it continues "hello
* there_", then "" means "".
And if it continues "hello (end of paragraph)", then "
" means "_**".

John

Fletcher T. Penney
fletcher@fletcherpenney.net

jgm commented

Fletcher, good to hear. No time to look at this now...