sirthias/pegdown

Pegdown processor hangs when data to be parsed is html

Opened this issue · 11 comments

I am trying to read from a URL that I do not have access to, so the request is redirected to the login page. As a result, the input to pegDownProcessor.markdownToHtml(data) is actually HTML.

I was expecting either null or a parsing exception, but instead it hangs at markdownToHtml(data).

Here is my code:

//url = http://localhost:8098/download/attachments/3145973/basics.text?version=1&modificationDate=1449060565788&api=v2
InputStream stream = getUrlStream(info, profileHelper, url, false, getSessionCookie(info, url));
String data = ScriptUtils.getStreamAsString(stream, info.getMacroParams().getString("encoding", ""));
PegDownProcessor pegDownProcessor = new PegDownProcessor(Extensions.ALL - (hardwraps ? 0 : Extensions.HARDWRAPS) + (allowHtml ? 0 : Extensions.SUPPRESS_ALL_HTML));
processed = pegDownProcessor.markdownToHtml(data);
log.debug("processed: {}", processed);

Any help dealing with this is appreciated.
Thanks!!!

vsch commented

@sunitapatro, I suggest you dump the data so that you can see what is returned. Then you should pre-screen this type of input before passing it to pegdown, to prevent it from hanging. Please post the errant data so that pegdown can potentially be updated to prevent such hangs.

Thanks for responding.

Here is the 'data' input to pegDownProcessor.markdownToHtml(data)
urldata.txt

vsch commented

@sunitapatro, what I meant by data is not the markdown you expect to get but the actual HTML returned by the getStreamAsString() caused by the redirect to login.

Add code to log the data received from the stream before passing it to pegdown, so that the cause of the hang can be debugged.

You should probably add code at the same point that will detect that the data coming back is not markdown but HTML and present it as is, without pegdown processing, so that when this happens there is some feedback to the user.
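
A minimal sketch of such a pre-screening check, assuming it is enough to look at the first non-blank characters of the response. The class and method names (HtmlSniffer, looksLikeHtmlPage) are hypothetical and not part of pegdown. It is only a heuristic: checking the HTTP status or Content-Type header of the response, where available, would be more reliable.

```java
import java.util.Locale;

/** Hypothetical helper, not part of pegdown: guesses whether a response body is an HTML page. */
final class HtmlSniffer {
    private HtmlSniffer() {}

    /** Returns true if the first non-blank characters look like the start of an HTML document. */
    static boolean looksLikeHtmlPage(String data) {
        if (data == null) {
            return false;
        }
        int start = 0;
        // Skip a byte-order mark and any leading whitespace.
        while (start < data.length()
                && (data.charAt(start) == '\uFEFF' || Character.isWhitespace(data.charAt(start)))) {
            start++;
        }
        // "<!doctype html" is 14 characters long, so a 16-character prefix is enough.
        int end = Math.min(data.length(), start + 16);
        String head = data.substring(start, end).toLowerCase(Locale.ROOT);
        return head.startsWith("<!doctype html") || head.startsWith("<html");
    }
}
```

The calling code could then show the response as is, or report an error to the user, whenever looksLikeHtmlPage(data) returns true, and call markdownToHtml(data) only otherwise.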

The data returned by getStreamAsString() is exactly the content of "urldata.txt", which I shared earlier. I only saved it as .txt in order to share it here.
It is a NEGATIVE test: getStreamAsString() was supposed to return a .text file (containing markdown) read from a URL, but since that URL needs authentication, it returns the login.html page instead. So urldata.txt contains nothing but HTML, as you can see.

In short, urldata.txt (which contains HTML) is the input to PegDownProcessor.

I understand that it is the wrong content for PegDownProcessor, but I was expecting PegDownProcessor to throw an exception, return null, or something like that. In reality, it hangs.
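
One way for calling code to get null back instead of a hang is to run the conversion on a worker thread with a deadline. The following is a minimal sketch under that assumption; BoundedRenderer and renderWithTimeout are hypothetical names, and the renderer argument stands in for any String-to-String conversion. Note that cancelling only interrupts the worker thread: a loop that ignores interrupts keeps running in the background, which is why the worker is a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical helper, not part of pegdown: bounds how long a conversion may take. */
final class BoundedRenderer {
    private BoundedRenderer() {}

    /** Returns the renderer's result, or null if it does not finish within timeoutMillis. */
    static String renderWithTimeout(Function<String, String> renderer, String input, long timeoutMillis) {
        // Daemon thread, so a stuck conversion cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "markdown-render");
            t.setDaemon(true);
            return t;
        });
        Callable<String> task = () -> renderer.apply(input);
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it may not stop if it ignores interrupts
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Rendering failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the code from this issue, the call might look like renderWithTimeout(pegDownProcessor::markdownToHtml, data, 5000), where the 5000 ms limit is an arbitrary choice.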

vsch commented

"Nothing but HTML content" covers a universe of possibilities. It is impossible to guess what exactly is causing the problem in the pegdown parser without input that reproduces it. After all, pegdown is just another program, like yours, and all debugging requires input in order to narrow down where things go wrong.

Input validation is really limited to markdown and the HTML tags pegdown handles. Handling an unadulterated HTML response from a server is outside of its intended application. I do agree that it should not hang, but without a file that causes the hang I can't begin to figure out what causes it.

It is up to the implementation specific code to make sure that what is fed into pegdown can at least be considered as markdown.

@vsch,

Please read the contents of urldata.txt into a stream, convert it to a String, and pass it to PegDownProcessor. You will then be able to reproduce the PegDownProcessor hang.
urldata.txt

@vsch
Any update on this?

vsch commented

@sunitapatro, to save time, I opened the file in pegdown using my IntelliJ IDEA plugin (idea-multimarkdown) which uses pegdown as the parser, by renaming it to urldata.md and opening it in IDEA. I saw no issues and no hangs.

The problem occurs when you read the file as a stream and convert it to a string. It is not a pegdown issue but a problem in the code before the pegdown call.

As a test, before passing the string data to pegdown, I suggest converting it to a char[] (which is what pegdown does via string.toCharArray()), dumping the char array as bytes to a file, and examining what content you are really passing to pegdown. The file you provided does not cause pegdown to hang, so the issue is somewhere else.
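
The dump described above could be sketched as follows. InputDump and its methods are hypothetical names. Each char is written as two big-endian bytes, so the file shows exactly what is in the char[] with no charset conversion, and it can be inspected in a hex viewer. The control-character count is an extra assumption about what might be worth looking for, not something known to cause the hang.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical debugging helper, not part of pegdown. */
final class InputDump {
    private InputDump() {}

    /** Encodes each char as two big-endian bytes, without any charset conversion. */
    static byte[] toUtf16Bytes(char[] chars) {
        ByteBuffer buffer = ByteBuffer.allocate(chars.length * 2); // big-endian by default
        buffer.asCharBuffer().put(chars);
        return buffer.array();
    }

    /** Counts control characters other than tab, line feed, and carriage return. */
    static int countControlChars(String data) {
        int count = 0;
        for (char c : data.toCharArray()) {
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                count++;
            }
        }
        return count;
    }

    /** Writes the raw chars of data to target. */
    static void dump(String data, Path target) throws IOException {
        Files.write(target, toUtf16Bytes(data.toCharArray()));
    }
}
```

Calling InputDump.dump(data, Paths.get("pegdown-input.bin")) just before markdownToHtml(data) would capture the exact input; the file name here is arbitrary.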

I ran into this problem too. The version is 1.6.0.

final PegDownProcessor pegDownProcessor = new PegDownProcessor(Extensions.ALL_OPTIONALS | Extensions.ALL_WITH_OPTIONALS, 5000);
//markdownText is pure html
final RootNode node = pegDownProcessor.parseMarkdown(markdownText.toCharArray());

With code like the above, execution stops exactly at the parseMarkdown call.
Sorry for my poor English.

It seems that the code falls into a never-ending loop.

I solved the problem by wrapping the data in extra tags, like "<html><body>" + data + "</body></html>".
Here data is pure HTML which contains many <li> elements.