JohannesKaufmann/html-to-markdown

Unexpected result with additional rule for custom self-closing tags

inliquid opened this issue · 5 comments

I was following this example to write a rule to process custom <mention> tags in my input: https://github.com/JohannesKaufmann/html-to-markdown/blob/master/examples/custom_tag/main.go

The result was quite surprising, though I'm not sure whether this is a bug, misuse on my part, or a limitation of the library.

Code:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
	test
	
	<mention user="user1" />
	<mention user="user2" />
	<mention user="user3" />

	blabla
	`

	rule := md.Rule{
		Filter: []string{"mention"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			return &result
		},
	}

	conv := md.NewConverter("", true, nil)
	conv.AddRules(rule)

	markdown, err := conv.ConvertString(html)
	if err != nil {
		log.Fatalln(err)
	}

	fmt.Println("markdown:\n", markdown)
}

Expected output:

markdown:
 test
	
 @user1
 @user2
 @user3

 blabla

Observed output:

markdown:
 test

 @user1

Moreover, if I add a few debug prints to see what is going on in the Replacement calls, it gets even weirder:

		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			html, err := selec.Html()
			if err != nil {
				log.Fatalln(err)
			}

			fmt.Println("content:", content)
			fmt.Println("selec:", html)
			fmt.Println("result:", result)

			return &result
		},

Output:

content: 

 blabla  

selec:

        blabla 

result: @user3 
content: @user3
selec:
        <mention user="user3">

        blabla
        </mention>
result: @user2
content: @user2
selec:
        <mention user="user2">
        <mention user="user3">

        blabla
        </mention></mention>
result: @user1

@inliquid Your code seems okay 🤔

I think this might be related to Self-Closing HTML Tags. Goquery (or rather the underlying net/html parser) might not expect that.

What happens if you use <mention user="user1"></mention> instead of <mention user="user1" />?

Yep, in this case the output is fine:

markdown:
 test

 @user1@user2@user3

 blabla

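@inliquid yeah sorry, that just doesn't seem to be supported by Go's net/html parser. Self-closing tags are considered illegal by the parser. The trailing slash on a custom tag is ignored, so each <mention ... /> is read as an opening tag that never gets closed, which is why the elements end up nested inside each other in your debug output. I can't do anything about that 🤷‍♂️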

The library can only operate on what the parser can parse. And they do handle quite a lot of edge cases. Just not that, because it's probably not widely used enough...
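One possible workaround, if the input cannot be changed at the source, is to rewrite the self-closing form into an explicit open/close pair before handing the HTML to the converter. The sketch below is only an illustration: `expandSelfClosing` is a hypothetical helper (not part of html-to-markdown or goquery), and it relies on a regular expression, so it assumes attribute values never contain a `>` character.

```go
package main

import (
	"fmt"
	"regexp"
)

// expandSelfClosing is a hypothetical helper: it rewrites <tag ... /> into
// <tag ...></tag> for a single custom tag name. It is regex based and
// assumes attribute values do not contain '>'.
func expandSelfClosing(input, tag string) string {
	// Group 1 captures the attributes (including the leading whitespace), if any.
	re := regexp.MustCompile(`<` + regexp.QuoteMeta(tag) + `(\s[^>]*?)?\s*/>`)
	return re.ReplaceAllString(input, "<"+tag+"${1}></"+tag+">")
}

func main() {
	in := `<mention user="user1" />
<mention user="user2" />`

	// Each self-closing mention becomes an explicit open/close pair.
	fmt.Println(expandSelfClosing(in, "mention"))
}
```

The result of `expandSelfClosing(html, "mention")` could then be passed to `conv.ConvertString` in place of the raw input, which matches the explicit `<mention user="user1"></mention>` form that produced the correct output above.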

Self-closing tags are considered illegal by the parser

Hmm.. Actually I have a lot of self-closing tags, such as img ones and all of them are translated to markdown by the library with no issues.

@inliquid There are a few self-closing tags that are accepted in HTML5 (I think mostly for historical reasons & backward compatibility). For example <img />, <hr />, <br />, <meta />, ...

They are kind of special elements. I think technically you don't even need the end slash, so <img src="x"> would also work.
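For reference, the HTML standard calls these special elements "void elements". A small sketch of how one might check for them before deciding whether a self-closing tag is safe to leave as-is; `isVoidElement` is a hypothetical helper written for illustration, not something exported by net/html or this library.

```go
package main

import (
	"fmt"
	"strings"
)

// voidElements lists the void elements from the HTML standard.
// These never have content or an end tag, so <img /> and <img> are equivalent.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// isVoidElement is a hypothetical helper: it reports whether the given
// tag name is a void element (HTML tag names are case-insensitive).
func isVoidElement(name string) bool {
	return voidElements[strings.ToLower(name)]
}

func main() {
	for _, name := range []string{"img", "br", "mention"} {
		fmt.Printf("%s: void=%v\n", name, isVoidElement(name))
	}
}
```

A custom tag such as `mention` is not in this set, so it needs an explicit closing tag.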

The net/html parser deals with all of this weirdness. But if you think a certain behaviour is wrong, you need to raise an issue with the net/html package directly.