Unexpected result with additional rule for custom self-closing tags
inliquid opened this issue · 5 comments
I was following this example to write a rule that processes custom <mention> tags in my input: https://github.com/JohannesKaufmann/html-to-markdown/blob/master/examples/custom_tag/main.go
The result was quite surprising, and I'm not sure whether this is a bug, misuse on my part, or a limitation of the library.
Code:
package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
test
<mention user="user1" />
<mention user="user2" />
<mention user="user3" />
blabla
`
	// Custom rule: turn every <mention user="..."> element into "@<user>".
	rule := md.Rule{
		Filter: []string{"mention"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"
			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}
			return &result
		},
	}

	conv := md.NewConverter("", true, nil)
	conv.AddRules(rule)

	markdown, err := conv.ConvertString(html)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println("markdown:\n", markdown)
}
Expected output:
markdown:
test
@user1
@user2
@user3
blabla
Observed output:
markdown:
test
@user1
Moreover, if I add a few print statements to debug what is going on in the Replacement calls, it gets even weirder:
Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
	result := "@"
	u, ok := selec.Attr("user")
	if ok {
		result += u
	} else {
		result += "unknown"
	}
	html, err := selec.Html()
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println("content:", content)
	fmt.Println("selec:", html)
	fmt.Println("result:", result)
	return &result
},
Output:
content:
blabla
selec:
blabla
result: @user3
content: @user3
selec:
<mention user="user3">
blabla
</mention>
result: @user2
content: @user2
selec:
<mention user="user2">
<mention user="user3">
blabla
</mention></mention>
result: @user1
@inliquid Your code seems okay 🤔
I think this might be related to self-closing HTML tags. Goquery (or rather the underlying net/html parser) might not expect them.
What happens if you write <mention user="user1"></mention> instead of <mention user="user1" />?
Yep, in this case the output is fine:
markdown:
test
@user1@user2@user3
blabla
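Since an explicit closing tag gives the expected conversion, one possible workaround is to rewrite the self-closing form before calling ConvertString. The following is only a minimal sketch using the standard library. The helper name expandMentions and the regular expression are illustrative assumptions, not part of html-to-markdown, and the pattern assumes attribute values never contain ">".

```go
package main

import (
	"fmt"
	"regexp"
)

// selfClosingMention matches a self-closing <mention ... /> tag and captures
// its attribute text. Assumption: attribute values do not contain ">".
var selfClosingMention = regexp.MustCompile(`<mention\b([^>]*?)\s*/>`)

// expandMentions (hypothetical helper) rewrites <mention ... /> as
// <mention ...></mention>, the form that converted correctly in this thread.
func expandMentions(input string) string {
	return selfClosingMention.ReplaceAllString(input, `<mention$1></mention>`)
}

func main() {
	in := `<mention user="user1" />` + "\n" + `<mention user="user2" />`
	fmt.Println(expandMentions(in))
}
```

The rewritten string could then be passed to conv.ConvertString in place of the raw input.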
@inliquid yeah sorry, that just doesn't seem to be supported by golang's net/html parser. Self-closing tags are considered illegal by the parser. I can't do anything about that 🤷‍♂️
The library can only operate on what the parser can parse. And they do handle quite a lot of edge cases. Just not that one, probably because it's not widely used enough...
> Self-closing tags are considered illegal by the parser
Hmm... Actually, I have a lot of self-closing tags, such as img, and all of them are translated to Markdown by the library with no issues.
@inliquid There are a few self-closing tags that are accepted in HTML5 (I think mostly for historical reasons & backward compatibility). For example <img />, <hr />, <br />, <meta />, ...
They are kind of special elements (the HTML spec calls them void elements). I think technically you don't even need the trailing slash, so <img src="x"> would also work.
The net/html parser deals with all of this weirdness. But if you think a certain behaviour is wrong, you need to raise an issue with the net/html package directly.
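The distinction drawn above between elements like <img /> and custom tags can be turned into a more general preprocessing step: leave the special (void) elements alone and give every other self-closing tag an explicit end tag. This is a rough sketch under stated assumptions. The void element list reflects my reading of the current HTML standard, expandNonVoid is a hypothetical helper that is not part of html-to-markdown or net/html, and the regular expression assumes attribute values never contain ">".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// voidElements lists elements that never take an end tag
// (assumed from the HTML standard's list of void elements).
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// selfClosingTag captures the name and attribute text of any <name ... /> tag.
// Assumption: attribute values do not contain ">".
var selfClosingTag = regexp.MustCompile(`<([A-Za-z][A-Za-z0-9-]*)([^>]*?)\s*/>`)

// expandNonVoid (hypothetical helper) rewrites <x ... /> as <x ...></x>
// unless x is a void element such as img or br.
func expandNonVoid(input string) string {
	return selfClosingTag.ReplaceAllStringFunc(input, func(tag string) string {
		m := selfClosingTag.FindStringSubmatch(tag)
		name, attrs := m[1], m[2]
		if voidElements[strings.ToLower(name)] {
			return tag // void elements are left exactly as written
		}
		return "<" + name + attrs + "></" + name + ">"
	})
}

func main() {
	fmt.Println(expandNonVoid(`<img src="x" /><mention user="user1" /><br/>`))
}
```

A regular expression is a blunt tool for HTML, so this is only reasonable for input whose shape is known in advance.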