headzoo/surf

New method: webpage "as user would see" - like htmlunit page.asNormalizedText()

BigB84 opened this issue · 2 comments

Hi,
I'm writing an app that needs to read a webpage as a user would see it, then save it to a file (without interaction).
Here's a webpage I need to process.

I've read the docs and tried to do it with bow.Body(), but that returns the HTML source, including tags like <pre> and <p>, so reading it with bufio produces a mess. Of course I could post-process it by stripping everything that starts with <, but it would take a lot of code to cover all scenarios.

I've done this once in Java with HtmlUnit's page.asNormalizedText(), and in Python with Selenium. (I know there's Selenium for Go, but I'd rather avoid the additional webdriver configuration etc., which is also why I use your library :))

Do you think it would be good to add such a feature? Or, if you don't think it's a good idea, could you help me find another solution?
Thanks in advance

You could try using a specific CSS selector instead of bow.Body(). For instance:

bow.Dom().Find("body p pre").Each(func(_ int, s *goquery.Selection) {
    fmt.Println(s.Text())
})

That should give you the text inside of the inner <pre> tag.
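
If you do end up post-processing the raw HTML yourself, here is a rough standard-library-only sketch of the tag-stripping idea. It is not part of surf or goquery, and visibleText and isHidden are hypothetical helper names. It assumes reasonably well-formed markup, because it uses encoding/xml in non-strict mode rather than a real HTML parser, so scripts containing a bare < may trip it up. It also collapses all whitespace to single spaces, which is simpler than what asNormalizedText() does.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// isHidden reports whether an element's text is never shown to the user.
// Hypothetical helper for this sketch only.
func isHidden(name string) bool {
	switch strings.ToLower(name) {
	case "script", "style":
		return true
	}
	return false
}

// visibleText is a hypothetical helper (not part of surf or goquery).
// It walks the markup token by token, keeps character data, and skips
// anything inside <script> or <style>.
func visibleText(markup string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	dec.Strict = false                // tolerate HTML-style markup
	dec.AutoClose = xml.HTMLAutoClose // treat <br>, <img>, ... as self-closing
	dec.Entity = xml.HTMLEntity       // resolve entities such as &amp;

	var b strings.Builder
	skip := 0 // nesting depth inside <script> or <style>
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if isHidden(t.Name.Local) {
				skip++
			}
		case xml.EndElement:
			if isHidden(t.Name.Local) && skip > 0 {
				skip--
			}
		case xml.CharData:
			if skip == 0 {
				b.Write(t)
				// Simplification: separate every text node with a space so
				// that adjacent blocks such as </p><pre> do not run together.
				b.WriteByte(' ')
			}
		}
	}
	// Collapse runs of whitespace into single spaces.
	return strings.Join(strings.Fields(b.String()), " "), nil
}

func main() {
	page := `<html><head><style>p{color:red}</style></head>` +
		`<body><p>Hello <b>world</b></p><pre>line one</pre>` +
		`<script>var x = 1;</script></body></html>`
	text, err := visibleText(page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

With surf you would pass the string you already get from bow.Body() to visibleText. For real-world pages the goquery selector approach above is likely to be more robust, since it sits on top of a proper HTML parser.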

Thanks! :)