antchfx/htmlquery

Panic when parsing an HTML page

aaronchen2k opened this issue · 6 comments

I get a fatal panic when executing htmlquery.QueryAll on a web page fetched from https://baidu.com or loaded from the local file baidu.html, as in the script below:
https://github.com/aaronchen2k/deeptest/blob/main/cmd/test/htmlquery_test.go

It works well if I use an HTML string like this one:
https://github.com/aaronchen2k/deeptest/blob/main/internal/server/modules/v1/helper/mock/html.go

Thanks!

Maybe the HTTP response is gzip-compressed. You should decompress it before parsing.

In the test script test/htmlquery_test.go, I read the HTML from a local file, and it still causes a fatal panic.
Please help check, thanks.

html := fileUtils.ReadFile("baidu.html")

The local baidu.html file works fine with my local test code.

Test code below:

	f, err := os.Open("./baidu.html")
	if err != nil {
		panic(err)
	}
	doc, err := htmlquery.Parse(f)
	if err != nil {
		panic(err)
	}
	// "//form[@id=1]/input[@id=\"kw\"]/@class" is invalid; changed to @id="1".
	expression := `//form[@id="1"]/input[@id="kw"]/@class`
	list, err := htmlquery.QueryAll(doc, expression)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(list))

Thank you for the feedback!
I updated the code and now there is no error, but why is the list always nil?

Your XPath query is not correct. The local HTML file does not contain any form with an id="1" attribute. Try //form[@id="form"]/input[@id="kw"]/@class instead. You can use Chrome DevTools (Inspect) or https://www.freeformatter.com/xpath-tester.html to test your XPath.