/unhtml

HTML unmarshaler for golang

Primary LanguageGoMIT LicenseMIT

Coverage Status Go Report Card Build Status License: MIT Documentation

Table of Contents

Example & Performance

A HTML file

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div id="test">
        <ul>
            <li>0</li>
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
        </div>
        <p>Hello World!</p>
        <p>10</p>
        <p>3.14</p>
        <p>true</p>
    </div>
</body>
</html>

Read it

AllTypeHTML, _ := ioutil.ReadFile("testHTML/all-type.html")

If we want to parse it and get the values we want, like the following structs, how should we do it?

package example

type (
	PartTypesStruct struct {
		Slice   []int    
		Struct  TestUser 
		String  string   
		Int     int      
		Float64 float64  
		Bool    bool     
	}

	TestUser struct {
		Name      string 
		Age       uint   
		LikeLemon bool   
	}
)

In the traditional way, we should do it like this:

package example

import (
	"bytes"
	"github.com/PuerkitoBio/goquery"
	"strconv"
)

func parsePartTypesLogically() (PartTypesStruct, error) {
	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(AllTypeHTML))
	partTypes := PartTypesStruct{}
	if err == nil {
		selection := doc.Find(partTypes.Root())
		partTypes.Slice = make([]int, 0)
		selection.Find(`ul > li`).Each(func(i int, selection *goquery.Selection) {
			Int, parseErr := strconv.Atoi(selection.Text())
			if parseErr != nil {
				err = parseErr
			}
			partTypes.Slice = append(partTypes.Slice, Int)
		})
		if err == nil {
			partTypes.Struct.Name = selection.Find(`#test > div > p:nth-child(1)`).Text()
			Int, parseErr := strconv.Atoi(selection.Find(`#test > div > p:nth-child(2)`).Text())
			if err = parseErr; err == nil {
				partTypes.Struct.Age = uint(Int)
				Bool, parseErr := strconv.ParseBool(selection.Find(`#test > div > p:nth-child(3)`).Text())
				if err = parseErr; err == nil {
					partTypes.Struct.LikeLemon = Bool

					String := selection.Find(`#test > p:nth-child(3)`).Text()
					Int, parseErr := strconv.Atoi(selection.Find(`#test > p:nth-child(4)`).Text())
					if err = parseErr; err != nil {
						return partTypes, err
					}

					Float64, parseErr := strconv.ParseFloat(selection.Find(`#test > p:nth-child(5)`).Text(), 0)
					if err = parseErr; err != nil {
						return partTypes, err
					}

					Bool, parseErr := strconv.ParseBool(selection.Find(`#test > p:nth-child(6)`).Text())
					if err = parseErr; err != nil {
						return partTypes, err
					}
					partTypes.String = String
					partTypes.Int = Int
					partTypes.Float64 = Float64
					partTypes.Bool = Bool
				}
			}
		}
	}
	return partTypes, err
}

It works pretty well, but is boring. And now, you can do it like this:

package main

import (
	"encoding/json"
	"fmt"
	"github.com/Hexilee/unhtml"
	"io/ioutil"
)

type (
	PartTypesStruct struct {
		Slice   []int    `html:"ul > li"`
		Struct  TestUser `html:"#test > div"`
		String  string   `html:"#test > p:nth-child(3)"`
		Int     int      `html:"#test > p:nth-child(4)"`
		Float64 float64  `html:"#test > p:nth-child(5)"`
		Bool    bool     `html:"#test > p:nth-child(6)"`
	}
	
	TestUser struct {
		Name      string `html:"p:nth-child(1)"`
		Age       uint   `html:"p:nth-child(2)"`
		LikeLemon bool   `html:"p:nth-child(3)"`
	}
)

func (PartTypesStruct) Root() string {
	return "#test"
}

func main() {
	allTypes := PartTypesStruct{}
	_ := unhtml.Unmarshal(AllTypeHTML, &allTypes)
	result, _ := json.Marshal(&allTypes)
	fmt.Println(string(result))
}

Result:

{
  "Slice": [
    0,
    1,
    2,
    3
  ],
  "Struct": {
    "Name": "Hexilee",
    "Age": 20,
    "LikeLemon": true
  },
  "String": "Hello World!",
  "Int": 10,
  "Float64": 3.14,
  "Bool": true
}

I think it can really improve the efficiency of my development, but what about its performance?

There are two benchmarks:

func BenchmarkUnmarshalPartTypes(b *testing.B) {
	assert.NotNil(b, AllTypeHTML)
	for i := 0; i < b.N; i++ {
		partTypes := PartTypesStruct{}
		assert.Nil(b, Unmarshal(AllTypeHTML, &partTypes))
	}
}

func BenchmarkParsePartTypesLogically(b *testing.B) {
	assert.NotNil(b, AllTypeHTML)
	for i := 0; i < b.N; i++ {
		_, err := parsePartTypesLogically()
		assert.Nil(b, err)
	}
}

Test it:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/Hexilee/unhtml
BenchmarkUnmarshalPartTypes-4        	   30000	     54096 ns/op
BenchmarkParsePartTypesLogically-4   	   30000	     45188 ns/op
PASS
ok  	github.com/Hexilee/unhtml	4.098s

Not very bad, in consideration of the small size of the demo HTML. In true development with more complicated HTML, their efficiency is almost the same.

Tips & Features

The only API this package exposes is the function:

func Unmarshal(data []byte, v interface{}) error

which is compatible with the standard library's json and xml. However, you can do some jobs with the data types in your code.

Types

This package supports all kinds of type in the reflect package except Ptr/Uintptr/Interface/Chan/Func.

The following fields are invalid and will cause UnmarshalerItemKindError.

type WrongFieldsStruct struct {
    Ptr *int
    Uintptr uintptr
    Interface io.Reader
    Chan chan int
    Func func()
}

However, when you call the function Unmarshal, you MUST pass a pointer, otherwise you will get an UnmarshaledKindMustBePtrError.

a := 1

// Wrong
Unmarshal([]byte(""), a)

// Right
Unmarshal([]byte(""), &a)

Root

Return the root selector.

You are only supported to define a Root() string method for the root type, like

func (PartTypesStruct) Root() string {
	return "#test"
}

If you define it for a field type, such as TestUser

func (TestUser) Root() string {
	return "#test"
}

In this case, in PartTypesStruct, the field selector will be covered.

type (
	PartTypesStruct struct {
		...
		Struct  TestUser `html:"#test > div"`
		...
	}
)

// real
type (
	PartTypesStruct struct {
		...
		Struct  TestUser `html:"#test"`
		...
	}
)

Selector

This package is based on github.com/PuerkitoBio/goquery and supports standard css selectors.

You can define selectors of a field in tags, like this:

type (
	PartTypesStruct struct {
	   ...
		Int     int      `html:"#test > p:nth-child(4)"`
		...
	}
)

In most cases, this package will find the #test > p:nth-child(4) element and try to parse its innerText as int.

However, when the field type is Struct or Slice, it will be more complex.

Struct
type (
	PartTypesStruct struct {
		...
		Struct  TestUser `html:"#test > div"`
		...
	}

	TestUser struct {
		Name      string `html:"p:nth-child(1)"`
		Age       uint   `html:"p:nth-child(2)"`
		LikeLemon bool   `html:"p:nth-child(3)"`
	}
)

func (PartTypesStruct) Root() string {
	return "#test"
}

First, it will call *goquery.Selection.Find("#test"), we get:

    <div id="test">
        <ul>
            <li>0</li>
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
        </div>
        <p>Hello World!</p>
        <p>10</p>
        <p>3.14</p>
        <p>true</p>
    </div>

Then, it will call *goquery.Selection.Find("#test > div"), we get

<div>
    <p>Hexilee</p>
    <p>20</p>
    <p>true</p>
</div>

Then, in TestUser, it will call

*goquery.Selection.Find("p:nth-child(1)") // as Name
*goquery.Selection.Find("p:nth-child(2)") // as Age
*goquery.Selection.Find("p:nth-child(3)") // as LikeLemon
Slice
type (
	PartTypesStruct struct {
		Slice   []int    `html:"ul > li"`		...
	}
)

func (PartTypesStruct) Root() string {
	return "#test"
}

As above, we get

    <div id="test">
        <ul>
            <li>0</li>
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
        </div>
        <p>Hello World!</p>
        <p>10</p>
        <p>3.14</p>
        <p>true</p>
    </div>

Then it will call *goquery.Selection.Find("ul > li"), we get

  <li>0</li>
  <li>1</li>
  <li>2</li>
  <li>3</li>

Then, it will call *goquery.Selection.Each(func(int, *goquery.Selection)), iterate the list and parse values for slice.

Tags

This package supports three tags, html, attr and converter

html

Provide the css selector of this field.

attr

By default, this package regards the innerText of a element as its value

<a href="https://google.com">Google</a>
type Link struct {
    Text string `html:"a"`
}

You will get Text = Google. However, what should we do if we want to get href?

type Link struct {
    Href string `html:"a" attr:"href"`
    Text string `html:"a"`
}

You will get link.Href == "https://google.com"

converter

Sometimes, you want to process the original data

<p>2018-10-01 00:00:01</p>

You may unmarshal it like this

type Birthday struct {
	Time time.Time `html:"p"`
}

func TestConverter(t *testing.T) {
	birthday := Birthday{}
	assert.Nil(t, Unmarshal([]byte(BirthdayHTML), &birthday))
	assert.Equal(t, 2018, birthday.Time.Year())
	assert.Equal(t, time.October, birthday.Time.Month())
	assert.Equal(t, 1, birthday.Time.Day())
}

Absolutely, you will fail, because you don't define the way it converts a string to time.Time. unhtml will regard it as a struct.

However, you can use converter

type Birthday struct {
    Time time.Time `html:"p" converter:"StringToTime"`
}

const TimeStandard = `2006-01-02 15:04:05`

func (Birthday) StringToTime(str string) (time.Time, error) {
	return time.Parse(TimeStandard, str)
}

func TestConverter(t *testing.T) {
	birthday := Birthday{}
	assert.Nil(t, Unmarshal([]byte(BirthdayHTML), &birthday))
	assert.Equal(t, 2018, birthday.Time.Year())
	assert.Equal(t, time.October, birthday.Time.Month())
	assert.Equal(t, 1, birthday.Time.Day())
}

Make it.

The type of converter MUST be

func (inputType) (resultType, error)

resultType MUST be the same as the field type, and they can be any type.

inputType MUST NOT violate the requirements in Types.