gosax

gosax is a read-only Go library for XML SAX (Simple API for XML) parsing. It streams XML efficiently and with low memory use, borrowing techniques from quick-xml, pkg/json, and memchr.

Features

  • Read-only SAX parsing: Stream and process XML documents without loading the entire document into memory.
  • Efficient parsing: Utilizes techniques inspired by quick-xml and pkg/json for high performance.
  • SWAR (SIMD Within A Register): Optimizations for fast text processing, inspired by memchr.
  • Compatibility with encoding/xml: Includes utility functions to bridge gosax types with encoding/xml types, facilitating easy integration with existing code that uses the standard library.
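To give a feel for the SWAR idea mentioned above, here is a minimal, standalone sketch of a byte search that inspects eight bytes per step using ordinary 64-bit integer arithmetic. It is an illustration of the general technique only, not gosax's actual implementation; the function name indexByteSWAR and the constants are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo = 0x0101010101010101 // 0x01 repeated in every byte
	hi = 0x8080808080808080 // 0x80 repeated in every byte
)

// indexByteSWAR (hypothetical name) returns the index of the first
// occurrence of c in b, or -1 if c is not present.
func indexByteSWAR(b []byte, c byte) int {
	pattern := lo * uint64(c) // c broadcast into all eight bytes
	i := 0
	for ; i+8 <= len(b); i += 8 {
		// XOR turns every byte equal to c into 0x00.
		x := binary.LittleEndian.Uint64(b[i:]) ^ pattern
		// Zero-byte test: the lowest set bit of m marks the first zero byte.
		if m := (x - lo) &^ x & hi; m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Fewer than eight bytes remain: fall back to a plain loop.
	for ; i < len(b); i++ {
		if b[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	doc := []byte(`<root><element>Value</element></root>`)
	fmt.Println(indexByteSWAR(doc, '>')) // 5
	fmt.Println(indexByteSWAR(doc, 'V')) // 15
	fmt.Println(indexByteSWAR(doc, '?')) // -1
}
```

In everyday code, bytes.IndexByte from the standard library is the practical choice; the sketch only shows why scanning for delimiters such as < and > can be done a word at a time.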

Benchmark

goos: darwin
goarch: arm64
pkg: github.com/orisano/gosax
BenchmarkReader_Event-12    	       5	 211845800 ns/op	1103.30 MB/s	 2097606 B/op	       6 allocs/op
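As a reading aid for the line above: Go's testing package derives the MB/s column from the bytes processed per operation and the time per operation, so the two reported figures imply the approximate input size. The sketch below inverts that relation; the helper name impliedBytesPerOp is hypothetical, and the result is an estimate from rounded figures, not a stated property of the benchmark input.

```go
package main

import "fmt"

// impliedBytesPerOp (hypothetical helper) estimates the bytes processed per
// operation from the ns/op and MB/s columns, where 1 MB = 1e6 bytes.
func impliedBytesPerOp(nsPerOp, mbPerSec float64) float64 {
	seconds := nsPerOp / 1e9
	return mbPerSec * 1e6 * seconds
}

func main() {
	b := impliedBytesPerOp(211845800, 1103.30)
	fmt.Printf("about %.1f MB per operation\n", b/1e6)
}
```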

Installation

To install gosax, use go get:

go get github.com/orisano/gosax

Usage

Here is a basic example of how to use gosax to parse an XML document:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/orisano/gosax"
)

func main() {
	xmlData := `<root><element>Value</element></root>`
	reader := strings.NewReader(xmlData)

	r := gosax.NewReader(reader)
	for {
		e, err := r.Event()
		if err != nil {
			log.Fatal(err)
		}
		if e.Type() == gosax.EventEOF {
			break
		}
		fmt.Println(string(e.Bytes))
	}
	// Output:
	// <root>
	// <element>
	// Value
	// </element>
	// </root>
}

Bridging with encoding/xml

Important Note for encoding/xml Users:

When migrating from encoding/xml to gosax, note that self-closing tags are handled differently. To mimic encoding/xml behavior, set gosax.Reader.EmitSelfClosingTag to true. This ensures self-closing tags are recognized and processed correctly.
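The encoding/xml side of that difference can be checked with the standard library alone: its decoder reports a self-closing tag as a StartElement immediately followed by an EndElement. The sketch below prints the token sequence that existing code may be relying on; tokenKinds is a hypothetical helper, and it does not exercise gosax itself.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds (hypothetical helper) returns a short label for each start,
// end, and character-data token that encoding/xml produces for doc.
func tokenKinds(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var kinds []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			kinds = append(kinds, "start:"+t.Name.Local)
		case xml.EndElement:
			kinds = append(kinds, "end:"+t.Name.Local)
		case xml.CharData:
			kinds = append(kinds, "chardata")
		}
	}
}

func main() {
	kinds, err := tokenKinds(`<root><item/></root>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(kinds) // [start:root start:item end:item end:root]
}
```

Code that pairs every start token with an end token, for example to track nesting depth, depends on this behavior, which is why the EmitSelfClosingTag setting matters when migrating.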

Using TokenE

If you are used to encoding/xml's Token, start with gosax.TokenE. Note that gosax.TokenE and gosax.Token allocate memory, because each token is returned as an interface value.

Before:

var dec *xml.Decoder
for {
	tok, err := dec.Token()
	if err == io.EOF {
		break
	}
	// ...
}

After:

var dec *gosax.Reader
for {
	tok, err := gosax.TokenE(dec.Event())
	if err == io.EOF {
		break
	}
	// ...
}

Utilizing xmlb

xmlb is an extension for gosax to simplify rewriting code from encoding/xml. It provides a higher-performance bridge for XML parsing and processing.

Before:

var dec *xml.Decoder
for {
	tok, err := dec.Token()
	if err == io.EOF {
		break
	}
	switch t := tok.(type) {
	case xml.StartElement:
		// ...
	case xml.CharData:
		// ...
	case xml.EndElement:
		// ...
	}
}

After:

var dec *xmlb.Decoder
for {
	tok, err := dec.Token()
	if err == io.EOF {
		break
	}
	switch tok.Type() {
	case xmlb.StartElement:
		t, _ := tok.StartElement()
		// ...
	case xmlb.CharData:
		t, _ := tok.CharData()
		// ...
	case xmlb.EndElement:
		t := tok.EndElement()
		// ...
	}
}

License

This library is licensed under the terms specified in the LICENSE file.

Acknowledgements

gosax is inspired by projects and resources including quick-xml, pkg/json, and memchr.

Contributing

Contributions are welcome! Please fork the repository and submit pull requests.

Contact

For any questions or feedback, feel free to open an issue on the GitHub repository.