golang/go

encoding/json: allow per-Encoder/per-Decoder registration of marshal/unmarshal functions

rsc opened this issue ยท 49 comments

rsc commented
For example, if a user wants to marshal net.IP with custom code, we should provide a way
to do that, probably a method on *Encoder. Similarly for *Decoder.

Same for encoding/xml.

Comment 1:

Labels changed: removed go1.3.

rsc commented

Comment 2:

Labels changed: added go1.3maybe.

rsc commented

Comment 3:

Labels changed: added release-none, removed go1.3maybe.

rsc commented

Comment 4:

Labels changed: added repo-main.

I did some work on this sometime back in October, with CL https://go-review.googlesource.com/c/31091 to get the conversation started.
Perhaps we can work on it for Go1.9.

In there I introduced Encoder.RegisterEncoder

func (enc *Encoder) RegisterEncoder(t reflect.Type, fn func(interface{}) ([]byte, error))

and Decoder.RegisterDecoder

func (dec *Decoder) RegisterDecoder(t reflect.Type, fn func([]byte) (interface{}, error))

We have a similar requirement for custom serialization. Our specific use cases are:

  1. Data anonymization. For example, omit IP addresses in the JSON output.
  2. Data masking. For example, omit specific fields depending on user privileges.
  3. Omit empty values (e.g. empty strings). This is useful to generate compact documents.
    In all three cases, the decision to omit fields is done at runtime with contextual information.

We are using CiscoM31@1e9514f. In our case, the client interface is implemented by doing a simple lookup in a map, so there is no need to register hundreds of custom marshaller (we have lots of structs).
We could have used a technique similar to https://go-review.googlesource.com/c/31091.

@rsc is there still interest in this feature on your end? I have a use-case for it and would be happy to implement it.

[Edit 1] Spelling
[Edit 2]

I wonder if the function signature should be

func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error))

with all standard Encoders exposed so that users can leverage them. Example redactor:

package main

import (
	"bytes"
	"encoding/json"
	"os"
	"reflect"
	"strings"
	"unicode/utf8"
)

func main() {
	enc := json.NewEncoder(os.Stdout)
	text := "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), StringMarshaller)
	enc.Encode(text)
	// Output
	// "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), func(value reflect.Value) ([]byte, error) {
		return StringMarshaller(reflect.ValueOf(strings.Replace(value.String(), "foo", "[REDACTED]", -1)))
	})
	enc.Encode(text)
	// Output
	// "My password, [REDACTED], is totally secure"
}

// Largely taken from `func (e *encodeState) `string(s string, escapeHTML bool)` in `encoding/json/encode.go`
// This would exist in encoding/json.
func StringMarshaller(value reflect.Value) ([]byte, error) {
	e := bytes.Buffer{}
	s := value.String()
	escapeHTML := false // TODO: Refactor StringEncoder into a 'htmlEscaping' one and a non 'htmlEscaping' one.

	e.WriteByte('"')
	start := 0
	for i := 0; i < len(s); {
		if b := s[i]; b < utf8.RuneSelf {
			if json.HTMLSafeSet[b] || (!escapeHTML && json.SafeSet[b]) {
				i++
				continue
			}
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteByte('\\')
			switch b {
			case '\\', '"':
				e.WriteByte(b)
			case '\n':
				e.WriteByte('n')
			case '\r':
				e.WriteByte('r')
			case '\t':
				e.WriteByte('t')
			default:
				// This encodes bytes < 0x20 except for \t, \n and \r.
				// If escapeHTML is set, it also escapes <, >, and &
				// because they can lead to security holes when
				// user-controlled strings are rendered into JSON
				// and served to some browsers.
				e.WriteString(`u00`)
				e.WriteByte(json.Hex[b>>4])
				e.WriteByte(json.Hex[b&0xF])
			}
			i++
			start = i
			continue
		}
		c, size := utf8.DecodeRuneInString(s[i:])
		if c == utf8.RuneError && size == 1 {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\ufffd`)
			i += size
			start = i
			continue
		}
		// U+2028 is LINE SEPARATOR.
		// U+2029 is PARAGRAPH SEPARATOR.
		// They are both technically valid characters in JSON strings,
		// but don't work in JSONP, which has to be evaluated as JavaScript,
		// and can lead to security holes there. It is valid JSON to
		// escape them, so we do so unconditionally.
		// See http://timelessrepo.com/json-isnt-a-javascript-subset for discussion.
		if c == '\u2028' || c == '\u2029' {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\u202`)
			e.WriteByte(json.Hex[c&0xF])
			i += size
			start = i
			continue
		}
		i += size
	}
	if start < len(s) {
		e.WriteString(s[start:])
	}
	e.WriteByte('"')

	return e.Bytes(), nil
}
dsnet commented

I have a use-case for this as well, but it's a bit specialized. Essentially:

  1. There is a type that I own (as a library author) that is used by many users where they pass it through encoding/json when they shouldn't.
  2. I would like to make changes to my type, but it breaks these users. I have no intention to support the use of this type with encoding/json, but I can't just break these users since there are are sufficient number of them.
  3. I would like to use the feature proposed here to migrate these users to an implementation of marshal/unmarshal that preserves the exact behavior as it exists today (even if its buggy and wrong).
  4. This now frees me to make the changes I want to make.

For prior art, the cmp package has a Comparer option that allows the caller to override the comparison of any specific type in the tree. The ability for users to specify custom comparisons has proven itself to be immensely powerful and flexible.

Even though I'd like to have something like this, I do have some concerns:

  • It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.
  • A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.
  • What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?

Thanks for the response! Some comments below:

It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.

I'd have to run a benchmark but I would expect a map lookup to be fairly quick. Perhaps others are concerned with a different scale of "slow" than I.

A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.

Interesting. I hadn't considered supporting interfaces. At the moment I only need concrete type overrides but if there are users out there that would benefit from an interface check I'd be willing to at least prototype it.

[Edit] Perhaps we could rename RegisterMarshaller to RegisterConcreteMarshaller and introduce RegisterInterfaceMarshaller later if users needed it?

What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?

This actually came up in a discussion with a colleague. My preference would be to require users that want to custom marshal both T and *T to have to declare both marshaler overrides. You could use a single custom marshaler but would require both overrides.

...
var foo = ""
enc.RegisterMarshaller(reflect.TypeOf(foo), StringMarshaller)
enc.RegisterMarshaller(reflect.TypeOf(&foo), StringMarshaller)
...

func StringMarshaller(value reflect.Value) ([]byte, error) {
    ... check if value is a *String, if so deref ...
    ...
}

Alternately perhaps we modify the signiture of RegisterMarshaller to be something like:

func (enc *Encoder) RegisterMarshaller(reflect.Type, bool, func(reflect.Value) ([]byte, error))

where, if the bool was "true" it would match both T and *T and if it was false it would only match T or *T, depending on the passed in type. Looks a bit yucky to me though could be made to look better if we used the optional pattern.

@dsnet I imagine you might be busy with the holidays but I wanted to give you a friendly ping on this. Any thoughts on my response?

Change https://golang.org/cl/212998 mentions this issue: encoding/json: implement type override for serialization

dsnet commented

While the logic certainly uses reflect under the hood, I'm not sure if we should expose that in the public API. I propose the following API instead:

// RegisterFunc registers a custom encoder to use for specialized types.
// The input f must be a function of the type func(T) ([]byte, error).
//
// When marshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered encoders that operate on concrete types,
// then registered encoders that operate on interface types
// in the order that they are registered, then the MarshalJSON method, and
// lastly the default behavior of Encode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (e *Encoder) RegisterFunc(f interface{})

// RegisterFunc registers a custom decoder to use for specialized types.
// The input f must be a function of the type func([]byte, T) error.
//
// When unmarshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered decoders that operate on concrete types,
// then registered decoders that operate on interface types
// in the order that they are registered, then the UnmarshalJSON method, and
// lastly the default behavior of Decode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (d *Decoder) RegisterFunc(f interface{})

Arguments for this API:

  • Most users of this API probably do not want to deal with reflect.Type and reflect.Value, but rather want to deal with concrete types. For this reason, json.Unmarshal takes in an interface{} rather than an reflect.Value.
  • Using interface{} looses type safety, but the other proposed signatures also have type safety issues. For a signature like: func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error)), we still can't statically guarantee that t is the same type as the type that function f expects as it's input.
  • It matches what's done in other API that also have type-safety issues. For example, sort.Slices takes in an interface{} with the requirement that it be a slice kind, rather than a reflect.Value.

The proposed API combined with the ability to handle interfaces, allows you to do something like:

e := json.NewEncoder(w)
e.RegisterFunc(protojson.Marshal)
e.Encode(v)

where protojson.Marshal is a function that matches the expected function signature. It enables the standard encoding/json package to now be able to properly serialize all proto.Message types without forcing or expecting every concrete proto.Message type to implement the json.Marshaler interface.

Since this is something I need for my other work, I uploaded a CL with my prototype implementation that I've been testing with actual code.

Some additional thoughts:

  • One concern is that json.Decoder.DisallowUnknownFields specifies a top-level option that should in theory affect the results of all unmarshal operations recursively. It is already a problem today that custom json.Unmarshaler implementations do not properly respect this option (and has actually been a real problem). It's unfortunate that we're adding another way where options are not properly propagated downward. However, the severity of the problem is not as bad as custom json.Unmarshaler implementations since the location where options are specified is most likely also the place where custom encoder functions are registered, so options can be properly replicated at the top-level.
  • The implementation checks for custom functions to handle the current value type first, and if the value is addressable, also tries checking for a pointer to that value type. This matches the behavior of encoding/json for how it handles checking for json.Marshaler and json.Unmarshaler implementations.
  • The implementation does not call decoder functions if the receiver type is a pointer/interface and the input JSON is null. It also does not call encoder functions if the receiver type is a pointer/interface and is nil. This matches the behavior of encoding/json for how it handles json.Marshaler and json.Unmarshaler implementations.
  • I haven't benchmarked it yet, but expect decoding to take minimal performance hit and encoding to take no performance hit when there are no custom encoders/decoders.
  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.
dsnet commented

Adding Proposal label as there are at least two proposed APIs and this issue proposes added API to encoding/json.

dsnet commented

This proposal can be used to give users to ability to address the following issues on their own:

  • #10275: encoding: make stdlib types implement TextMarshaler/BinaryMarshaler consistently
  • #21990: encoding/json: support struct tag for time.Format in JSON Marshaller/Unmarshaller
  • #29678: net: add MarshalText/UnmarshalText to HardwareAddr
  • #33564: crypto/ecdsa: make PublicKey implement json.Unmarshaler and json.Marshaler

Overall, I've been concerned with the proliferation of magic methods where more and more types have MarshalJSON, UnmarshalJSON, MarshalText, or UnmarshalText methods attached. As the maintainer of a number of widely used libraries, I've been pushing back against users who ask for these methods to be added. A type override gives users the flexibility to do what they want, rather than forcing library authors to be responsible for maintaining these methods.

rsc commented

Why is it RegisterFunc and not Register?

rsc commented

Overall this seems reasonable to me. I would push back a little about not needing, say, MarshalText on types. It's good if types can create their own good representations by default when that makes sense. I would hope the per-Encoder registration would be necessary for exceptions, not common cases.

For example even if we had this registration it would still seem good to accept proposals like #33564.

dsnet commented

Why is it RegisterFunc and not Register?

Could be either. I don't feel strong about this name.

rsc commented

Let's use json.Register like gob.Register (different kind of argument but same idea).
Otherwise this seems like a likely accept.

dsnet commented

To clarify, are you suggesting that the type-override registration be a top-level Register function or are you merely suggesting that the Decoder and Encoder methods be named Register out of consistency with gob.Register?

rsc commented

@dsnet, only that the method on Decoder/Encoder be called Register.
It will be clear at call sites that a func is being passed; it doesn't add much to call it RegisterFunc.

rsc commented

No change in consensus, so accepting.

mvdan commented

@dsnet ended up sending a CL for this just weeks after I had self-assigned it; see https://go-review.googlesource.com/c/go/+/212998. I'm not sure why the bot didn't pick it up.

This is good news, though, as it seems like a proof of concept is implemented and we could have it in 1.16. I'm noting this here so more eyes can help review it.

mvdan commented

I started reviewing the change, and got surprised by the support for interfaces. @dsnet wrote:

A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.

How was the decision made? I'm looking at decodeState.indirect in the CL, an I'm not a fan of the nested loop in the rather hot function. The godoc has no warning about the map lookup vs linear search penalty cost. I can imagine that registering ten custom interface marshalers could be surprisingly slower than registering ten concrete type marshalers, and the API doesn't give that impression.

There were a few use cases mentioned in this thread: net.IP of type []byte, time.Time of type struct, ecdsa.PublicKey of type struct, and net.HardwareAddr of type []byte. @dsnet also talked about types that "they own", and @harrisonhjones also mentions he just needs concrete type matching. Does anyone have a compelling use case for "assignable to" interface matching?

If the types can't be known at build time, or otherwise can't be matched in a concrete way, then I suggest that we support a single func([]byte, interface{}) fallback to be used if none of the concrete type funcs match. This won't be as powerful, but it will also remove the expensive linear search with go/types.Type.AssignableTo.

In any case, I suggest that we first implement this feature without support for interface types. After one Go release with support for concrete types, the users can tell us if it really is not powerful enough.

dsnet commented

At Google, we have https://go-review.googlesource.com/c/go/+/212998 imported as a separate fork that adventurous users are already using. The ability to specify type-override based on an interface is widely used. For our cases, it is typically used to capture all types that implement a proto.Message interface, but it's also used for other cases where a type doesn't have natural JSON support, and an interface match allows capturing types based on the public API (as specified by an interface) and to provide custom serialization for that type (based on the methods of that interface).

func([]byte, interface{}) fallback to be used if none of the concrete type funcs match. This won't be as powerful, but it will also remove the expensive linear search with go/types.Type.AssignableTo.

There are ways to implement the linear search in O(1) with type caches, I didn't implement that here because it seems like premature optimization for the case where there are many type-overrides given that are all interfaces (expected to be rare).

Also, a func([]byte, interface{}) fallback is 1) less performant since it now needs to be called for every node in the tree, rather than just named types that match the interface, 2) does not compose well with disjoint sets of interface types, 3) needs to also provide an API for saying; "thanks for calling me, but I actually don't know how to serialize this specific type; please try something else".

mvdan commented

The ability to specify type-override based on an interface is widely used.

And would using concrete types be a big problem in those two use cases you mention? Perhaps my uses of protobuf haven't been at "google scale", but I can imagine that registering up to a dozen or so concrete types should be more than enough for the majority of programs.

I didn't implement here because it seems like pre-mature for the case where there are many type-overrides given that are all interfaces.

Can you elaborate a bit? I'm not quite seeing how that cache would work in constant time. If we're confident that this can be made performant in the future, then I'm not as worried about adding both features at once.

Also, a func([]byte, interface{}) fallback is [...]

I do realise that a single fallback is less powerful, and perhaps less intuitive too. I brought it up because it seemed like the current design was bound to be a fairly expensive linear search by design.

mvdan commented

I also think that shipping support for concrete types alone in 1.16 would still be very useful. Googlers have told you what they think of both features, but from what you've said, they haven't tried using concrete types alone. And there isn't an easy way for developers outside of Google to do the same kind of testing (and very few people will go through the trouble unless we ship it in a release), so I fear that the feedback doesn't represent Go use cases outside of Google.

dsnet commented

Googlers have told you what they think of both features, but from what you've said, they haven't tried using concrete types alone

Concrete overrides are used to. I didn't mention them since the question I was responding to was about interface overrides.

And there isn't an easy way for developers outside of Google to do the same kind of testing (and very few people will go through the trouble unless we ship it in a release), so I fear that the feedback doesn't represent Go use cases outside of Google.

Is there something specific that you are proposing? More users of the feature would be great. For users that only care about concrete type overrides, the existence of the interface type override doesn't affect them if they don't use it. Am I missing something?

mvdan commented

Is there something specific that you are proposing?

Yes; perhaps I wasn't very clear. Releasing 1.16 with just support for concrete types, not interface types. A smaller and less powerful feature, but also simpler and faster. And we try to measure how many people find it useful versus how many people still want support for interface types.

If both are available at once, it's likely users will jump straight to interface types without thinking whether using concrete types would be enough. Which is also why the performance difference worries me.

dsnet commented

(apologies for responding out-of-order; for some reason my browser didn't render the earlier comment)

And would using concrete types be a big problem in those two use cases you mention?

Yes. For some cases, it would require registering dozens (if not hundreds) of concrete types. Even worse, it requires that the user of manually keep the code that declares their types constantly in sync with where those types are used with JSON serialization.

Can you elaborate a bit? I'm not quite seeing how that cache would work in constant time. If we're confident that this can be made performant in the future, then I'm not as worried about adding both features at once.

You can imagine a sync.Map that is functionally a map[reflect.Type]bool. It caches the answer to "will type T implement the set of registered interface type overrides?" If T is in the cache, then the query is O(1), if not, then we perform a linear lookup over all registered interface type overrides (which I contend is usually close to 0).

The challenge with the implementation is that the specific cache we want to use is itself dependent on the set of possible interface types that we may want to match against. Since I don't see having many interface type overrides being a common pattern, I didn't want to implement this for now.

dsnet commented

If both are available at once, it's likely users will jump straight to interface types without thinking whether using concrete types would be enough. Which is also why the performance difference worries me.

Can you explain how this would be the case? If there was a concrete type T that you wanted to provide an override for, then it seems natural that you would register a concrete type override. It seems odd to me to reach for an interface override unnecessarily for that case.

For example, let's say the user wants to provide a type override for time.Time, what interface would they be using to match that type? It seems more natural that they would register a func(t time.Time) ([]byte, error). Also, it would be easier to implement since working with concrete types is easier than interface types (where you're limited to the methods in the interface).

  1. needs to also provide an API for saying; "thanks for calling me, but I actually don't know how to serialize this specific type; please try something else".

I'd imagine this is necessary either way if you're supporting interfaces.

dsnet commented

I'd imagine this is necessary either way if you're supporting interfaces.

Can you provide an example?

mvdan commented

You can imagine a sync.Map that is functionally a map[reflect.Type]bool. It caches the answer to "will type T implement the set of registered interface type overrides?" If T is in the cache, then the query is O(1), if not, then we perform a linear lookup over all registered interface type overrides (which I contend is usually close to 0).

I have to admit I'm still not getting this. Such a map will only tell you if a type T has a match in the set of interfaces, but not which interfaces T matches, right? You'd still have to check them one by one. This is better for the case where T matches none of the interfaces, but still, the feature doesn't seem O(1) to me.

For some cases, it would require registering dozens (if not hundreds) of concrete types. Even worse, it requires that the user of manually keep the code that declares their types constantly in sync with where those types are used with JSON serialization.

I was imagining this would be for code-generated types (since you mentioned Protobuf), so I was going to suggest code generating the registration of concrete types too. Though, the more I think about the counter-argument, the less I like it.

Can you explain how this would be the case?

I was thinking of a scenario where an interface only has a few implementations which are all known at compile time. From a performance perspective, it's likely better to register the concrete types with a shared piece of code. I think you can ignore this point if you're confident that we can implement interface support without it being too expensive for reasonable use cases.

Side note: if you intend the CL's performance to remain as-is, I wonder if we should document that we don't recommend registering more than a handful of interface types on a single encoder or decoder.

dsnet commented

I have to admit I'm still not getting this. Such a map will only tell you if a type T has a match in the set of interfaces, but not which interfaces T matches, right? You'd still have to check them one by one. This is better for the case where T matches none of the interfaces, but still, the feature doesn't seem O(1) to me.

A simple modification of the cache is to use a map[reflect.Type]int where the int is the index of the interface that we care about.

Side note: if you intend the CL's performance to remain as-is, I wonder if we should document that we don't recommend registering more than a handful of interface types on a single encoder or decoder.

Sounds reasonable.

mvdan commented

Thanks for adding the context on the design and the thoughts on performance. I'll go back to Gerrit now :)

For what it's worth: I just reviewed https://go-review.googlesource.com/c/go/+/212998 and it looks good to me. Thanks for the work @dsnet

Thank you everyone, and Happy New Year! It is late in the cycle, and didnโ€™t get much movement on the CL, and @mvdan left comments on it, though @dsnet has been super busy. With that, I am punting this to Go1.17.

This issue is currently labeled as early-in-cycle for Go 1.17.
That time is now, so a friendly reminder to look at it again.

dsnet commented

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.

I'm very late to the party, but I believe generics would probably be fine here, might reduce the amount of reflection needed, and like you said would enable greater type safety for the proposed API.

As for the type parameterization, I thought up something like this.

// Encoder
func MarshalIP(ip *net.IP) ([]byte, error) { ... }
func Register[T any](f func(T) ([]byte, error)) { ... }

// Decoder
func UnmarshalIP(data []byte) (*net.IP, error) { ... }
func Register[T any](f func([]byte) (T, error)) { ... }
// or
func UnmarshalIP(data []byte, ip *net.IP) error { ... }
func Register[T any](f func([]byte, T) error) { ... }

You'd obviously still need reflection to get the typeof out of T, but at least then we can ensure that what's being passed is of type func during compilation. You could do this by getting the type of f, and then pulling the parameter T out of that, or, you could just make a new T and throw that into reflect.TypeOf.

neild commented

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

How's this going?

If this doesn't conflict with the overall package direction, it'd be nice to try to get this in for 1.18.

dsnet commented

I apologize, I got side-tracked from json work quite a bit the past few months, but starting to get back into it more.

dsnet commented

In implementing this for v2. I think we may want to defer on this feature until the release of generics.

I propose the following alternative API:

// Marshalers is a list of functions that each marshal a specific type.
// A nil Marshalers is equivalent to an empty list.
// For performance, marshalers should be stored in a global variable.
type Marshalers struct { ... }

// NewMarshalers constructs a list of functions that marshal a specific type.
// Functions that come earlier in the list take precedence.
func NewMarshalers(...*Marshalers) *Marshalers

// MarshalFunc constructs a marshaler that marshals values of type T.
func MarshalFunc[T any](fn func(T) ([]byte, error)) *Marshalers

// Unmarshalers is a list of functions that each unmarshal a specific type.
// A nil Unmarshalers is equivalent to an empty list.
type Unmarshalers struct { ... }

// NewUnmarshalers constructs a list of functions that unmarshal a specific type.
// Functions that come earlier in the list take precedence.
// For performance, unmarshalers should be stored in a global variable.
func NewUnmarshalers(...*Unmarshalers) *Unmarshalers

// UnmarshalFunc constructs an unmarshaler that unmarshals values of type T.
func UnmarshalFunc[T any](fn func([]byte, T) error) *Unmarshalers

// WithMarshalers configures the encoder to use any marshaler in m
// that operate on the current type that the encoder is marshaling.
func (*Encoder) WithMarshalers(m *Marshalers)

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the encoder is marshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

There are several advantages of this API:

  • Performance. Without generics, the implementation must go through reflect.Value.Call. Empirical testing using the tip implementation of generics shows that it is 4x faster than an implementation using reflect.Value.Call.
  • Type safety. The API that I proposed earlier provides no type safety to ensure that users are passing functions of the right signature.
  • Composability. Marshalers and Unmarshalers can be composed with one another. For example:
    m1 := NewMarshalers(f1, f2)
    m2 := NewMarshalers(f0, m1, f3) // equivalent to NewMarshalers(f0, f1, f2, f3)
  • Immutability. Marshalers and Unmarshalers are immutable once constructed. This allows the implementation to perform aggressive caching that would be complicated by the list of marshalers/unmarshalers changing underfoot. This would allow further performance benefits that my implementation for v1 fails to unlock.

Example usage:

var protoMarshalers = json.MarshalFunc(protojson.Marshal)

enc := json.NewEncoder(...)
enc.WithMarshalers(protoMarshalers)
... = enc.Decode(...)

Footnotes:

  • Credit goes to @rogpeppe for realizing that generics provides a significant performance benefit over Go reflection in this case.
  • The experiment was compiled using GOEXPERIMENT=unified since the compiler crashes otherwise.

\cc @mvdan

neild commented

Generics are arriving in 1.18. Should we add this (with generics) in 1.18, or delay a release?

dsnet commented

I keep forgetting that generics is coming imminently in 1.18. Assuming newly proposed API is acceptable, I support a release for 1.18.

In the event that generics is delayed, we could go with the following signatures:

func MarshalFunc(fn interface{}) *Marshalers
func UnmarshalFunc(fn interface{}) *Unmarshalers

It must rely on Go reflection for type safety at runtime. We could add a corresponding go vet or go build check that statically enforces type-safety so that we can switch it to the generic version in a backwards compatible way.

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the encoder is marshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

Using a Decoder adds ~10 additional allocs and ~doubles memory consumption for callers which already have []byte data over use of json.Unmarshal, due to the Decoder only accepting a reader and then immediately reading the data into an internal buffer. Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like DisallowUnknownFields/UseNumbers) accessible without such a significant performance penalty?

dsnet commented

Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like > DisallowUnknownFields/UseNumbers) accessible without such a significant performance penalty?

There's been work on what a theoretical v2 json could look like (if ever). That API supports setting options apart from the Decoder. Some or all of those ideas could be back-ported into the current json package.

Hi! Golang version 1.18 with generics is already out. Are there any updates now?

dsnet commented

Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.

See the "Caller-specified customization" section of the discussion.
For concrete examples, see the examples under the WithMarshalers and WithUnmarshalers options.