pinterest/elixir-thrift

Use atoms for enums

Opened this issue · 23 comments

Currently Enum values are represented by integers in Elixir.

enum Color { RED = 1, GREEN = 2, BLUE = 3 }

struct Pixel {
  1: optional Color color
}
require Color
assert Color.RED == 1
pixel = %Pixel{color: Color.RED}

# Is equivalent to:
pixel = %Pixel{color: 1}

case pixel.color do
  Color.RED -> :ok
end

It seems like it would be nicer if enums were represented by atoms.

pixel = %Pixel{color: :RED}

case pixel.color do
  :RED -> :ok
end

Pros

  1. If you inspect/print a value, it's confusing to see a number and have to go find the thrift definition and map the number back into a name that you can reason about. An atom is directly readable.
> IO.inspect user
%User{status: 7}           # what is that?

...versus...

> IO.inspect user
%User{status: :INACTIVE}   # OK makes sense
  1. Less generated code, faster compile times.

  2. You don't have to require a generated enum module to use the values.

  3. It's more Elixir-y, I think.

Cons

  1. No longer catches invalid enum values. However it didn't catch enums being used in the wrong context, so we need Dialyzer annotations either way for that.

It's more Elixir-y, I think.

Definitely more idiomatic for Elixir.

Sorry I missed this. This would definitely break production code for me :(

The value sent over the wire is always going to be an integer. In a lot of cases we do something along the lines of

  • Grab user_type_id from database
  • Stuff user_type_id into thrift struct
  • Send it over the wire

With the values being integers, if we trust the database (in our case, we do), we never have to do anything other than copy the value directly into the thrift struct.

If nothing else, this would definitely be a breaking change for some people.

I've honestly gone back and forth on this a few times. Even before using thrift, and before we started using Ecto (:older_man:), I would have situations where I pulled something from a database that was essentially an enumerated field. I've tried this both ways a few times. Thrash actually had enums as atoms and I was skeptical when you guys implemented them as integers at first, but after the initial migration I've started to actually prefer it that way.

Some observations, which I present in the humblest way possible:

  • Fundamentally, I think enums are "enumerated values". From that standpoint, the atoms we're using are the names of the values, not the values themselves.
  • In other languages, I believe they treat them as integers. It's sort of based around the way that enums work in C/C++.
  • The debugging case could potentially be addressed by creating an inspect implementation so that we display the atom on screen.
  • Using atoms means you need to introduce a layer to do the translation between integer and atom.
  • Consider the pass-through case - service A -> service B -> service C. If service B doesn't care about the business logic related to the enum names, it's going to end up doing a translation that it wouldn't otherwise need to do.
  • Related to the pass-through case, the conversion layer adds a maintenance burden. If we add new values to the enum, we need to upgrade service B so that it knows how to do the translation. I think this is sort of un-thrift-y. If they are integer values, service B doesn't care. We take advantage of this in our code a fair amount.

Hopefully this doesn't seem confrontational or objection-y - it's just a topic I've already put a fair amount of thought into, and I'm fairly overloaded so I'm trying to brain dump :/ Totally willing to discuss, and ultimately, though I really appreciate you guys including me as a contributor, it's you guys doing the lion's share of the work so I won't put up too much of a fight.

I think it's particularly helpful to keep in mind that, in Thrift, enumerated values are always integers (i32), as they are in C. We don't need to design a solution that allows for other value types.

One bit of prior art: https://github.com/gjaldon/ecto_enum

This only supports PostgreSQL, and that database exposes them as strings over the wire to user's applications. So when selecting an enum column the value would be a string and no number attached. I say number because postgreSQL uses float32 IIRC, for ordering and uniqueness.

To clarify - I think a lot of the issues I've had in the past related to this kind of thing stem from the difference in type between the name and the value, not necessarily the values having different types (they're almost always integers). My previous comment enumerates some specifics, but I think it mostly stems from that fundamental difference. The values are integers, the act of enumerating them is to give them names. When you try to treat the values as their names, things seem to get funky.

The debugging case could potentially be addressed by creating an inspect implementation so that we display the atom on screen.

It's nice to keep the inspect pattern so that it easy to copy and paste. We could support inspecting with atoms but then we may want to support serializing from atoms as well as integers but always deserialize to integers (assuming we kept existing behavior).

Using atoms means you need to introduce a layer to do the translation between integer and atom.

Consider the pass-through case - service A -> service B -> service C. If service B doesn't care about the business logic related to the enum names, it's going to end up doing a translation that it wouldn't otherwise need to do.

Thats true. Note that the consecutive integers -> atoms is unbelievable fast in the VM, and atoms -> integers is also very fast but not quite as fast. The memory usage is the same and both are immediate values. The difference should be small enough that we would pick the better behavior in terms of usability:

Related to the pass-through case, the conversion layer adds a maintenance burden. If we add new values to the enum, we need to upgrade service B so that it knows how to do the translation. I think this is sort of un-thrift-y. If they are integer values, service B doesn't care. We take advantage of this in our code a fair amount.

This suggests that we would want to deserialize unknown integers to integers, and allow serializing integers to integers. Similar to above. I am unsure how I feel about either way.

It would be quite easy to write a version of ecto_enum that uses the thrift enum values and that gets recompiled when enums change.

The integer approach has a pro and a con: the calls of the macros get recompiled when the enum changes. Whereas atoms won't trigger a recompile. In new versions of Elixir we should have user code get recompiled if the struct keys change.

The values are integers, the act of enumerating them is to give them names. When you try to treat the values as their names, things seem to get funky.

I view this the other way around, an enum is a set of identifiers that happen to get serialized as integers. From a users perspective the set of identifiers is more useful and powerful.

Another "fundamental" concern to me: Backwards and forwards compatibility is a core concern in thrift. The protocol and existing clients allow for the contents of a message to change. I think the implementation around enums should comply. If an enum value is added upstream, I don't think client applications should start crashing because they don't know how to convert the enum to its name.

Another "fundamental" concern to me: Backwards and forwards compatibility is a core concern in thrift. The protocol and existing clients allow for the contents of a message to change. I think the implementation around enums should comply. If an enum value is added upstream, I don't think client applications should start crashing because they don't know how to convert the enum to its name.

Is this covered by:

This suggests that we would want to deserialize unknown integers to integers, and allow serializing integers to integers.

Or do you have another concern?

No I think that's kind of it. Nothing in the thrift spec says that a client is required to validate the values of an enumerated field. That may or may not be intentional, but I think it jives with forward/backward compatibility.

A good example of this is TApplicationException. Here we support a few more error types than some other libraries. However it does not matter for other languages if we send the higher value, as long as at some point we converge to the same set, to those languages it is just an unknown error until they "learn" it.

What about a hybrid solution? I'm not sure I like it much better, but you could

  • On serialization, pass integers through and convert atoms to integers.
  • On deserialization, attempt to convert to atom but keep integer if an atom is not found.

That would preserve forward/backward compatibility, allow the developer to treat the values as values and not their names if they so choose (which is The Right Way in my curmudgeonly opinion), but also make possible the ergonomic improvements you guys are talking about.

OTOH It would make the API less concrete, which is something I'm not a huge fan of. But it's not that fuzzy, so 🤷‍♂️

Really good points @dantswain, thanks for bringing these up.

The pass-through case has got us in the past, should have considered that. I think the hybrid atom/integer approach is a good solution though.

So here's how I understand that the API will look, including @jparise's proposed rename of name_to_value/1/value_to_name/1.

enum UserStatus {
  ACTIVE = 1,
  INACTIVE = 2,
}

struct User {
  1: UserStatus status;
}
{user, ""} = User.deserialize(thrift_binary)

# Recognized integer deserialized as atom.
%User{status: :ACTIVE} = user

# Unrecognized (perhaps recently added) integer deserialized as integer.
%User{status: 3} = user

# serialize() handles both atoms and known/unknown integers.
User.serialize(%User{status: :ACTIVE})
User.serialize(%User{status: 2})
User.serialize(%User{status: 3})

# value() converts both atoms and integers to integers.
1 = UserStatus.value(:ACTIVE)
2 = UserStatus.value(2)
3 = UserStatus.value(3)

# name() converts both atoms and integers to atoms if recognized, else integer.
:ACTIVE   = UserStatus.name(:ACTIVE)
:INACTIVE = UserStatus.name(2)
3         = UserStatus.name(3)

My expectations for the user experience of this change:

  1. If you're just serializing/deserializing, i.e. the pass-through scenario, this will be transparent.

  2. If you're branching on or constructing enum values using the generated macros, it will still work but will print deprecation warnings at compile time.

  3. If you're using hard-coded enum integers, it will break.

  4. If you're deserializing enums and writing them to a database as integers, it will break. You'll need to start passing them through value/1.

  5. If you're reading enum integers from a database and serializing them, it will work transparently.

Note that as long as we’re generating macros, we have to worry about name classes with the meta, name, and value functions.

I could accept this as a compromise, but I'm still a little dubious on mixing types for the values :/ The values are integers... that's sort of unambiguous in the Thrift IDL. ACTIVE=1 => 1 is the value, ACTIVE is the name. 🤷‍♂️

If we go this route, do we want to pass the names through the same case transformations that we have been using? I.e., should :ACTIVE become :active? We do that with constants, I think.

An alternative could be to implement the inspect protocol where the integer values are replaced by their macro calls (rather than the atoms suggested above):

IO.inspect user
%User{status: Status.INACTIVE}

Is this too much magic?

Oooh. I kind of like that, actually.

@pguillory what do you think about inspecting with macro calls?

An alternative could be to implement the inspect protocol where the integer values are replaced by their macro calls

Is this too much magic?

I tried this out, and unfortunately its not enough magic :(. It worked great for when we had the struct itself. However if we have a set of enums (or another collection) we have to rely on MapSet to inspect these and we still get the integers. We can actually make it work if the MapSet is nested inside the struct itself. It was very nice to use when inspect gave the macro call but having different inspection results based on context is confusing.

I think that deserializing known values to atoms and unknown values to integers and serializing both the atoms and values is a good way to go because it is the clearest user experience, and feels more natural in Elixir. So what @pguillory proposed above.