salsify/avromatic

Error handling when working with a registry

Closed this issue · 3 comments

fxn commented

I am trying to understand potential errors when working with a registry (only for decoding).

If I understand decoding correctly, we have a reader's schema that comes from the configured schema store (static, file system), and a writer's schema that is fetched as needed from the registry, and cached.

If I am right, that means that the registry isn't hit to fetch any writer schema until a message arrives (makes sense by definition), and after that you only hit the registry if a message with a different version arrives.

If that is correct, would like to understand these things:

  1. Is there a way to fail fast when booting (before messages arrive) in order to detect a misconfigured URL?

  2. What is the behavior of the gem if the registry is unreachable? Does it have a retry mechanism or does it raise right away? If it raises, I guess you should be ready for that with any message, right? Because of schema evolution and the need to fetch future schemas.

  3. If a writer's schema is incompatible with the reader's schema you get Avro::IO::SchemaMatchException, right?

  4. If a schema is invalid Avro::SchemaValidator::ValidationError seems to be raised, right? What is an invalid schema? Does the registry accept invalid schemas?

Anything else that could be relevant, please share!

Also, if you'd like to have any of this information included in the docs, would be glad to volunteer a patch.

tjwp commented

Hi @fxn,

Sorry for the delay getting back to you.

  1. Is there a way to fail fast when booting (before messages arrive) in order to detect a misconfigured URL?

The best way that I can think to do this at boot time is to make a request to the registry that is not schema specific, and then fail if the request fails. Something like:

From the perspective of this gem, one of those methods can be called on Avromatic.schema_registry.
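
A minimal sketch of such a boot-time check. It assumes the registry client responds to a `subjects` method (a request that is not schema specific); that method name, the `verify_registry!` helper, and the `RegistryUnavailable` class are illustrative assumptions rather than anything confirmed in this thread.

```ruby
# Boot-time smoke test for the schema registry (illustrative sketch).
# Assumption: `registry` responds to `subjects`; swap in whichever
# non-schema-specific request your registry client actually offers.
class RegistryUnavailable < StandardError; end # hypothetical error class

def verify_registry!(registry)
  registry.subjects # the response is ignored; only reachability matters
  true
rescue StandardError => e
  # Re-raise with context so a misconfigured URL is obvious in the boot log.
  raise RegistryUnavailable, "schema registry check failed: #{e.class}: #{e.message}"
end

# In an initializer, after avromatic has been configured:
#   verify_registry!(Avromatic.schema_registry)
```

Raising a dedicated error class lets the boot script tell a registry problem apart from other startup failures.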

  2. What is the behavior of the gem if the registry is unreachable? Does it have a retry mechanism or does it raise right away? If it raises, I guess you should be ready for that with any message, right? Because of schema evolution and the need to fetch future schemas.

This gem will fail immediately if it needs to make a request to the registry and it is unavailable. There is no built-in retry mechanism.

Each version of a schema is only fetched once and cached in memory. But if messages arrive with ids for new schema versions they will need to be fetched.

The requests to fetch a schema by id are cacheable, so if possible it is good to have a caching layer in front of the registry to increase availability.
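
Since the gem does not retry on its own, callers who want retries have to wrap the decoding call themselves. A minimal sketch, assuming exponential backoff is acceptable; `with_retries` and its parameters are hypothetical names and not part of avromatic:

```ruby
# Retry a block a few times with exponential backoff (illustrative sketch).
# `on` lists the error classes worth retrying; pass whatever network errors
# your HTTP client raises. Any other error propagates immediately.
def with_retries(attempts: 3, base_delay: 0.5, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield
  rescue *on
    raise if tries >= attempts            # out of attempts: re-raise the last error
    sleep(base_delay * (2**(tries - 1)))  # 0.5s, 1s, 2s, ... with the defaults
    retry
  end
end

# Usage (the decode call is a placeholder for your own decoding code):
#   with_retries(on: [SomeNetworkError]) { decode(message) }
```

Keeping the retried error list narrow matters here: a schema mismatch will not fix itself on a second attempt, so only connectivity errors are worth retrying.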

  3. If a writer's schema is incompatible with the reader's schema you get Avro::IO::SchemaMatchException, right?

Yes, I believe you are correct: Avro::IO::SchemaMatchException is raised if the schemas are incompatible.

  4. If a schema is invalid Avro::SchemaValidator::ValidationError seems to be raised, right? What is an invalid schema? Does the registry accept invalid schemas?

If a schema is invalid, then you'll get an Avro::SchemaParseError. A registry should not accept invalid schemas, but different languages implement more or less of the Avro specification. For example, the official Ruby implementation does not validate field defaults. This was recently added to avro-patches. So this behavior may depend on which registry you are using.

If you're looking at Avro::SchemaValidator::ValidationError, then you are probably using avro-patches or the master branch from the official avro repo. This error is raised by the validator when a datum to be encoded is incompatible with the schema. Usually these errors are not raised unless you call the SchemaValidator directly. Instead, the ValidationError is rescued and Avro::IO::AvroTypeError is raised.

I hope that helps!

fxn commented

Awesome reply, thanks very much @tjwp!

fxn commented

This gem will fail immediately if it needs to make a request to the registry and it is unavailable.

Don't know if that may be a localhost vs network difference, but in my tests using actual services the failure is not really immediate, it takes 60 seconds. Since my purpose is to fail fast at boot time to catch configuration errors, I need to reduce the timeout.

Apparently, excon is ultimately responsible for that minute, and it sits deep inside the dependencies, with no API exposed for timeouts as far as I can tell. I have seen this related issue in avro_turf. I have also seen that excon doesn't have configurable global timeouts.

Early in the boot process, I change the defaults of excon this way:

require "excon"
# Timeouts are in seconds and apply to every excon connection in the process.
Excon.defaults[:connect_timeout] = 5
Excon.defaults[:read_timeout] = 5

A bit hackish, and for starters the client used is private so no guarantees this is going to work in the future. But I believe that is enough for my use case today.
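
A different approach that does not touch excon at all is to bound the boot-time check with Ruby's standard-library `Timeout`. This is a swapped-in technique, not what the comment above does, and `fail_fast` is a hypothetical helper name. `Timeout.timeout` interrupts the block from another thread, which can be risky in long-running code, so this sketch assumes it is only used for a one-off check at boot:

```ruby
require "timeout"

# Run a block under a wall-clock deadline (illustrative sketch).
# Raises Timeout::Error if the block has not finished within `seconds`.
def fail_fast(seconds)
  Timeout.timeout(seconds) { yield }
end

# Usage at boot (the block body is a placeholder for your registry check):
#   fail_fast(5) { check_registry }
```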