snowplow/iglu-scala-client

Add invalidation for the schema-list cache

Closed this issue · 3 comments

Problem

Currently, schema lists are cached based on the SchemaListKey, which does not include the revision and addition parts of the full SchemaKey.

This opens up a breaking scenario, particularly affecting the transformer.

  • New schema 1-0-1 is added to the server
  • Cache still has listing for 1-*-* as [1-0-0]
  • Transformer receives an event with the new version.
  • Transformer would then consider 1-0-0 to be the highest available schema for the model.
  • The new event, which was 1-0-1, would get downcast to 1-0-0, often producing a bad row.
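The stale-listing scenario above can be sketched in a few lines. Note that `SchemaListKey`, `SchemaVer`, and the vendor/name values here are simplified stand-ins for illustration, not the actual iglu-scala-client types:

```scala
// Minimal sketch of the stale-listing scenario; these case classes are
// simplified stand-ins, not the real iglu-scala-client internals.
final case class SchemaListKey(vendor: String, name: String, model: Int)
final case class SchemaVer(model: Int, revision: Int, addition: Int)

// Cache keyed only by (vendor, name, model): publishing 1-0-1 changes
// neither the key nor the cached value, so [1-0-0] keeps being served
// until the TTL expires.
val cache = Map(
  SchemaListKey("com.acme", "checkout", 1) -> List(SchemaVer(1, 0, 0))
)

val listing = cache(SchemaListKey("com.acme", "checkout", 1))
val highest = listing.maxBy(v => (v.revision, v.addition))
// highest is still 1-0-0, so an incoming 1-0-1 event would be downcast to it
```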

Possible solutions

  • Create an API to invalidate the cache, so consumers could resolve such a conflict if it arises.
  • Create an API listSimilarSchemas(k: SchemaKey), which would detect whether k is in the cached list and automatically invalidate/refresh the cache.

I think the second solution is the cleaner one.
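The second proposal could look roughly like this. This is only a hedged sketch: the types, the `fetch` callback, and the mutable-map cache are all hypothetical simplifications standing in for the client's real resolver and cache:

```scala
// Hypothetical sketch of listSimilarSchemas: if the incoming key is missing
// from the cached listing, treat the entry as stale and re-fetch it.
final case class SchemaVer(model: Int, revision: Int, addition: Int)
final case class SchemaKey(vendor: String, name: String, version: SchemaVer)
final case class SchemaListKey(vendor: String, name: String, model: Int)

class SchemaListCache(fetch: SchemaListKey => List[SchemaVer]) {
  private val cache =
    scala.collection.mutable.Map.empty[SchemaListKey, List[SchemaVer]]

  def listSimilarSchemas(k: SchemaKey): List[SchemaVer] = {
    val listKey = SchemaListKey(k.vendor, k.name, k.version.model)
    cache.get(listKey) match {
      // Cached listing already contains k: serve it as-is.
      case Some(listing) if listing.contains(k.version) => listing
      // Miss, or listing does not know about k: invalidate and refresh.
      case _ =>
        val fresh = fetch(listKey)
        cache.update(listKey, fresh)
        fresh
    }
  }
}
```

With this shape, the transformer's lookup for an unseen 1-0-1 automatically triggers a refresh instead of serving the stale [1-0-0] listing.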

Hi @voropaevp I'm trying to think through how urgent this problem is. A typical cache TTL might be 600 seconds. Realistically, on a production pipeline there will be >600 seconds between a schema being published to Iglu and when production tracking starts using the new schema.

I understand that there is a 600-second window where data is either wrongly (but safely) loaded or becomes a failed event. But I'm not feeling like it's a critical problem.

Please tell me though if there is a more dangerous edge case that I haven't thought of.

These ideas around cache invalidation might become more important for development setups. E.g. if we ever put the rdb transformer into Snowplow Mini, or its replacement. In dev setups, it is probably more important to allow rapid evolution of schema versions.

This problem reminded me of another issue, which is similar but probably more important: This one. Although it's for Redshift, which you haven't started looking at yet.

While this is possible...I think it's unlikely to happen on a real-life production setup? With a TTL on the cache we are sure that we eventually end up with fresh state everywhere, so not getting stuck with stale state is the crucial bit.

This is a somewhat similar case to patching schemas: e.g. 1-0-0 is patched (content changed without a version bump) in Iglu, and new data referencing the patched schema may arrive while the cache still holds stale content. What can we do about that?

But as we're in the process of making the transformer/loader more robust/resilient, and you, @voropaevp, have a neat idea how to solve it (e.g. the second option you mentioned)...why not? ;)

I also like your second possible solution, out of the two.