castorm/kafka-connect-http

Handle records in reverse chronological order

castorm opened this issue · 7 comments

Problem

Some APIs don't offer control over the ordering of results.

This might be especially common on APIs that offer a snapshot of the N most recent records.

The source connector bases its offset advancement on chronologically ordered results, so latest committed offset would be that of the oldest record instead of the newest. Ignoring this would result in undesired consequences:

  • when offset is used in the query, offset would advance one record at a time and it would require as many queries as records in the response.
  • inability to deduplicate, which would only make the problem above worst.

Potential Solutions

  • automatically sorting records in chronological order, this would be the simplest from user perspective as it doest require even knowing of this concern, however, it would penalize performance, especially when records are reversed
  • allowing the user to specify the sorting direction of records in the response.

Unveiled as part of #19 as a result of inspecting NY Times API

Perhaps a 3-way behavior with two configuration settings?

connector.sorting.enabled as boolean (defaults to false)
connector.sorting.direction as forward or reverse

Allow the end user to not take the performance penalty if they know the source API will sort correctly for them. Also allowing the end user to choose the direction of connector sorting will cover strange use cases easily.

I'm still thinking about it, I was considering the possibility of providing different ordering criteria, or whether to make this extensible, but to be honest, I think I might have been over-engineering in my head.

What you suggest makes perfect sense, I think I'll do something in those lines. If anything I would probably go for reducing it to a single property with a default value, for the sake of simplicity. As in direction=Forward|Reverse, defaulting to Forward, which is the current behavior.

Thanks for your input, and stay tuned! :)

If using just a single property, then make it take None so the connector does no work at all. You are relying on a known, well written API. You still get 3-way behavior that way.

Yes, you are right, I misunderstood the purpose of the flag the first time I read it.

Actually I think it's a 4-way behavior:

  • Not sorted (server-sorted)
    • As-is (your None)
    • Reversed
  • Chronologically sorted
    • Oldest first (your Forward)
    • Newest first (your Reverse)

I think this categorization just made my mind, I think the cleanest would be for it to be two different properties in the end. Using SQL-like naming (default in parenthesis):

order.by: (None)|Timestamp
order.direction: (Asc)|Desc

I guess it's a matter of picking the right naming now.

This has been released as part of v0.7.4

Documentation can be found here:
https://github.com/castorm/kafka-connect-http#sorter