getindata/flink-http-connector

Enhance HTTP lookup join to support N:M relationships

MarekMaj opened this issue · 4 comments

The current flink-http-connector lookup join implementation supports an N:1 relationship by returning at most one value for a given lookup key. However, some scenarios require an N:M relationship.

The LookupFunction interface for the lookup join allows returning multiple values for a given lookup key. When multiple values are retrieved from the right-hand side of the join, the lookup join produces one event for each corresponding value. This feature has already been implemented as a reference in the JDBC Connector.

To implement this feature in the HTTP connector, the following considerations must be addressed:

  1. The current assumption that only one event is returned in the response body should be revisited. The connector must support returning a collection of values.
  2. The response format should support pagination. A HATEOAS-compatible interface can be assumed, utilizing links included in the response to navigate through the list of pages. Furthermore, we could consider support for providing API doc
  3. Since the new format is incompatible with the existing one, a configuration flag gid.connector.http.source.lookup.response.unwrap-multiple-values, defaulting to false, should be added for backward compatibility. This default can be changed in the future.
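
To make point 3 concrete, here is a minimal sketch of how the flag could switch between the two response shapes. It is an illustration only, not the connector's code: `ResponseUnwrapSketch` and `toRows` are hypothetical names, and the body is assumed to be JSON already decoded into plain `Map`/`List` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of flink-http-connector.
final class ResponseUnwrapSketch {

    // decodedBody is assumed to be JSON already decoded into Map/List values.
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> toRows(Object decodedBody, boolean unwrapMultipleValues) {
        if (!unwrapMultipleValues) {
            // Flag off (the proposed default): the whole body is treated as one row.
            return List.of((Map<String, Object>) decodedBody);
        }
        if (!(decodedBody instanceof List)) {
            throw new IllegalArgumentException("expected a JSON array in the response body");
        }
        // Flag on: every array element becomes its own row.
        List<Map<String, Object>> rows = new ArrayList<>();
        for (Object element : (List<?>) decodedBody) {
            rows.add((Map<String, Object>) element);
        }
        return rows;
    }
}
```

With the flag off, a single-object body still yields exactly one row, which is what keeps the change backward compatible.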

@MarekMaj
An interesting idea. Is this something you are interested in implementing?

It seems to me that it is implicit that we issue a rest call to get one item. Your suggestion implies there is more of a search API. It would be great to see a use case and example of what this would look like in SQL and also how this would map to config, lookup keys, the rest call request and how the response would be mapped to multiple items. I assume the idea is that an array of objects would be returned.

  1. Does your use case require pagination? I don't think Flink supports pagination on joins, so I wonder why we would need it here. HATEOAS-compatible interfaces are conceptually nice, but are very chatty because of the references. I am not sure how widely used they are these days. I had assumed the idea would be to call existing search REST APIs, which are probably not HATEOAS-compatible.

@davidradl
Thank you for the comment!

I assume the idea is that an array of objects would be returned.

Exactly, that’s the idea. The lookup keys won’t be affected. However, we do need to adapt the configuration slightly. Right now, the implementation assumes that the entire response body is transformed into a single RowData. With the configuration flag I mentioned earlier, we can change that assumption and instead expect a list of values for the specified table. That change should be backward compatible. The schema for a single RowData entry won’t change, and neither will the SQL. There’s no change in how the request maps to the list of response entries, since each result will still be joined with the input event.

I don't think Flink supports pagination on joins

This depends on the connector's underlying implementation; it will not affect the high-level Flink API.

HATEOAS-compatible interfaces are conceptually nice, but are very chatty

That's a good point. For this simple API, I don't think it's necessary to implement that. The main requirement I'm emphasizing is the need to introduce pagination, which is essential for a well-designed REST API that returns a list of objects. We can discuss how to implement this in a way that’s both simple and efficient.

Currently, our interface is a simplified version of Flink's lookup table API. In a nutshell, this change would extend its capabilities to match the LookupFunction interface fully, where a single lookup can return multiple values:

public abstract Collection<RowData> lookup(RowData keyRow) throws IOException;
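
As a rough illustration of the resulting fan-out, here is a sketch that uses plain strings in place of RowData. `MultiValueLookup` and `LookupJoinSketch` are hypothetical names, not Flink or connector APIs, and inner-join semantics are assumed (a key with no matches produces no output).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for a lookup source; strings replace RowData for brevity.
interface MultiValueLookup {
    Collection<String> lookup(String key);
}

final class LookupJoinSketch {

    // Emits one joined record per row returned for each left-hand key.
    static List<String> join(List<String> leftKeys, MultiValueLookup source) {
        List<String> out = new ArrayList<>();
        for (String key : leftKeys) {
            for (String row : source.lookup(key)) {
                out.add(key + " -> " + row);
            }
        }
        return out;
    }
}
```

A key that maps to two rows yields two output records, which is the N:M behaviour requested in this issue.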

I don't think Flink supports pagination on joins

This depends on the connector's underlying implementation; it will not affect the high-level Flink API.

👍

HATEOAS-compatible interfaces are conceptually nice, but are very chatty

That's a good point. For this simple API, I don't think it's necessary to implement that. The main requirement I'm emphasizing is the need to introduce pagination, which is essential for a well-designed REST API that returns a list of objects. We can discuss how to implement this in a way that’s both simple and efficient.

I can imagine at least several strategies for pagination:

  • no pagination,
  • next page link is returned in result (HATEOAS),
  • no "next page link" provided. Instead, send consecutive next page requests until result is empty; even in this case there might be multiple approaches, for example: &pageSize=20&pageNumber=N or &limit=20&offset=N*20.

I think the best approach would be to provide a few popular pagination strategies and let users implement their own if need be. A similar idea is already implemented for LookupQueryCreators (see this).
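
To sketch what such a pluggable strategy might look like for the "request pages until the result is empty" case: everything below is hypothetical (`PaginationStrategy`, `PaginationSketch`, the `maxPages` safety cap), and the HTTP call is abstracted as a function from a query string to a list of rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical extension point: turns a zero-based page index into query parameters.
interface PaginationStrategy {
    String queryFor(int pageIndex, int pageSize);
}

final class PaginationSketch {

    // Two of the parameter styles mentioned in the list above.
    static final PaginationStrategy PAGE_NUMBER =
            (page, size) -> "pageSize=" + size + "&pageNumber=" + page;
    static final PaginationStrategy LIMIT_OFFSET =
            (page, size) -> "limit=" + size + "&offset=" + (page * size);

    // Requests consecutive pages until one comes back empty or maxPages is reached.
    static List<String> fetchAll(PaginationStrategy strategy, int pageSize, int maxPages,
                                 Function<String, List<String>> httpGet) {
        List<String> all = new ArrayList<>();
        for (int page = 0; page < maxPages; page++) {
            List<String> rows = httpGet.apply(strategy.queryFor(page, pageSize));
            if (rows.isEmpty()) {
                break;
            }
            all.addAll(rows);
        }
        return all;
    }
}
```

The `maxPages` cap is an assumption added here so that a misbehaving endpoint cannot keep a lookup running forever. A HATEOAS-style strategy would need a different shape, because the next request depends on the previous response rather than on a page index.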

What is more, we currently need to provide 'format' = 'json'. Since the connector is meant to integrate with REST APIs, I believe we may assume that the format is always JSON.

At first glance, some major refactoring is needed. However, I have prepared a simple version of what we want to achieve: #135