delta-io/delta-sharing

Support Pagination in QueryTable and QueryTableChanges APIs

charlenelyu-db opened this issue · 0 comments

This is a proposal to add pagination support to the QueryTable and QueryTableChanges APIs.

Motivation

Currently, there is no mechanism to limit the number of files returned per QueryTable request. When reading from tables that contain millions of files, the server may be unable to process that many files in a single request, leading to timeouts or exceeded resource limits. The client may also struggle to handle such large responses efficiently. This limitation is a performance bottleneck for the Delta Sharing service.

By introducing pagination in data access APIs, we can control the number of files returned in each API call. This will result in a more scalable Delta Sharing server and client solution.

Protocol Change

We propose the following protocol changes:

QueryTable

HTTP Request Value
Method

POST

Headers

Authorization: Bearer {token}
Content-Type: application/json; charset=utf-8

URL

{prefix}/shares/{share}/schemas/{schema}/tables/{table}/query

URL Parameters

No Change

Body

Add two optional fields:

  • maxFiles (type: Int, optional): the maximum number of files to return in one page. Must be positive. If more files are available than maxFiles, the response includes a nextPageToken that can be used in a subsequent request to get the next page of results. This is a hint for the server, and the server may not honor it. A server that supports pagination should return no more than maxFiles files, but it may return fewer. The client should check nextPageToken in the response to determine whether more files are available.
  • pageToken (type: String, optional): specifies the page token to use to retrieve the subsequent page. Set pageToken to the nextPageToken returned by a previous request to get the next page of results.

Example:

POST {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables/vaccine_patients/query

{
  "maxFiles": 123,
  "pageToken": "..."
}

200: The tables were successfully returned.

HTTP Response Value
Headers

Content-Type: application/x-ndjson; charset=utf-8
Delta-Table-Version: {version}

Body (example)
{
  "protocol": {
    "minReaderVersion": 1
  }
}
{
  "metaData": {
    "id": "string",
    "format": {
      "provider": "parquet"
    },
    "schemaString": "string",
    "partitionColumns": [
      "date"
    ]
  }
}
{
  "file": {
    "url": "string",
    "id": "string",
    "partitionValues": {
      "date": "2021-04-28"
    },
    "size": 573,
    "stats": "string"
  }
}
{
  "endStreamAction": {
    "nextPageToken": "string"
  }
}

Note: the endStreamAction JSON wrapper object must be returned as the last line in the response. If there are no more pages available, the server may omit nextPageToken or return it as an empty string. The client must treat both cases as the end of the results.
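
As an illustration of the client side, here is a minimal Python sketch of a loop that follows nextPageToken until it is missing or empty. It is not part of the proposal: post_query is a hypothetical callable that sends the JSON body to the .../query endpoint and returns the NDJSON response text, and the function names are made up for this example.

```python
import json
from typing import Callable, Iterator, Optional


def parse_page(ndjson_text: str) -> tuple[list[dict], Optional[str]]:
    """Split one NDJSON response into its actions and the next page token."""
    actions = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    token = None
    if actions and "endStreamAction" in actions[-1]:
        # endStreamAction is the last line; its token may be absent or empty,
        # and both are normalized to None here.
        token = actions.pop()["endStreamAction"].get("nextPageToken") or None
    return actions, token


def query_all_files(post_query: Callable[[dict], str], max_files: int) -> Iterator[dict]:
    """Yield every file action, requesting pages until no token is returned.

    post_query is an assumed transport hook, not an API of any real client.
    """
    if max_files <= 0:
        raise ValueError("maxFiles must be positive")
    body = {"maxFiles": max_files}
    while True:
        actions, token = parse_page(post_query(body))
        for action in actions:
            if "file" in action:
                yield action
        if not token:
            break
        # Pass the previous nextPageToken back as pageToken.
        body = {"maxFiles": max_files, "pageToken": token}
```

A response without any endStreamAction line (for example, from a server that ignores maxFiles) is also treated as the last page.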

QueryTableChanges

HTTP Request Value
Method

GET

Headers

Authorization: Bearer {token}

URL

{prefix}/shares/{share}/schemas/{schema}/tables/{table}/changes

URL Parameters

No Change

Query Parameters

Add two optional query parameters:

  • maxFiles (type: Int, optional): the maximum number of files to return in one page. Must be positive. If more files are available than maxFiles, the response includes a nextPageToken that can be used in a subsequent request to get the next page of results. This is a hint for the server, and the server may not honor it. A server that supports pagination should return no more than maxFiles files, but it may return fewer. The client should check nextPageToken in the response to determine whether more files are available.
  • pageToken (type: String, optional): specifies the page token to use to retrieve the subsequent page. Set pageToken to the nextPageToken returned by a previous request to get the next page of results.

Example:

GET {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables/vaccine_patients/changes?startingVersion=0&endingVersion=2&maxFiles=123&pageToken=...
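
Because pageToken travels in the query string here, it needs to be URL-encoded. The following Python sketch shows one way a client might build the request URL. The function name and argument names are hypothetical, and only the query parameters shown in the example above are covered.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def changes_url(
    prefix: str,
    share: str,
    schema: str,
    table: str,
    starting_version: int,
    ending_version: Optional[int] = None,
    max_files: Optional[int] = None,
    page_token: Optional[str] = None,
) -> str:
    """Build a QueryTableChanges URL with optional pagination parameters."""
    if max_files is not None and max_files <= 0:
        raise ValueError("maxFiles must be positive")
    params = {"startingVersion": starting_version}
    if ending_version is not None:
        params["endingVersion"] = ending_version
    if max_files is not None:
        params["maxFiles"] = max_files
    if page_token:
        # The token is opaque to the client; urlencode escapes it as needed.
        params["pageToken"] = page_token
    path = "/".join(
        ["shares", quote(share, safe=""),
         "schemas", quote(schema, safe=""),
         "tables", quote(table, safe=""),
         "changes"]
    )
    return f"{prefix}/{path}?{urlencode(params)}"
```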

200: The tables were successfully returned.

HTTP Response Value
Headers

Content-Type: application/x-ndjson; charset=utf-8
Delta-Table-Version: {version}

Body (example)
{
  "protocol": {
    "minReaderVersion": 1
  }
}
{
  "metaData": {
    "id": "string",
    "format": {
      "provider": "parquet"
    },
    "schemaString": "string",
    "partitionColumns": [
      "date"
    ],
    "configuration": {
      "enableChangeDataFeed": "true"
    }
  }
}
{
  "cdf": {
    "url": "string",
    "id": "string",
    "partitionValues": {
      "date": "2021-04-28"
    },
    "size": 573,
    "timestamp": 1652141000000,
    "version": 1
  }
}
{
  "endStreamAction": {
    "nextPageToken": "string"
  }
}

Note: the endStreamAction JSON wrapper object must be returned as the last line in the response. If there are no more pages available, the server may omit nextPageToken or return it as an empty string. The client must treat both cases as the end of the results.
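
For the server side, a rough Python sketch of how one page could be rendered with endStreamAction as the last line is shown below. The proposal does not define the token format; encoding a file offset as base64 JSON is purely an assumption for illustration, as are the function names. A real server would more likely encode a position in the Delta log.

```python
import base64
import json
from typing import Optional


def encode_token(offset: int) -> str:
    """Pack an offset into an opaque string (assumed format, not specified)."""
    return base64.urlsafe_b64encode(json.dumps({"offset": offset}).encode()).decode()


def decode_token(token: Optional[str]) -> int:
    """Return the offset stored in a token, or 0 for the first page."""
    if not token:
        return 0
    return json.loads(base64.urlsafe_b64decode(token.encode()))["offset"]


def render_page(
    header_actions: list[dict],
    file_actions: list[dict],
    max_files: Optional[int],
    page_token: Optional[str],
) -> str:
    """Render one NDJSON page; endStreamAction is always the final line."""
    if max_files is not None and max_files <= 0:
        raise ValueError("maxFiles must be positive")
    start = decode_token(page_token)
    end = len(file_actions) if max_files is None else start + max_files
    page = file_actions[start:end]
    end_stream = {}
    if end < len(file_actions):
        # More files remain, so hand the client a token for the next page.
        end_stream["nextPageToken"] = encode_token(end)
    lines = list(header_actions) + page + [{"endStreamAction": end_stream}]
    return "\n".join(json.dumps(action) for action in lines) + "\n"
```

On the last page this sketch omits nextPageToken entirely, which is one of the two end-of-results forms the note above allows.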