trinodb/trino

Enhance Documentation around spooling protocol

lozbrown opened this issue · 3 comments

  1. JDBC driver documentation mentions:

The JVM process using the JDBC driver must have network access to the spooling object storage.

Is this true for all retrieval modes? If so, can this be made clearer? Otherwise, which retrieval modes is it required for? It seems like it might not be required for COORDINATOR_PROXY and WORKER_PROXY.

  2. shared-secret-key
    Bearing in mind that many users will be using a role that's assumed on the host/container.
    Is this passed to the end client? If this key has access to more than the spooled data, is there a data leak risk here?
    What permissions are required here?

In my organisation, I don't think I can grant access to generate presigned URIs, and I imagine this is similar for others.

  3. retrieval modes
    Some suggestions about which is the most performant, and the trade-offs, would help. Something like this (all guesswork):
| retrieval-mode | Storage access required | High load on coordinator | High load on worker |
| --- | --- | --- | --- |
| STORAGE | true | false | false |
| COORDINATOR_STORAGE_REDIRECT | true | false | false |
| COORDINATOR_PROXY | false | true | false |
| WORKER_PROXY | false | false | true |
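
To make the proposed table easier to reason about, here is a minimal Python sketch that encodes it as a lookup. The values are copied from the guessed table above, not measured, and `ModeTraits`, `RETRIEVAL_MODES`, and `modes_without_client_storage_access` are hypothetical names invented for illustration; they are not part of Trino.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeTraits:
    """Trade-offs of one retrieval mode, as guessed in the table above."""
    client_needs_storage_access: bool
    high_coordinator_load: bool
    high_worker_load: bool


# Values mirror the table in this issue; they are assumptions, not benchmarks.
RETRIEVAL_MODES = {
    "STORAGE": ModeTraits(True, False, False),
    "COORDINATOR_STORAGE_REDIRECT": ModeTraits(True, False, False),
    "COORDINATOR_PROXY": ModeTraits(False, True, False),
    "WORKER_PROXY": ModeTraits(False, False, True),
}


def modes_without_client_storage_access():
    """Return the modes a client could use without direct access to object storage."""
    return [
        name
        for name, traits in RETRIEVAL_MODES.items()
        if not traits.client_needs_storage_access
    ]
```

Under these assumptions, a client that cannot reach the spooling object storage would be limited to the two proxy modes.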

For WORKER_PROXY, will being involved in spooling prevent a worker from completing graceful shutdown?
If the coordinator is typically behind a load balancer (as in most EKS deployments) that terminates SSL, what URI is it going to give out? The load balancer typically does not cover workers.

@wendigo

  1. Correct. It's required for STORAGE and COORDINATOR_STORAGE_REDIRECT.
  2. No, it's not passed to the clients. It stays on the nodes only and it's used only there. It's not used to encrypt the data itself. The presigned URI is generated based on the authentication method/permissions of the filesystem configuration.
  3. The table is correct. It won't prevent a worker from graceful shutdown. The client can retry segment retrieval using the initial URL that points to the coordinator, which will choose the next available worker.
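
The retry behaviour described in point 3 can be sketched roughly as follows. This is an illustration of the idea only, not the actual client implementation: `fetch_segment` and its `fetch` callback are hypothetical, and `fetch` stands in for an HTTP GET that follows the coordinator's redirect to whichever worker it picks.

```python
from typing import Callable


def fetch_segment(
    initial_url: str,
    fetch: Callable[[str], bytes],
    max_attempts: int = 3,
) -> bytes:
    """Fetch one spooled segment, restarting from the coordinator URL on failure.

    Each attempt goes back to the initial coordinator URL rather than to the
    worker that failed, so the coordinator can choose another worker.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(initial_url)
        except ConnectionError as exc:
            # Assumed failure mode: the worker serving the segment went away,
            # for example because it is shutting down gracefully.
            last_error = exc
    raise RuntimeError(
        f"segment not retrievable after {max_attempts} attempts"
    ) from last_error
```

Because every retry starts at the coordinator, the client never needs to know which workers are still available.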

The coordinator/worker URI follows the same construction logic as any other URI - if the coordinator is behind the LB, it will use X-Forwarded-* headers to generate URIs.
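
As a rough illustration of what "uses X-Forwarded-* headers to generate URIs" means in general, the sketch below derives an externally visible base URI from forwarded headers. It is not Trino's code: `external_base_uri` is a hypothetical helper, and the exact set of headers honoured and their precedence are assumptions.

```python
def external_base_uri(headers: dict, default_scheme: str, default_host: str) -> str:
    """Build the base URI a client should see, preferring forwarded headers.

    Falls back to the node's own scheme and host when the request did not
    pass through a proxy or load balancer that sets these headers.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme = lowered.get("x-forwarded-proto", default_scheme)
    host = lowered.get("x-forwarded-host", default_host)
    port = lowered.get("x-forwarded-port")
    # Only append the forwarded port if the host does not already carry one.
    if port and ":" not in host:
        host = f"{host}:{port}"
    return f"{scheme}://{host}"
```

For example, a request arriving with `X-Forwarded-Proto: https` and `X-Forwarded-Host: trino.example.com` would yield `https://trino.example.com` instead of the node's internal address.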

FYI also @martint, maybe we should discuss the different modes some more.

The feature and docs are mostly centred around the STORAGE retrieval mode and its usage and benefits, since that is the default.

Having said that .. I am happy to review any PRs to clarify these aspects in the docs.