Open Storage SDK: use cases for withSessionId when calling buildBatchReadSession

Question

Closed this issue 23 days ago · 3 comments

Version: 0.49.0-public, on the public cloud.
I have run into a problem: table a has about 1,500 partitions, each roughly 4-5 MB (about 10,000 rows).
In practice, buildBatchReadSession takes about 10 s.

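I see that withSessionId is available here, which means I could cache the SessionID.
What I am not sure about is this: if I cache it and reuse it later, do the values used when creating the TableBatchReadSession (getSplitOption, requiredPartitionColumns, orderedRequiredDataColumns, withFilterPredicate, requiredPrunedPartitionSpecs) all have to be identical to the ones used before?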

    scanBuilder.identifier(TableIdentifier.of(table.getDbName(), table.getName()))
            .withSettings(mcCatalog.getSettings())
            .withSplitOptions(mcCatalog.getSplitOption())
            .requiredPartitionColumns(requiredPartitionColumns)
            .requiredDataColumns(orderedRequiredDataColumns)
            .withArrowOptions(
                    ArrowOptions.newBuilder()
                            .withDatetimeUnit(TimestampUnit.MILLI)
                            .withTimestampUnit(TimestampUnit.NANO)
                            .build()
            )
            .requiredPartitions(requiredPrunedPartitionSpecs)
            .withFilterPredicate(filterPredicate)

Answer 1 · 2024-12-04T12:17:28.000Z

No, those conditions do not need to be met. When a SessionID is specified, the other fields are ignored.

    public TableBatchReadSession createBatchReadSession(TableReadSessionBuilder builder) throws IOException {
        if (builder.getSessionId() == null) {
            return new TableBatchReadSessionImpl(builder.getIdentifier(),
                    builder.getRequiredPartitions(),
                    builder.getRequiredDataColumns(),
                    builder.getRequiredPartitionColumns(),
                    builder.getRequiredBucketIds(),
                    builder.getSplitOptions(),
                    builder.getArrowOptions(),
                    builder.getSettings(),
                    builder.getFilterPredicate());
        } else {
            return new TableBatchReadSessionImpl(builder.getIdentifier(),
                    builder.getSessionId(),
                    builder.getSettings());
        }
    }

Answer 2 · 2024-12-04T12:20:16.000Z

Because TableBatchReadSession implements Serializable, a common practice is to cache the serialized TableBatchReadSession (its ObjectStream). Compared with sending a request to reload the session, this is more efficient.
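A minimal sketch of the caching idea, assuming only standard Java serialization. FakeReadSession is a hypothetical stand-in for TableBatchReadSession so that the example runs without the SDK, and the helper names toBytes and fromBytes are also made up for illustration. In real code you would pass the session returned by buildBatchReadSession instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SessionCacheSketch {

    // Hypothetical stand-in for TableBatchReadSession (not an SDK class).
    static final class FakeReadSession implements Serializable {
        private static final long serialVersionUID = 1L;
        final String sessionId;

        FakeReadSession(String sessionId) {
            this.sessionId = sessionId;
        }
    }

    // Serialize any Serializable session into bytes that can be cached
    // (in memory, on disk, or in an external store).
    static byte[] toBytes(Serializable session) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(session);
        }
        return buffer.toByteArray();
    }

    // Restore a cached session from bytes; the caller casts to the expected type.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] cached = toBytes(new FakeReadSession("example-session-id"));
        FakeReadSession restored = (FakeReadSession) fromBytes(cached);
        System.out.println(restored.sessionId);
    }
}
```

The cached bytes are only useful while the server-side session is still valid, so a cache like this would likely need an expiry policy; the thread does not state how long a session lives.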

We may later implement a better way to transfer and cache TableBatchReadSession.

Answer 3 · 2024-12-05T02:39:29.000Z

Got it, thanks!