Implement sub-chunking for offloads
nj1973 commented
Sometimes we encounter very large non-partitioned tables or very large single partitions, e.g. > 10TB.
At the moment the segment (i.e. top-level table or partition) is the smallest unit of chunking we support. If we cannot offload that smallest unit of work in a single pass (e.g. due to ORA-01555) then our only option is to keep increasing parallelism in the hope of completing before we run out of time.
It has been suggested that we should attempt to break the single segment, be it a table or a partition, into multiple transport jobs. The sketch below outlines the intended control flow.
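A minimal sketch of the proposed control flow, assuming hypothetical helper names (`plan_sub_chunks`, `transport_chunk_to_staging`, `validate_staged_chunks`, `load_staging_into_backend`) that do not exist in the codebase today:

```python
def offload_segment(segment, max_chunk_bytes):
    """Offload one table/partition, sub-chunking it if it exceeds the size limit."""
    if segment.size_bytes <= max_chunk_bytes:
        chunks = [segment]                      # current behaviour: one pass per segment
    else:
        chunks = plan_sub_chunks(segment)       # subpartitions or value ranges (see examples below)

    # Transport every chunk to the staging area (e.g. cloud bucket) first...
    staged = []
    for chunk in chunks:
        staged.append(transport_chunk_to_staging(chunk))

    # ...and only load into the backend once ALL chunks have been staged,
    # preserving the per-segment atomicity we have today.
    validate_staged_chunks(staged)
    load_staging_into_backend(staged)
```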
Example 1:
- Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size
- The table is also sub-partitioned
- Offload could loop through the subpartitions, adding them to the staging area one at a time, each split by ROWID ranges for parallelism (see the sketch after this list)
- Only after ALL subpartitions have been appended to the bucket does the Offload continue; that way we still get the atomicity we desire
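A minimal sketch of one way ROWID ranges for a single subpartition could be derived from DBA_EXTENTS, assuming a python-oracledb connection; the one-range-per-extent granularity and the illustrative table/owner names are assumptions, not how Offload works today:

```python
# Sketch only: one ROWID range per extent of a subpartition, so the subpartition
# can be transported in ROWID-bounded slices. Grouping extents into fewer,
# larger ranges is omitted for brevity.
import oracledb

ROWID_RANGE_SQL = """
SELECT dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno, e.block_id, 0)      AS lo_rid,
       dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 32767)                         AS hi_rid
FROM   dba_extents e
JOIN   dba_objects o
  ON   o.owner = e.owner
 AND   o.object_name = e.segment_name
 AND   o.subobject_name = e.partition_name
 AND   o.object_type = 'TABLE SUBPARTITION'
WHERE  e.owner = :owner
AND    e.segment_name = :table_name
AND    e.partition_name = :subpartition_name
ORDER  BY e.file_id, e.block_id
"""

def rowid_ranges(conn, owner, table_name, subpartition_name):
    """Return [(lo_rowid, hi_rowid), ...] covering one subpartition."""
    with conn.cursor() as cur:
        cur.execute(ROWID_RANGE_SQL, owner=owner, table_name=table_name,
                    subpartition_name=subpartition_name)
        return cur.fetchall()

# Each (lo, hi) pair could then drive one parallel transport slice, e.g.:
#   SELECT ... FROM owner.table SUBPARTITION (subpartition_name)
#   WHERE  rowid BETWEEN :lo AND :hi
```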
Example 2:
- Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size; there are NO subpartitions
- Offload detects that P2015 > MAX_OFFLOAD_CHUNK_SIZE
- Offload detects that the partition key TXN_DATE (or perhaps some other column) is not a single value but has a range of values
- Offload identifies n points between the min/max TXN_DATE
- Offload loops through the ranges, adding data to the staging area for each; each transport would be split natively by Spark on the range of values (see the sketch after this list)
- Only after ALL ranges have been appended to the bucket does the Offload continue; that way we still get the atomicity we desire
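A minimal PySpark sketch of Example 2, assuming the Oracle JDBC driver is on the classpath; the table/partition names, the min/max boundaries, the chunk count and the staging URI are all illustrative assumptions rather than existing Offload behaviour:

```python
# Sketch only: split one oversized date partition into n TXN_DATE sub-ranges and
# transport each range as its own Spark job, letting Spark split each range
# natively via the JDBC reader's partitionColumn/lowerBound/upperBound options.
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offload-subchunk-sketch").getOrCreate()

jdbc_url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"   # assumed connection details
lo = datetime.date(2015, 1, 1)    # MIN(TXN_DATE) in partition P2015 (assumed)
hi = datetime.date(2016, 1, 1)    # just past MAX(TXN_DATE) (assumed)
n = 4                             # number of sub-ranges (chunks)

step = (hi - lo) // n
bounds = [lo + step * i for i in range(n)] + [hi]

for range_lo, range_hi in zip(bounds, bounds[1:]):
    # One transport job per TXN_DATE range; Spark further splits the range
    # into numPartitions parallel JDBC reads on TXN_DATE.
    src = (f"(SELECT * FROM app.txn PARTITION (P2015) "
           f"WHERE txn_date >= DATE '{range_lo}' AND txn_date < DATE '{range_hi}')")
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", src)
          .option("partitionColumn", "TXN_DATE")
          .option("lowerBound", str(range_lo))
          .option("upperBound", str(range_hi))
          .option("numPartitions", 16)
          .load())
    # Append this range to the staging bucket; the backend load would only
    # happen after every range has been staged.
    df.write.mode("append").parquet("gs://staging-bucket/app.txn/P2015/")
```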