NVIDIA/nv-ingest

[FEA]: Support arbitrary python functions to determine document split points

Opened this issue · 0 comments

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Currently preventing usage

Please provide a clear description of problem this feature solves

In an ideal world we'd have cleanly extracted document section header metadata including location.

Then we could use the location of such document demarcations to support splitting on those demarcations.

However, some separators are arbitrary text content not likely to ever be identified by an ootb model. As a result, users would like to be able to run an arbitrary python function which can return split locations.

Describe the feature, and optionally a solution or implementation and any alternatives

See above

Additional context

No response