Kentico/xperience-algolia

Feature: Processing Large Documents in Chunks

Closed this issue · 2 comments

Motivation

Algolia's documentation recommends breaking large documents into chunks, both for better search relevance and to avoid hitting the record size limit of a plan.

If we use the Xperience crawler for a page, crack a PDF, or have many structured content fields with blocks of text, we might want to chunk this content into multiple records.

Currently, this library creates one Algolia search record for each page in the content tree, so chunking would have to be a fully custom solution outside of this integration.

Proposed solution

I'm not sure of the best way to introduce this feature at the moment, but my initial idea would be to start where the JObject is created from the page's content.

That JObject could be changed to an IList<JObject>, and everything that populates the collection could be updated to work with lists of records instead of a single item.
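A minimal sketch of that idea, assuming a hypothetical `CreateRecords` method in place of the current single-`JObject` creation step; the `pageUrl` field name and the 600-character chunk size are illustrative assumptions, not part of the current library:

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

public static class ChunkedRecordBuilder
{
    // Hypothetical sketch: instead of building one JObject per page,
    // build a list of records, one per chunk of a long text field.
    public static IList<JObject> CreateRecords(string pageUrl, string longText, int chunkSize = 600)
    {
        var records = new List<JObject>();

        for (int start = 0, ordinal = 0; start < longText.Length; start += chunkSize, ordinal++)
        {
            int length = Math.Min(chunkSize, longText.Length - start);

            records.Add(new JObject
            {
                // Each chunk becomes its own Algolia record with a unique objectID...
                ["objectID"] = $"{pageUrl}#{ordinal}",
                // ...but shares an attribute that can later be used for "distinct".
                ["pageUrl"] = pageUrl,
                ["content"] = longText.Substring(start, length)
            });
        }

        return records;
    }
}
```

A real implementation would probably split on paragraph or sentence boundaries rather than at a fixed character count, since a naive split can cut words and sentences in half.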

Or, it might be better to add a new indexing attribute that indicates a field should be chunked, and then provide a method for creating the chunks.
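As a rough illustration of that attribute-based alternative, assuming a hypothetical `ChunkedAttribute` and search model class (neither exists in the library today):

```csharp
using System;

// Hypothetical marker attribute: a property decorated with it would be split
// into multiple records during indexing instead of being indexed as one value.
// The name and the ChunkSize option are assumptions for discussion only.
[AttributeUsage(AttributeTargets.Property)]
public sealed class ChunkedAttribute : Attribute
{
    // Approximate maximum size of each chunk, in characters.
    public int ChunkSize { get; set; } = 600;
}

public class ArticleSearchModel
{
    public string Title { get; set; }

    // The indexing pipeline would detect this attribute and emit one record
    // per chunk of the article body, all sharing the page's identifier.
    [Chunked(ChunkSize = 800)]
    public string ArticleText { get; set; }
}
```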

Additional context

This would also require configuring an attribute for distinct on the index, so that multiple chunk records from the same page are de-duplicated in search results.
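For reference, the Algolia index settings involved are `attributeForDistinct` and `distinct`. A hedged sketch of that configuration, reusing the assumed `pageUrl` field from the chunking example above; how these settings get applied through this integration (versus directly via the Algolia dashboard or API) is also an assumption:

```csharp
using Newtonsoft.Json.Linq;

// Sketch of the index settings needed so that multiple chunk records
// collapse into a single search result per page.
var settings = new JObject
{
    // All chunks of the same page share this attribute value.
    ["attributeForDistinct"] = "pageUrl",
    // Return only the best-matching chunk per page.
    ["distinct"] = true
};
```

Without `distinct`, a page that matches in several chunks would appear multiple times in the results.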

@seangwright I've pushed a potential solution for this in #26. Can you please take a look and provide some feedback?

Currently, the entire "splitting" process is up to the developers to implement, and there is no new attribute to indicate which properties should be split. Is that acceptable, or should there be some default behavior?

Thanks! Yup - I'll take a look on Thursday when I'm back to working on the site using Algolia.