Indexing text from nested objects
mikegoatly opened this issue · 10 comments
Created from this initial discussion.
Description
Allow for nested objects to provide content for a field. Similar to indexing an array of strings from an object, but an additional delegate needs to be provided to read the text from each of the nested objects in turn.
.WithField(
    "Captions",             // The name of the field to record all the nested object text under
    v => v.CaptionTracks,   // The set of nested objects
    ct => ct.GetFullText()) // A delegate to read the text for each nested object
@h0lg FYI I've created this issue to track the discussion - it's a bit easier for me to manage here
@h0lg sorry if my previous thoughts have seemed a bit rushed - they have been :)
I've just had 5 minutes to sit down and have a proper think about this with the LIFTI code in front of me and have updated the description above. As it currently stands I know it doesn't satisfy all your requirements - the bit about being able to "tag" search results indexed against a nested object isn't currently there (in your concrete case, the language from CaptionTrack).
"Tagging" in this way is something that I'd need to spend a bit more time thinking about - conceptually it wouldn't necessarily be limited to nested objects; I can see a case for indexing tags for the main object as well. I think it would make sense to split these two requirements apart - this issue should be fairly trivial to implement, whereas tagging would be quite a bit more work.
Is there anything else I've missed from your original requirements?
@mikegoatly Thanks for considering my thoughts and running with them for a bit!
I don't 100% recognize what you describe as "tagging" - do you mean that in my example of the Video with multiple CaptionTracks (each for a different Language) I said I'd like to identify the CaptionTrack and suggested Language?
That's only because if I have multiple CaptionTracks, I'd want to know which one the match occurred in. And yeah, that requirement is essential. If I could also write a field query for e.g. Captions_EN to only look in CaptionTracks with Language equal to EN, that'd be a bonus.
Do you see why I brought the Language of the CaptionTrack into play for identification in my draft? If I index a Video and just know I have a match in a CaptionTrack, but not in which one, I'd have to search all CaptionTracks again to find the matched one.
Sorry for the delay.
Do you see why I brought the Language of the CaptionTrack into play for identification in my draft?
Yeah, sorry, I think I ended up trying to oversimplify things. If there's a need to identify which nested object the text belonged to we're going to need to either:
- Use a dynamically generated field (a bit like I described in the initial comment)
- Introduce some new mechanism (not sure what right now)
Dynamically generating fields probably makes most sense because searching would work out-of-the-box. The only drawback is that the current design of LIFTI isn't expecting there to be many fields so internally they are stored (and serialized) as a byte. With dynamically generated fields this could become an issue as the number of fields won't be known until after the index has been built, leading to exceptions being thrown when an item is indexed and the field count grows too big.
Some of the fundamentals for how fields are stored against an index will also need to change - currently the fields are baked in when the index is configured and built. With dynamic fields, the IndexFieldLookup would need to be capable of dynamically registering fields on the fly. Not a dealbreaker, but something to consider.
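To make the constraint above concrete, here is a minimal sketch of a lookup that registers field names on the fly while still storing each id as a single byte. The class name DynamicFieldLookup and its members are hypothetical and written purely for illustration; this is not LIFTI's actual implementation, only one way the "register on first sight, fail when the byte range is exhausted" behaviour could look.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical usage: a field is registered the first time its name is seen.
var lookup = new DynamicFieldLookup();
byte english = lookup.GetOrRegister("Captions_EN");
byte german = lookup.GetOrRegister("Captions_DE");
Console.WriteLine($"{english}, {german}, {lookup.GetOrRegister("Captions_EN")}");

// Illustrative only; not LIFTI's real field lookup.
public sealed class DynamicFieldLookup
{
    private readonly Dictionary<string, byte> idsByName = new();
    private readonly List<string> namesById = new();

    // Returns the existing id for a known field, or assigns the next free one.
    public byte GetOrRegister(string fieldName)
    {
        if (idsByName.TryGetValue(fieldName, out var existing))
            return existing;

        // Ids are stored as a single byte, so only 256 distinct fields (0..255) fit.
        if (namesById.Count > byte.MaxValue)
            throw new InvalidOperationException(
                $"Cannot register field '{fieldName}': the byte-sized id space is exhausted.");

        var id = (byte)namesById.Count;
        namesById.Add(fieldName);
        idsByName.Add(fieldName, id);
        return id;
    }

    // Resolves an id back to its name; throws for ids that were never registered.
    public string GetName(byte id) =>
        id < namesById.Count
            ? namesById[id]
            : throw new KeyNotFoundException($"Field id {id} has no associated field name");
}
```

With a design like this the limit only shows up at indexing time, when the 257th distinct field name arrives, which is the drawback described above.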
@h0lg I've done some work over on #69 that I'm hoping will cover your use case here as well. If you're so inclined you can try out a prerelease version of the package as discussed here.
You'll be able to use this syntax to register your dynamic fields:
.WithDynamicFields(
    v => v.CaptionTracks,   // The set of nested objects to treat as dynamic fields
    ct => ct.LanguageName,  // A delegate to read the name of the field from each nested object
    ct => ct.GetFullText()) // A delegate to read the text for each nested object
@mikegoatly Sorry for snoozing through your message from late Feb - or maybe I didn't know how to respond and it got away from me. Thanks for picking up the ball!
I've done some work over on #69 that I'm hoping will cover your use case here as well.
Cool, that looks promising - I'll try that out as soon as I get back to it!
Sticking with my example of a Video with CaptionTracks and each CaptionTrack being identified in the context of the Video by LanguageName:
What if the CaptionTrack had not only the GetFullText() that I wanted to index, but also a ReverseText that I'd want to search for hidden demonic incantations? Would I then configure the Video object tokenizer to create more dynamic fields like this?
.WithDynamicFields(vid => vid.CaptionTracks, ct => "CaptionFullText_" + ct.LanguageName, ct => ct.GetFullText())
.WithDynamicFields(vid => vid.CaptionTracks, ct => "CaptionReverseText_" + ct.LanguageName, ct => ct.ReverseText)
No problem at all!
Yes, that would work. Alternatively, if the two bits of text were related and you wanted them under the same field, the API design allows you to return an array of strings for the field text, e.g.:
.WithDynamicFields(
    vid => vid.CaptionTracks,
    ct => ct.LanguageName,
    ct => new[] { ct.GetFullText(), ct.ReverseText })
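For readers following along, the sketch below shows what the three delegates amount to when applied to a single Video: each nested object contributes its texts to a field named after it. It does not use LIFTI at all. The Video and CaptionTrack shapes are assumed from the discussion, and DynamicFieldSketch.ReadFields is a hypothetical helper written only to illustrate the grouping.

```csharp
using System;
using System.Collections.Generic;

var video = new Video(new[]
{
    new CaptionTrack("EN", "hello world", "dlrow olleh"),
    new CaptionTrack("DE", "hallo welt", "tlew ollah"),
});

// The same three delegates as in the snippet above: nested objects, field name, field texts.
var fields = DynamicFieldSketch.ReadFields(
    video,
    vid => vid.CaptionTracks,
    ct => ct.LanguageName,
    ct => new[] { ct.GetFullText(), ct.ReverseText });

foreach (var (name, texts) in fields)
    Console.WriteLine($"{name}: {string.Join(" | ", texts)}");

// Assumed model types, shaped after the example in this thread.
public sealed record CaptionTrack(string LanguageName, string FullText, string ReverseText)
{
    public string GetFullText() => FullText;
}

public sealed record Video(IReadOnlyList<CaptionTrack> CaptionTracks);

// Hypothetical helper; not part of LIFTI.
public static class DynamicFieldSketch
{
    // Groups the text of every nested object under the field name derived from it.
    public static Dictionary<string, List<string>> ReadFields<TItem, TChild>(
        TItem item,
        Func<TItem, IEnumerable<TChild>> children,
        Func<TChild, string> fieldName,
        Func<TChild, IEnumerable<string>> fieldText)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (var child in children(item))
        {
            var name = fieldName(child);
            if (!result.TryGetValue(name, out var texts))
                result[name] = texts = new List<string>();
            texts.AddRange(fieldText(child));
        }
        return result;
    }
}
```

Registering two separate prefixed fields instead (CaptionFullText_EN and CaptionReverseText_EN) keeps the two kinds of text independently searchable, at the cost of doubling the number of field ids used per language.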
@mikegoatly The prerelease version seems to work fine when I search an index built in memory right before the search. However, when I serialize that index to disk, deserialize it and run the same search on it again, I run into this error:
Field id 5 has no associated field name
at Lifti.IndexedFieldLookup.GetFieldForId(Byte id) in D:\a\1\s\src\Lifti.Core\IndexedFieldLookup.cs:line 35
at Lifti.Querying.Query.<>c__DisplayClass7_0`1.<Execute>b__1(ScoredFieldMatch m) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 69
at System.Linq.Enumerable.SelectListIterator`2.ToList()
at Lifti.Querying.Query.Execute[TKey](IIndexSnapshot`1 index) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 66
at Lifti.FullTextIndex`1.Search(IQuery query) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 264
at Lifti.FullTextIndex`1.Search(String searchText) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 253
Does this ring a bell? Otherwise I'll investigate further.
@h0lg yeah, that makes sense. The serialized file format doesn't contain any of the extra information needed about the dynamic fields, so when it's deserialized the index data structures will contain references to field numbers that no longer exist.
I'll have a look at what's needed to extend the format to include and rehydrate it into a new index.
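As a rough illustration of the kind of information that has to survive a round trip, the sketch below writes a table of field ids and names to a stream and reads it back. The binary layout (a count followed by id/name pairs) and the FieldTableSketch helper are assumptions made up for this example; they say nothing about how LIFTI's serialized format is actually structured or how it was eventually extended.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Field ids as they might exist in a freshly built index (values are made up).
var fieldNames = new Dictionary<byte, string> { [4] = "Title", [5] = "Captions_EN" };

using var stream = new MemoryStream();
FieldTableSketch.Write(stream, fieldNames);
stream.Position = 0;
var restored = FieldTableSketch.Read(stream);

// Because the table was persisted, id 5 still resolves to a name after reloading.
Console.WriteLine(restored[5]);

// Hypothetical helper; the layout here is not LIFTI's file format.
public static class FieldTableSketch
{
    public static void Write(Stream stream, IReadOnlyDictionary<byte, string> fields)
    {
        using var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true);
        writer.Write(fields.Count);      // number of entries that follow
        foreach (var (id, name) in fields)
        {
            writer.Write(id);            // one byte per field id
            writer.Write(name);          // length-prefixed UTF-8 string
        }
    }

    public static Dictionary<byte, string> Read(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        var count = reader.ReadInt32();
        var fields = new Dictionary<byte, string>(count);
        for (var i = 0; i < count; i++)
        {
            var id = reader.ReadByte();
            fields[id] = reader.ReadString();
        }
        return fields;
    }
}
```

Without a table like this in the file, a deserialized index has matches that point at ids such as 5 but no way to map them back to a name, which matches the exception reported above.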
Merging this with #69