mikegoatly/lifti

Item not getting indexed

markwriter opened this issue · 5 comments

First off, this is an amazing library and I really appreciate it. It's so well organized and documented!

The problem I'm having is this: I have indexed a corpus of 792 objects. The word "broker" appears in four of these objects, but does not appear to be indexed at all. When I build the index and search for the word "broker" it returns 0 search results.

I've experimented with just selecting the TopicContentSearchModels with the word "broker" and indexing those. When I do this, the word "broker" does get indexed and the search works properly, returning four search results.

I also experimented with sorting the original list of objects and building the index that way. When I do that, "broker" gets indexed and the search works properly, returning four search result records.

For reference I've included the search model as well as all my 'sandbox' code. I've serialized the search models and attached a file if you have any interest in reproducing the issue, as well as having some real-world data to play with.

public class TopicContentSearchModel
{
    public int TopicId { get; set; }
    public string TopicName { get; set; }
    public string Content { get; set; }
}

private void SetUpTestForLifti()
        {
            //*** BEGIN STUFF FOR LIFTI testing
            //load objects to be indexed
            var jsonFile = File.ReadAllText("C:\\Users\\manderson\\Documents\\umdata.json");
            var searchModels =
                JsonConvert.DeserializeObject<System.Collections.Generic.List<TopicContentSearchModel>>(jsonFile);
            //****SCENARIO 1
            //**** This is the defect - the word 'broker' exists 4 times in this corpus
            //**** but when the search is performed it is not found
            
            var indexWithAllModels = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();

            indexWithAllModels.AddRangeAsync<TopicContentSearchModel>(searchModels);
            //Test Search A will have zero records - this is the issue
            var searchAllModels = indexWithAllModels.Search("broker").ToList();
            
            //Serialize search models for possible shipping off to Lifti author for help
            var topicJson = JsonConvert.SerializeObject(searchModels);

            var brokerTopics =
                searchModels.Where(x =>
                    x.TopicName.IndexOf("broker", StringComparison.CurrentCultureIgnoreCase) != -1 ||
                    x.Content.IndexOf("broker", StringComparison.CurrentCultureIgnoreCase) != -1
                ).ToList();

            var nonBrokerTopics = searchModels.Except(brokerTopics).ToList();

            //For this test, add topics with "broker" first, then
            //add other items one by one and see if the number
            //of items found when searching for "broker" changes
            var indexWithBrokersThenAddNonBrokerTopics = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();

            //Add records with term 'broker'
            indexWithBrokersThenAddNonBrokerTopics.AddRangeAsync(brokerTopics);
            //add in each record that does not have broker, see if search returns other than 4
            foreach (var nonBrokerTopic in nonBrokerTopics)
            {
                indexWithBrokersThenAddNonBrokerTopics.AddAsync(nonBrokerTopic);
                var testAddingNonBrokerTopicToBrokerTopics = indexWithBrokersThenAddNonBrokerTopics.Search("broker").ToList();
                if (testAddingNonBrokerTopicToBrokerTopics.Count != 4)
                {
                    throw new Exception($"Look out non-broker topic {nonBrokerTopic.TopicId} threw up");
                }
            }

            //for this test, add all topics without the word "broker"
            //then add in the topics with the word broker one at a time
            //and see if the index can find the words
            var testIndex2 = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();
            testIndex2.AddRangeAsync(nonBrokerTopics);
            for (var index = 0; index < brokerTopics.Count; index++)
            {
                var brokerTopic = brokerTopics[index];
                testIndex2.AddAsync(brokerTopic);
                var testSearch2 = testIndex2.Search("broker").ToList();
                if (testSearch2.Count != index + 1)
                {
                    throw new Exception($"Look out broker topic {brokerTopic.TopicId} threw up");
                }
            }
        }

umdata.zip

Hi @markwriter, thanks for reporting this with a detailed repro and sample data!

I can reproduce the issue, but I haven't worked out exactly what the problem is yet - I'll find some time soon to dig in a bit further.

Ok, a minimal repro is this:

var reproIndex = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(o => o.WithStemming())
    .Build();

reproIndex.BeginBatchChange();
await reproIndex.AddAsync(1, "broker");
await reproIndex.AddAsync(3, "broken");
await reproIndex.AddAsync(4, "brokerage");
await reproIndex.CommitBatchChangeAsync();

// Should be 1, is currently 0
Console.WriteLine(reproIndex.Search("broker").Count());

Some observations so far:

  • If you search for broker* you do get matches
  • If the index isn't built in a batch change, i.e. items added one at a time to the index, searching for broker does work.

My conclusion so far is that this is an issue with the way the batch change logic builds the index tree. I'll try to look into that this evening.
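Until a fix is available, the second observation suggests a possible workaround: add the items without wrapping them in a batch change. The sketch below reuses only the calls from the minimal repro above (`AddAsync`, `Search`) plus the `broker*` wildcard query from the first observation. It assumes a project that references the LIFTI package and targets a C# version with top-level statements; the variable names `exactMatches` and `wildcardMatches` are illustrative.

```csharp
using System;
using System.Linq;
using Lifti;

// Same index configuration as the minimal repro.
var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(o => o.WithStemming())
    .Build();

// Add items one at a time, awaiting each call, with no
// BeginBatchChange/CommitBatchChangeAsync pair around them.
await index.AddAsync(1, "broker");
await index.AddAsync(3, "broken");
await index.AddAsync(4, "brokerage");

// Exact-term search: reported in this thread to work when no batch is used.
var exactMatches = index.Search("broker").Count();
Console.WriteLine($"broker: {exactMatches}");

// Wildcard search: reported to return matches even when the batch bug is hit.
var wildcardMatches = index.Search("broker*").Count();
Console.WriteLine($"broker*: {wildcardMatches}");
```

Adding items individually gives up whatever the batch API is meant to save, so this is only a stopgap for versions affected by the bug.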

Thanks so much for looking at that, also thanks for boiling down the data required to repro the issue. And, the observations you wrote are helpful as well - very interesting.

@markwriter v3.5.2 is getting pushed to NuGet now. Let me know how you get on with it once it's there; I think it should fix the issue for you.

As an aside, in your sample code I noticed that you weren't awaiting any of the async methods. This is working for you at the moment because no async configuration has been applied to the index, but just to be on the safe side I would always recommend awaiting them in case this behavior changes.
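To illustrate why awaiting matters, here is a small self-contained sketch. `SlowIndex` is a hypothetical stand-in, not LIFTI's implementation: its `AddAsync` simulates asynchronous work with `Task.Delay`, which is the kind of situation where reading the index before the add has completed would miss the item.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for an index whose AddAsync does real asynchronous work.
// It only illustrates the timing hazard; it is not how LIFTI works internally.
public sealed class SlowIndex
{
    private readonly HashSet<string> words = new HashSet<string>();

    public async Task AddAsync(string word)
    {
        await Task.Delay(50); // simulated asynchronous work
        words.Add(word);
    }

    public bool Contains(string word) => words.Contains(word);
}

public static class AwaitDemo
{
    public static async Task<bool> AddWithoutAwaitingAsync()
    {
        var index = new SlowIndex();
        Task pending = index.AddAsync("broker"); // started, but not awaited
        bool found = index.Contains("broker");   // runs before the add completes
        await pending;                           // observe the task before returning
        return found;
    }

    public static async Task<bool> AddAndAwaitAsync()
    {
        var index = new SlowIndex();
        await index.AddAsync("broker");          // add has completed here
        return index.Contains("broker");
    }

    public static async Task Main()
    {
        Console.WriteLine(await AddWithoutAwaitingAsync()); // False
        Console.WriteLine(await AddAndAwaitAsync());        // True
    }
}
```

In the sample code from the original post, the equivalent change would be to make the method `async Task` and put `await` in front of each `AddRangeAsync` and `AddAsync` call.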

Thanks very much - I was able to update Lifti and it works perfectly.
I also took your comments about async and fixed my code.
I have a couple of questions about some other things, but I may open another issue to ask them.
Again, really appreciate Lifti as well as the help.