curiosity-ai/catalyst

"Collection was modified; enumeration operation may not execute" thrown by await FastTextLanguageDetector.FromStoreAsync in .NET Core 3.1

ProductiveRage opened this issue · 8 comments

The following code throws an InvalidOperationException with the message "Collection was modified; enumeration operation may not execute." on the line that calls await FastTextLanguageDetector.FromStoreAsync when the application targets .NET Core 3.1.

However, it works fine when targeting .NET 5!

using System;
using System.IO;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

namespace CatalystSimilarityExample
{
    class Program
    {
        static async Task Main()
        {
            const string modelFolderName = "catalyst-models";
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage(modelFolderName));
            // Throws InvalidOperationException here when targeting .NET Core 3.1:
            var languageDetector = await FastTextLanguageDetector.FromStoreAsync(
                Language.Any,
                Version.Latest,
                ""
            );
        }
    }
}

The stack trace shows this:

at System.ThrowHelper.ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
at Catalyst.Models.FastText.CompactSupervisedModel()
at Catalyst.Models.FastTextLanguageDetector.<FromStoreAsync>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at CatalystSimilarityExample.Program.<Main>d__0.MoveNext() in C:\Users\Dan\source\repos\ParallelLinqExample\CatalystSimilarityExample\Program.cs:line 17

I am seeing this as well when trying to run the code from the Language Detection sample. I wonder if we should consider this project to be "Early Access" or .NET 5 only? It looks like a very cool project, though!

Also, in my testing of the same code, the LanguageDetector was only 56.8% accurate. For example code, that's not good! I run the samples to decide whether I want to use a library for real, and that result is somewhat less than inspiring. Maybe I should just bite the bullet and call spaCy scripts from my application? It would be much better to use pure .NET, but only if it works. Maybe I need to look into upgrading to .NET 5?

@gillonba I've seen similar issues recently with the FastText language detector; I need to investigate whether something weird is going on on .NET 5.0.

Can you try the other model for language detection in the meantime?

var langDetect = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

@ProductiveRage Fixed the bug with loading the FT model - it was a recent memory optimization added to the FastText model that broke loading classifier models from disk.
@gillonba I'm building a new version of Catalyst now; you should be able to test again.
Regarding accuracy, if you can provide me some samples of the data you're testing, I can check what the issue is.

Good deal, I'll have another look. I just used the dataset and code from the LanguageDetection sample and added a counter to track the number correct versus the total. I don't have it in front of me, and I don't recall whether it was the long or the short sample (long, I think); of course, I was only able to run LanguageDetector. If you are seeing better results, maybe I am doing something wrong? I look forward to trying FastText!

Was the 56% for all the languages in the set, or for only one language?
I think the model won't perform too well on rare languages - it probably needs some fine-tuning of how we tokenize input text...

All languages in the Data file provided with the example. I just count the number of times the predicted language matches the language of the sample - specifically the long sample, I think. Is there any guidance at this point on how long a sample should be to produce accurate results?
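For reference, the counting approach described above can be sketched roughly as follows, using the Document/Process pattern from the Catalyst samples. This is only a sketch: the `samples` collection of (text, expected language) pairs is a hypothetical stand-in for the sample's own data loading, and `languageDetector` is assumed to have been loaded via FromStoreAsync as in the repro above.

```csharp
// Assumptions: `samples` is a hypothetical IEnumerable<(string Text, Language Expected)>
// built from the sample's Data file; `languageDetector` is already loaded.
int correct = 0, total = 0;
foreach (var (text, expected) in samples)
{
    var doc = new Document(text);   // wrap the raw text in a Catalyst Document
    languageDetector.Process(doc);  // the detector writes its prediction to doc.Language
    total++;
    if (doc.Language == expected)
        correct++;
}
Console.WriteLine($"Accuracy: {correct}/{total} ({100.0 * correct / total:0.0}%)");
```

With the long-text samples, each `text` here would be one full document from the Data file; accuracy on short snippets would likely be lower, since the detector has less signal to work with.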

@ProductiveRage fixed the bug with loading the FT model - was a recent memory optimization added to the FT model that broke loading classifier models from disk.

Sorry, @theolivenbaum, I missed this update somehow - I can confirm that I've tested with 3.1 and it works fine now! I'm closing this issue, even though there seems to be an open question from @gillonba about detection accuracy. I suspect that should be a separate issue?