curiosity-ai/catalyst

Add a quick Dependency Parsing example to the readme.

cdibbs opened this issue · 6 comments

Is your feature request related to a problem? Please describe.

I am having trouble figuring out how dependency parsing works. I found the AveragePerceptronDependencyParser and added it to the NLP pipeline after instantiating it with FromStoreAsync(Language.English, Version.Latest, ""), but I don't know how to access its output. In particular, the DependencyType property on IToken looked promising, but it always seemed to be the empty string.

Describe the solution you'd like

It would be nice to see a couple of quick examples of how to work with it, such as extracting the root verb, subject, and object of a sentence.

Describe alternatives you've considered

Additional context

Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.Add(await AveragePerceptronDependencyParser.FromStoreAsync(Language.English, Version.Latest, ""));
nlp.ProcessSingle(doc);

Thanks for your work on what looks like a very promising library!

This might be more simplistic than what you're after, given that you're looking at the AveragePerceptronDependencyParser and want to extract a single root verb, but you can tag the tokens in a document with their part of speech like this:

Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var document = new Document("The quick brown fox jumps over the lazy dog", Language.English);
var nlp = await Pipeline.ForAsync(Language.English);
nlp.ProcessSingle(document);
foreach (var sentence in document)
{
    foreach (var word in sentence)
    {
        Console.WriteLine(word.POS + "\t" + word.Value);
    }
}

The output from the above code is this:

DET     The
ADJ     quick
ADJ     brown
NOUN    fox
VERB    jumps
ADP     over
DET     the
ADJ     lazy
NOUN    dog

(The PartOfSpeech enum values, such as DET and ADJ, match the standard abbreviations that you will see used elsewhere, such as in the "Part of Speech Tagging" section of another NLP library's tutorial.)

@ProductiveRage I appreciate the well-written tips, but you are correct that I need that dependency structure to extract "dobj", "nsubj", and the like.

If this library doesn't quite support that yet, I could do some educated guessing with simpler sentences, in which earlier nouns and pronouns are more likely to be the subject. I'd rather not, though. Another option would be to call Python's spaCy library via Python.NET, but that sounds brittle at best.
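For reference, the "educated guessing" fallback could look roughly like the sketch below. It is only an illustration: GuessSvo is a hypothetical helper, not part of Catalyst, and it works on plain (POS, value) string pairs rather than Catalyst's own token types, on the assumption that the tags are the same names printed in the output above (for example via word.POS.ToString()). It takes the first VERB, the nearest noun or pronoun before it as the subject, and the first noun or pronoun after it as the object.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Catalyst): guesses subject, verb and object
// from part-of-speech tags alone. Tags are assumed to be the names shown in
// the tagger output earlier in this thread (NOUN, VERB, ...).
static (string? Subject, string? Verb, string? Object) GuessSvo(
    IReadOnlyList<(string Pos, string Value)> tokens)
{
    static bool IsNominal(string pos) => pos is "NOUN" or "PROPN" or "PRON";

    // Treat the first verb in the sentence as the root verb.
    int verbIndex = -1;
    for (int i = 0; i < tokens.Count; i++)
    {
        if (tokens[i].Pos == "VERB") { verbIndex = i; break; }
    }
    if (verbIndex < 0) return (null, null, null);

    // Subject guess: the nearest noun or pronoun before the verb.
    string? subject = null;
    for (int i = verbIndex - 1; i >= 0; i--)
    {
        if (IsNominal(tokens[i].Pos)) { subject = tokens[i].Value; break; }
    }

    // Object guess: the first noun or pronoun after the verb.
    string? obj = null;
    for (int i = verbIndex + 1; i < tokens.Count; i++)
    {
        if (IsNominal(tokens[i].Pos)) { obj = tokens[i].Value; break; }
    }

    return (subject, tokens[verbIndex].Value, obj);
}

// The sentence from this thread, with the tags from the output above.
var sentence = new List<(string Pos, string Value)>
{
    ("DET", "The"), ("ADJ", "quick"), ("ADJ", "brown"), ("NOUN", "fox"),
    ("VERB", "jumps"), ("ADP", "over"), ("DET", "the"), ("ADJ", "lazy"),
    ("NOUN", "dog"),
};

var (subject, verb, obj) = GuessSvo(sentence);
Console.WriteLine($"subject={subject}, verb={verb}, object={obj}");
// prints "subject=fox, verb=jumps, object=dog"
```

Note that "dog" here is really the object of the preposition "over", not a direct object, which is exactly the kind of mistake a real dependency parse would avoid.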

Hi @cdibbs -

I just checked quickly, and it's strange: the value is supposed to be copied back to the token using this method, which ends up calling the methods here to store the values in the Document data store.

Then I stumbled on this line, and the issue became obvious 🤦‍♂️

It seems the code to train the dependency parser is just not finished, and it is not yet predicting the labels for the dependency type. I'll add this to my backlog, but if you want to try implementing the training, I'd be happy to get a PR for it!

@theolivenbaum I wouldn't mind giving it a try. I don't have much experience, though, and I am not sure where the training data is. Is that publicly available somewhere? Thanks!

Hello there!
I was wondering if I could use Catalyst to get a Parse Tree and I found this issue.
Did it go somewhere, or is it stuck in the pile of TODOs?

Thank you.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.