cltk/cltkv1

Determine API paradigms

todd-cook opened this issue · 2 comments

To jump-start us, it may be helpful to consider blueprints for design and implementation.
Here's one to start us off and get us thinking.

The Textacy project wraps Spacy and yet provides some very interesting algorithmic guidance for users. https://github.com/chartbeat-labs/textacy

For example, if we wanted to bend CLTK to be more teacher-student friendly, we could wrap (CorpusReader -> Readability stats -> "top N easiest sentences in author, work")
or any visualization tools (one can do "pip install textacy[viz]")
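
To make that concrete, here is a rough sketch of the kind of wrapper I mean -- CorpusReader, .sents(), and readability_score() are placeholders here, not existing CLTK API:

def readability_score(sentence: str) -> float:
    # Placeholder metric: average word length; swap in whatever measure we adopt.
    words = sentence.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def top_n_easiest_sentences(reader, n: int = 10):
    # `reader` is assumed to expose .sents() yielding sentence strings,
    # roughly like the existing corpus readers.
    scored = [(readability_score(sent), sent) for sent in reader.sents()]
    scored.sort(key=lambda pair: pair[0])  # lower score == easier
    return scored[:n]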

Cons: I don't think we should make Spacy a dependency of the 2.0, and I think tight integration is quite a gamble and tends to end badly. However, the textacy story thus far has been interesting in how far one can extend merely by mostly wrapping.

and I think tight integration is quite a gamble

I'm in favor of the ever-praised (but rarely well defined) "loosely coupled". Practically speaking, in my own programming learning curve, I know I need to be smarter (or more foresightful, maybe) about error handling and sane fallbacks.
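
For instance, keeping a heavy backend optional might look roughly like this; the function name, error message, and the choice of stanfordnlp are illustrative, not a committed design:

def get_dependency_pipeline(language: str):
    # Treat the third-party backend as optional; fail with a clear,
    # actionable message instead of an ImportError deep inside a pipeline.
    try:
        import stanfordnlp
    except ImportError as err:
        raise RuntimeError(
            "Dependency parsing requires the optional package 'stanfordnlp' "
            "(pip install stanfordnlp)."
        ) from err
    return stanfordnlp.Pipeline(lang=language)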

don't think we should make Spacy a dependency of the 2.0

I agree about spacy; in fact, I think we should avoid it entirely (that said, I admire the project very much and use it often elsewhere). For a user's I/O, textacy and stanfordnlp share some basic similarities -- somewhat similar to your and Patrick's readers.

To get the conversation started, here are other libraries' APIs for dependency parsing.

Textacy

Examples: https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html

>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
>>> print(doc)
Doc(71 tokens; "The apparent symmetry between the quark and lep...")

>>> doc.to_bag_of_words(lemmatize=False, as_strings=False)
{205123: 1, 21382: 1, 17929: 1, 175499: 2, 396: 1, 29774: 1, 27472: 1,
 4498: 1, 1814: 1, 1176: 1, 49050: 1, 287836: 1, 1510365: 1, 6239: 2,
 3553: 1, 5607: 1, 4776: 1, 49580: 1, 6701: 1, 12078: 2, 63216: 1,
 6738: 1, 83061: 1, 5243: 1, 1599: 1}
>>> doc.to_bag_of_terms(ngrams=2, named_entities=True,
...                     lemmatize=True, as_strings=True)
{'apparent symmetry': 1, 'baryon number': 1, 'electric charge': 1,
 'fractional electric': 1, 'fundamental relationship': 1,
 'hypothetical color': 1, 'lepton family': 1, 'model theory': 1,
 'standard model': 2, 'triplet boson': 1}

I don't see dependency parsing specifically documented in textacy's API. I do see these, though:

>>> textacy.spacy_utils.get_objects_of_verb(verb)
>>> textacy.spacy_utils.get_subjects_of_verb(verb)
>>> textacy.spacy_utils.is_negated_verb(token)

@todd-cook it looks like textacy is wrapping just a few operations downstream of spacy's dependency parse -- do I understand correctly?
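
If that's right, the pattern is roughly: spacy assigns .dep_ and .head, and the helpers are just filters over those attributes. A guess at what that looks like (an illustration of the pattern, not textacy's actual source; assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The apparent symmetry suggests a more fundamental relationship.")

verb = next(tok for tok in doc if tok.pos_ == "VERB")
# subjects/objects are just the verb's children filtered by dependency label
subjects = [tok for tok in verb.lefts if tok.dep_ in ("nsubj", "nsubjpass")]
objects = [tok for tok in verb.rights if tok.dep_ in ("dobj", "attr")]
print(verb.text, [t.text for t in subjects], [t.text for t in objects])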

StanfordNLP

Examples: https://stanfordnlp.github.io/stanfordnlp/pipeline.html

import stanfordnlp

MODELS_DIR = '.'
stanfordnlp.download('en', MODELS_DIR) # Download the English models
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR, treebank='en_ewt', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result


nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'text: {word.text+" "}\tlemma: {word.lemma}\tupos: {word.upos}\txpos: {word.xpos}' for sent in doc.sentences for word in sent.words], sep='\n')

^^^ This isn't in fact a dependency example. See here for the bare API, but no example: https://stanfordnlp.github.io/stanfordnlp/depparse.html
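
If I read that page correctly, usage would be something like the following (untested; print_dependencies, governor, and dependency_relation are the names the docs list):

import stanfordnlp

nlp = stanfordnlp.Pipeline()  # the default pipeline already includes depparse
doc = nlp("Barack Obama was born in Hawaii.")
doc.sentences[0].print_dependencies()
# or walk the words directly; each carries its governor index and relation label
for word in doc.sentences[0].words:
    print(word.text, word.governor, word.dependency_relation)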

Spacy

import spacy

nlp = spacy.load("en_core_web_sm")  # any loaded pipeline provides the nlp object

# Construction via create_pipe
parser = nlp.create_pipe("parser")

# Construction from class
from spacy.pipeline import DependencyParser

parser = DependencyParser(nlp.vocab)  # built this way, it still needs trained weights (e.g. parser.from_disk)
doc = nlp(u"This is a sentence.")
# This usually happens under the hood
processed = parser(doc)
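
And the consumption side, for comparison: once the pre-trained pipeline above has run, the parse is just attribute access on the tokens:

# continuing from the nlp pipeline loaded above
doc = nlp(u"This is a sentence.")
for token in doc:
    print(token.text, token.dep_, token.head.text)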

@todd-cook this isn't much, but is this the kind of beginning you're imagining?

I have a simplistic API idea that I'll try to push this weekend.

For example, if we wanted to bend CLTK to be more teacher-student friendly, we could wrap (CorpusReader -> Readability stats -> "top N easiest sentences in author, work")
or any visualization tools (one can do "pip install textacy[viz]")

This example, to be honest, sounds somewhat like a downstream task for users. I haven't thought much about what place the CorpusReader should hold in the new API.

and I think tight integration is quite a gamble and tends to end badly. However the textacy story thus far has been interesting in how far one can extend merely by mostly wrapping.

Maybe the lesson we should learn from all of this is not that we shouldn't wrap certain libraries (clearly we need to), and not that we should tie ourselves to only one (no single project seems perfectly suited to us), but that we should have multiple integrations. If we can get, e.g., spacy outputting decent results for our languages, then by all means let's go for it. Availability of multiple upstream libraries sounds like a smart contingency plan, should any one of them cause breakage outside of our control.
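
Concretely, I'm imagining a thin dispatch layer along these lines -- the mapping and function names are purely illustrative, not a proposed API -- so that swapping or dropping an upstream library only costs us one adapter:

PARSER_BACKENDS = {
    "grc": "stanfordnlp",  # e.g. if the UD Ancient Greek models prove good enough
    "la": "stanfordnlp",
    "en": "spacy",
}

def dependency_parse(text: str, language: str):
    backend = PARSER_BACKENDS.get(language)
    if backend == "spacy":
        import spacy
        nlp = spacy.load("en_core_web_sm")
        return [(t.text, t.dep_, t.head.text) for t in nlp(text)]
    if backend == "stanfordnlp":
        import stanfordnlp
        nlp = stanfordnlp.Pipeline(lang=language)
        doc = nlp(text)
        return [(w.text, w.dependency_relation, w.governor)
                for sent in doc.sentences for w in sent.words]
    raise NotImplementedError(f"No dependency parser registered for {language!r}")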

@todd-cook I will keep this one open as a place to get your feedback on the naive Document object I've created: https://github.com/cltk/cltkv1/blob/master/src/cltkv1/utils/data_types.py#L123
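
For anyone reading along without clicking through: the shape is roughly a plain data container that each process reads from and annotates (a simplified sketch, not a copy of that file; the field names here are illustrative):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Document:
    language: str
    raw: str
    tokens: Optional[List[str]] = None
    lemmata: Optional[List[str]] = None
    pos: Optional[List[str]] = None
    # further processes (e.g. a dependency parser) would attach their
    # output as additional fields rather than returning new objects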