gibber
======

Sense tagging performed by trying to fit the senses' Wordnet graph neighbors into the word's context.

If you want to run the prepared experiments, review the options in wsd_config.py and run
experim.py (`python experim.py` or `python3 experim.py` depending on your
environment).

This document describes how to use the `gibber` package.

> from gibber.wsd import add_word_neighborhoods, predict_sense
> from gibber.annot_corp_loader import load_wn3_corpus
> sents, words_and_tags = load_wn3_corpus(path_to_corpus)
> add_word_neighborhoods(words_and_tags) # a collection of (lemma, tag) pairs
# predict the sense of word with index 1 in the sentence from sents collection
> predict_sense('wyjść', 'praet:sg:m1:perf', [l for (w, l, s, t) in sents[5]], 1)
(('82941_czasownik_10', tensor(1.00000e-02 * [[ 4.7619]]), 21, 21), None, ('82939_czasownik_16', tensor(1.00000e-02 *
[[ 7.6923]]), 21, 21), ('10303_czasownik_1', tensor([[ 0.1197]]), 21, 21), ('82931_czasownik_14', tensor(1.00000e-02 *
[[ 4.7619]]), 21, 0), ('10303_czasownik_1', tensor([[ 0.1197]]), 21, 21))

The returned tuple contains predictions made with various subsets of Wordnet
neighbors. See the Relation subsets section below, and the paper, for an
explanation.

NOTE: it is required to first call `add_word_neighborhoods` on the words you
wish to disambiguate. This retrieves the necessary information from the Wordnet
XML. Whenever possible, it is best to call this function once, on all the words
you will want to process.
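
For example, if your sentences do not come from one of the supported corpora,
you can collect the (lemma, tag) pairs yourself and retrieve all neighborhoods
in a single call. This is only a minimal sketch; `my_sentences`, its tags and
its structure are hypothetical, not part of the package:

from gibber.wsd import add_word_neighborhoods, predict_sense

# hypothetical input: each sentence as a list of (form, lemma, tag) triples,
# with forms and lemmas lowercased as required by the conventions below
my_sentences = [
    [('kot', 'kot', 'subst:sg:nom:m2'), ('wyszedł', 'wyjść', 'praet:sg:m2:perf')],
]

# gather the unique (lemma, tag) pairs and fetch their neighborhoods at once
pairs = {(lemma, tag) for sent in my_sentences for (form, lemma, tag) in sent}
add_word_neighborhoods(pairs)

# afterwards predict_sense can be called on any word of these sentences
lemmas = [lemma for (form, lemma, tag) in my_sentences[0]]
decisions = predict_sense('wyjść', 'praet:sg:m2:perf', lemmas, 1)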

experim.py uses functions from `gibber.annot_corp_loader`, which can import the
supported corpora and provide a list of sentences together with a set of unique
(lemma, tag) pairs. These are:

load_skladnica_wn2(skladnica_path, skladnica_sections_index_path):
    """Returns a list of sentences and a set of words. Each sentence is a list
    of (form, lemma, true_sense, tag) tuples; the set contains (lemma, tag)
    pairs."""
load_wn3_corpus(annot_sentences_path)
load_kpwr_corpus(kpwr_path)
For the meaning of the arguments, consult wsd_config.py.
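
A small usage sketch for one of the loaders; the path is a placeholder (the
prepared experiments take the real paths from wsd_config.py):

from gibber.annot_corp_loader import load_wn3_corpus

annot_sentences_path = 'path/to/annotated_corpus'  # placeholder
sents, words_and_tags = load_wn3_corpus(annot_sentences_path)

# each sentence is a list of (form, lemma, true_sense, tag) tuples
for (form, lemma, true_sense, tag) in sents[0]:
    print(form, lemma, true_sense, tag)

# words_and_tags is a set of (lemma, tag) pairs,
# ready to be passed to add_word_neighborhoods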


Configuration
-------------

See the file wsd_config.py in this directory. The comments there should contain
the necessary explanations.

Conventions
-----------
Forms and lemmas should be converted to lowercase.

For referring to Wordnet senses, we use a convention of symbols aaa_111, where
aaa is the lexical id and 111 is the variant number. If one of these is empty
(as it is in the loaded sense-annotated data), it is enough for the other part
to match. The `predict_sense` function returns the full information, in the
format aaa_POS_111, where POS is a POS type used in plWordnet.

For comparing senses, the function `gibber.wsd.sense_match` can be used. It
takes two arguments: the first should be a full sense specification, the
second may be partial.
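
For example, a full prediction can be checked against a partial annotation as
in the sketch below. The concrete symbols are only illustrative, and the shape
of the partial symbol (here the lexical id is left empty) is an assumption
about the annotated data, not a guarantee:

from gibber.wsd import sense_match

predicted = '82941_czasownik_10'  # full specification, as returned by predict_sense
annotated = '_10'                 # partial specification: only the variant is known

if sense_match(predicted, annotated):
    print('the prediction agrees with the annotation')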

Relation subsets
----------------
The `predict_sense` function returns a tuple of decisions, each made by taking
into account a different subset of Wordnet relations when collecting word
neighborhoods. Broadly speaking, the subsets can be described this way:
1 - synonymy.
2 - synonymy and antonymy.
3 - synonymy and meronymy(-like) relations.
4 - all of the above.

Additionally, word descriptions can be used:
5 - only descriptions.
6 - combining evidence from descriptions with that from relations.
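
As an illustration, the decisions for one word can be printed together with the
subset they come from. The sketch below continues the example from the top of
this document and assumes, as in the example output there, that the tuple
elements correspond in order to subsets 1-4 followed by the two
description-based variants:

from gibber.wsd import add_word_neighborhoods, predict_sense
from gibber.annot_corp_loader import load_wn3_corpus

sents, words_and_tags = load_wn3_corpus(path_to_corpus)  # as in the example above
add_word_neighborhoods(words_and_tags)

subset_labels = ['1: synonymy',
                 '2: synonymy + antonymy',
                 '3: synonymy + meronymy-like relations',
                 '4: all of the above',
                 '5: descriptions only',
                 '6: descriptions combined with relations']

decisions = predict_sense('wyjść', 'praet:sg:m1:perf',
                          [l for (w, l, s, t) in sents[5]], 1)
for label, decision in zip(subset_labels, decisions):
    print(label, '->', 'no decision' if decision is None else decision)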

Using unit descriptions
-----------------------

If you want to use evidence from unit descriptions, you need to configure the
Concraft parser (in wsd_config.py). It is recommended to run the mass_parsing.py
script beforehand; it can parse all the descriptions needed for tagging the
corpus. See wsd_config.py for its configuration.

Functions
---------

These live in `gibber.wsd`.

add_word_neighborhoods(words):
    """Retrieve neighbor senses from Wordnet for all given (lemma, tag) pairs and store them
    in an internal representation for later prediction."""

predict_sense(token_lemma, tag, sent, token_id, verbose=False, discriminate_POS=True):
    """Takes token_lemma and tag as strings, the whole sent as a sequence of strings (forms or
    lemmas), and token_id indicating the index of the token in the sentence. Returns the decisions
    made when using subsets 1, 2, 3, 4 of relations, as tuples (estimated probability, sense,
    retrieved_sense_count, considered_sense_count), or None where no decision was made.
    retrieved_sense_count indicates how many senses were found for the lemma, and
    considered_sense_count for how many of them we were able to find a neighborhood with the given
    subset of relations. If discriminate_POS is set, only senses whose Wordnet POS matches the
    provided tag will be considered."""
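
Putting the pieces together, a rough accuracy check on a sense-annotated corpus
could look like the sketch below. It assumes, as in the example output near the
top of this document, that the sense symbol is the first element of each
decision, and it only looks at the decision made with relation subset 4 (taken
here to be the fourth element of the returned tuple):

from gibber.wsd import add_word_neighborhoods, predict_sense, sense_match
from gibber.annot_corp_loader import load_wn3_corpus

sents, words_and_tags = load_wn3_corpus(path_to_corpus)  # path as in wsd_config.py
add_word_neighborhoods(words_and_tags)

correct, decided = 0, 0
for sent in sents:
    lemmas = [lemma for (form, lemma, true_sense, tag) in sent]
    for token_id, (form, lemma, true_sense, tag) in enumerate(sent):
        if not true_sense:       # skip tokens without a gold sense annotation
            continue
        decisions = predict_sense(lemma, tag, lemmas, token_id)
        decision = decisions[3]  # assumption: subset 4 (all relations)
        if decision is None:
            continue
        decided += 1
        predicted_sense = decision[0]  # assumption: the sense symbol comes first
        if sense_match(predicted_sense, true_sense):
            correct += 1

print('accuracy on decided tokens:', correct / decided if decided else 0.0)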