If you want to run prepared experiments, review oprions in wsd_config.py and run experim.py (`python experim.py` or `python3 experim.py` depending on your environment). This document describes how to use `gibber` package. > from gibber.wsd import add_word_neighborhoods, predict_sense > from gibber.annot_corp_loader import load_wn3_corpus > sents, words_and_tags = load_wn3_corpus(path_to_corpus) > add_word_neighborhoods(words_and_tags) # a collection of (lemma, tag) pairs # predict the sense of word with index 1 in the sentence from sents collection > predict_sense('wyjść', 'praet:sg:m1:perf', [l for (w, l, s, t) in sents[5]], 1): (('82941_czasownik_10', tensor(1.00000e-02 * [[ 4.7619]]), 21, 21), None, ('82939_czasownik_16', tensor(1.00000e-02 * [[ 7.6923]]), 21, 21), ('10303_czasownik_1', tensor([[ 0.1197]]), 21, 21), ('82931_czasownik_14', tensor(1.00000e-02 * [[ 4.7619]]), 21, 0), ('10303_czasownik_1', tensor([[ 0.1197]]), 21, 21)) The returned list contains predictions made with various neighbor subsets. See below, and the paper, for explanation. NOTE it is required to first call `add_word_neighborhoods` on words you wish to disambiguate. This retrieves necessary information from Wordnet XML. It is best, whenever possible, to call this function on all the words you'll want to process at once. Experim.py uses functions from `gibber.annot_corp_loader` which can import supported corpora and provide a list of sentences and a set of unique lemmas. These are: load_skladnica_wn2(skladnica_path, skladnica_sections_index_path): """Returns a list of sentences, as list of lemmas, and a set of words. The first has pairs: (form, lemma, true_sense, tag), the second: (lemma, tag)""" load_wn3_corpus(annot_sentences_path) load_kpwr_corpus(kpwr_path) For meaning of the arguments consult wsd_config.py. Configuration ------------- See the file wsd_config in this directory. The comments there should contain necessary explanations. Conventions ----------- Forms and lemmas should be converted to lowercase. For referring to wordnet senses, we use a convention of symbols aaa_111, where aaa is the lexical id, and 111 is the variant number. If one of these is empty (as it is in the loaded sense-annotated data), it is enough for the other information to match. The `predict_senses` function will return full information, in the format aaa_POS_111, where POS is a POS type used in plWordnet. For comparing senses, the function `gibber.wsd.sense_match` can be used. It takes two arguments: the first should be a full sense specification, the second may be partial. Relation subsets ---------------- The `predict_sense` returns a tuple of decisions made by taking into account various subsets of Wordnet relations when collecting word neighborhoods. Broadly speaking, they can be described this way: 1 - synonymy. 2 - synonymy and antonymy. 3 - synonymy and meronymy(-like) relations. 4 - all of the above. Additionally, word descriptions can be used: 5 - only descriptions. 6 - combining evidence from descriptions with that from relations. Using unit descriptions ----------------------- If you want to use evidence from unit descriptions, you need to configure the Concraft parser (in wsd_config.py). It is recommended to run mass_parsing.py script earlier, which can parse all descriptions needed for tagging the corpus. See wsd_config.py for its configuration. Functions --------- These live in `gibber.wsd`. add_word_neighborhoods(words): """Retrieve neighbor senses from Wordnet for all given (lemma, tag) pairs and store them in internal representation for later prediction.""" def predict_sense(token_lemma, tag, sent, token_id, verbose=False, discriminate_POS=True): """Get token_lemma and tag as strings, the whole sent as a sequence of strings (forms or lemmas), and token_id indicating the index where the token is in the sentence. Return decisions made when using subsets 1, 2, 3, 4 of relations, as tuples (estimated probability, sense, retrieved_sense_count, considered_sense_count) or None's if no decision. Retrieved_sense_count indicates how many senses were found for the lemma, and considered_sense_count for how many we were able to find a neighborhood with the given subset of relations. If discriminate_POS is set, only senses of Wordnet POS matching the provided tag will be considered."""