This is the code that was written and used as part of the lexicography project. It is shared solely for the sake of comprehensibility and reproducibility of the results, so it may seem complicated or inconvenient in places. In addition, some scripts were adapted to handle different data, which is why in some cases only certain data can be processed.
Due to its size, the data directory is not included in this repository. It can be downloaded [here].
This directory contains example data for testing the functionality of the scripts.
All final results of the models' predictions are included in full.
It is subdivided into `/annotation_results`, `/corpora`, `/dictionary`, and `/outputs`.
The downloaded `/data` directory should be placed in the root directory of the project.
- `/annotation_results` contains the results of both human annotation phases and thus also the final results of the models' predictions.
- `/corpora` contains all four corpus types (historical and modern for both languages), as well as processed versions in which every sentence is tokenized and lemmatized using spaCy (a minimal spaCy sketch follows this list).
- `/dictionary` contains both WordNet and the Swedish dictionary, as well as versions in which unique sense identifiers were added to distinguish between different senses.
- `/outputs` contains all files that are generated by executing the scripts.
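For illustration, here is a minimal sketch of the kind of spaCy preprocessing described above. The model name (`en_core_web_sm`) and the output format are assumptions for the sketch, not the project's actual configuration:

```python
# Minimal sketch: tokenize and lemmatize corpus sentences with spaCy.
# The model name and the (tokens, lemmas) output format are illustrative
# assumptions, not the project's actual configuration.
import spacy

# Disable pipeline components not needed for lemmatization to speed things up.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def process(sentences):
    """Yield a (tokens, lemmas) pair for each input sentence."""
    for doc in nlp.pipe(sentences):
        tokens = [tok.text for tok in doc]
        lemmas = [tok.lemma_ for tok in doc]
        yield tokens, lemmas

for tokens, lemmas in process(["The mice were eating the cheese."]):
    print(tokens, lemmas)
```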
Notebooks in this directory are used to analyze and transform data.
- `sample_data.ipynb` samples word usages of dictionary headwords from a corpus.
- `reduce_sample.ipynb` reduces the sampled word usages to a set maximum per headword.
- `generate_wsbest.ipynb` generates all files needed for human annotation in PhiTag.
- `reduce_sense_file.ipynb` removes duplicates from the `senses.tsv` file for PhiTag (a minimal deduplication sketch follows this list).
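A minimal sketch of such a deduplication step, assuming `pandas` and a tab-separated file; the file path and the choice to deduplicate on all columns are illustrative, as the notebook may key on a specific sense-identifier column:

```python
# Minimal sketch: drop duplicate rows from a senses.tsv file for PhiTag.
# Deduplicating on all columns is an assumption; the notebook may instead
# key on a specific sense-identifier column (see the commented alternative).
import pandas as pd

senses = pd.read_csv("senses.tsv", sep="\t")
senses = senses.drop_duplicates()  # or: senses.drop_duplicates(subset=["identifier"])
senses.to_csv("senses_deduplicated.tsv", sep="\t", index=False)
```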
- `extract_context.ipynb` extracts a context (gloss/examples) from the dictionary entries.
- `xl_model_embeddings.ipynb` uses the extracted context to generate sense embeddings (see the hedged sketch after this list).
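A hedged sketch of how a sense embedding can be derived from an extracted context with a Hugging Face transformer. The checkpoint (`xlm-roberta-base`) and mean pooling over the last hidden state are assumptions and may differ from the notebook's actual model and pooling strategy:

```python
# Minimal sketch: turn a dictionary sense's context (gloss + examples)
# into a single embedding vector. The checkpoint and the mean-pooling
# strategy are assumptions, not necessarily the notebook's actual setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

@torch.no_grad()
def sense_embedding(context: str) -> torch.Tensor:
    enc = tokenizer(context, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]     # (seq_len, dim)
    mask = enc["attention_mask"][0].unsqueeze(-1)  # mask out padding tokens
    return (hidden * mask).sum(0) / mask.sum()     # mean over real tokens

vec = sense_embedding("bank: the land alongside a river. 'We sat on the bank.'")
print(vec.shape)  # torch.Size([768])
```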
- `sort_training_data.ipynb` filters usable usages from human annotation phase 1 for model tuning.
- `generate_gold_splits.ipynb` randomly divides the usable data into known/unknown splits for training purposes.
- `vectorize_annotations.ipynb` generates usage embeddings from the word usages of human annotation phase 1.
- `cross_validation.ipynb` performs cross-validation of a model on the training data generated by `generate_gold_splits.ipynb`.
- `sample_data.ipynb` takes and cleans a sample from the corpora for the model prediction.
- `model_prediction.ipynb` predicts, for all usages in the sample, whether they are covered by the dictionary, based on the tuned threshold (a sketch of this decision follows the list).
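A minimal sketch of the thresholded coverage decision, assuming cosine similarity between a usage embedding and the headword's sense embeddings. The function names and the threshold value are illustrative; the real threshold is the one tuned via cross-validation:

```python
# Minimal sketch: a usage counts as covered by the dictionary if its
# embedding is close enough to at least one sense embedding of the
# headword. Names and the threshold value are illustrative; the real
# threshold is obtained by tuning/cross-validation.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_covered(usage_vec, sense_vecs, threshold=0.5):
    """Return (is_covered, best_similarity) against the nearest sense."""
    best = max(cosine(usage_vec, s) for s in sense_vecs)
    return best >= threshold, best

# Toy example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
usage = rng.normal(size=768)
senses = [rng.normal(size=768) for _ in range(3)]
print(predict_covered(usage, senses, threshold=0.5))
```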
- `pre_sort_samples.ipynb` filters the models' predictions and sorts them by similarity to the nearest sense (a minimal sorting sketch follows this list).
- `build_annotation_data.ipynb` generates all files needed for human annotation in PhiTag.
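A minimal sketch of sorting predictions by their similarity to the nearest sense; the record layout is an assumption:

```python
# Minimal sketch: order predicted usages by similarity to the nearest
# sense. The record layout is an assumption about the notebook's data.
predictions = [
    {"usage": "sentence A", "best_similarity": 0.41},
    {"usage": "sentence B", "best_similarity": 0.73},
    {"usage": "sentence C", "best_similarity": 0.58},
]
predictions.sort(key=lambda rec: rec["best_similarity"], reverse=True)
print(predictions)
```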
- `ws_best_analysis.ipynb` analyses the results of the human annotations.