This is the code that was written and used as part of the lexicography project. It is shared solely for the sake of comprehensibility and reproducibility of the results, so it may seem complicated or inconvenient in places. In addition, some scripts were adapted to handle different data, which is why in some cases only certain data can be processed.
Due to its size, the data directory is not included in this repository. It can be downloaded [here].
This directory contains example data for testing the functionality of the scripts.
All final results of the models' predictions are included in full.
It is subdivided into `/annotation_results`, `/corpora`, `/dictionary`, and `/outputs`.
The downloaded `/data` directory should be placed in the root directory of the project.
- `/annotation_results` contains the results of both human annotation phases and thus also the final results of the models' predictions.
- `/corpora` contains all four corpus types (historical and modern for both languages), as well as processed versions in which every sentence is tokenized and lemmatized using spaCy (a minimal spaCy sketch follows this list).
- `/dictionary` contains both WordNet and the Swedish dictionary, as well as versions in which unique sense identifiers were added to distinguish between different senses.
- `/outputs` contains all files that are generated by executing the scripts.
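For illustration, here is a minimal sketch of the kind of spaCy preprocessing described above. The model name (`en_core_web_sm`) and the output format are assumptions for the sketch, not the project's actual configuration:

```python
# Minimal sketch: tokenize and lemmatize corpus sentences with spaCy.
# The model name and the (tokens, lemmas) output format are illustrative
# assumptions, not the project's actual configuration.
import spacy

# Disable pipeline components not needed for lemmatization to speed things up.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def process(sentences):
    """Yield a (tokens, lemmas) pair for each input sentence."""
    for doc in nlp.pipe(sentences):
        tokens = [tok.text for tok in doc]
        lemmas = [tok.lemma_ for tok in doc]
        yield tokens, lemmas

for tokens, lemmas in process(["The mice were eating the cheese."]):
    print(tokens, lemmas)
```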
Notebooks in this directory are used to analyze and transform data.
- `sample_data.ipynb` samples word usages of dictionary headwords from a corpus.
- `reduce_sample.ipynb` reduces the sampled word usages to a set maximum per headword.
- `generate_wsbest.ipynb` generates all files needed for human annotation in PhiTag.
- `reduce_sense_file.ipynb` removes duplicates from the `senses.tsv` file for PhiTag (a minimal deduplication sketch follows this list).
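A minimal sketch of such a deduplication step, assuming `pandas` and a tab-separated file; the file path and the choice to deduplicate on all columns are illustrative, as the notebook may key on a specific sense-identifier column:

```python
# Minimal sketch: drop duplicate rows from a senses.tsv file for PhiTag.
# Deduplicating on all columns is an assumption; the notebook may instead
# key on a specific sense-identifier column (see the commented alternative).
import pandas as pd

senses = pd.read_csv("senses.tsv", sep="\t")
senses = senses.drop_duplicates()  # or: senses.drop_duplicates(subset=["identifier"])
senses.to_csv("senses_deduplicated.tsv", sep="\t", index=False)
```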
- `extract_context.ipynb` extracts a context (gloss/examples) from the dictionary entries.
- `xl_model_embeddings.ipynb` uses the extracted context to generate sense embeddings (see the hedged sketch after this list).
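A hedged sketch of how a sense embedding can be derived from an extracted context with a Hugging Face transformer. The checkpoint (`xlm-roberta-base`) and mean pooling over the last hidden state are assumptions and may differ from the notebook's actual model and pooling strategy:

```python
# Minimal sketch: turn a dictionary sense's context (gloss + examples)
# into a single embedding vector. The checkpoint and the mean-pooling
# strategy are assumptions, not necessarily the notebook's actual setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

@torch.no_grad()
def sense_embedding(context: str) -> torch.Tensor:
    enc = tokenizer(context, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]     # (seq_len, dim)
    mask = enc["attention_mask"][0].unsqueeze(-1)  # mask out padding tokens
    return (hidden * mask).sum(0) / mask.sum()     # mean over real tokens

vec = sense_embedding("bank: the land alongside a river. 'We sat on the bank.'")
print(vec.shape)  # torch.Size([768])
```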
- `sort_training_data.ipynb` filters usable usages from human annotation phase 1 for model tuning.
- `generate_gold_splits.ipynb` randomly divides the usable data into known/unknown splits for training purposes.
- `vectorize_annotations.ipynb` generates usage embeddings from the word usages of human annotation phase 1.
- `cross_validation.ipynb` performs cross-validation of a model on the training data generated by `generate_gold_splits.ipynb`.
- `sample_data.ipynb` takes and cleans a sample from the corpora for the model prediction.
- `model_prediction.ipynb` predicts, for all usages in the sample, whether they are covered by the dictionary, based on the tuned threshold (a sketch of this decision follows the list).
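A minimal sketch of the thresholded coverage decision, assuming cosine similarity between a usage embedding and the headword's sense embeddings. The function names and the threshold value are illustrative; the real threshold is the one tuned via cross-validation:

```python
# Minimal sketch: a usage counts as covered by the dictionary if its
# embedding is close enough to at least one sense embedding of the
# headword. Names and the threshold value are illustrative; the real
# threshold is obtained by tuning/cross-validation.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_covered(usage_vec, sense_vecs, threshold=0.5):
    """Return (is_covered, best_similarity) against the nearest sense."""
    best = max(cosine(usage_vec, s) for s in sense_vecs)
    return best >= threshold, best

# Toy example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
usage = rng.normal(size=768)
senses = [rng.normal(size=768) for _ in range(3)]
print(predict_covered(usage, senses, threshold=0.5))
```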
- `pre_sort_samples.ipynb` filters the models' predictions and sorts them by similarity to the nearest sense (a minimal sorting sketch follows this list).
- `build_annotation_data.ipynb` generates all files needed for human annotation in PhiTag.
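A minimal sketch of sorting predictions by their similarity to the nearest sense; the record layout is an assumption:

```python
# Minimal sketch: order predicted usages by similarity to the nearest
# sense. The record layout is an assumption about the notebook's data.
predictions = [
    {"usage": "sentence A", "best_similarity": 0.41},
    {"usage": "sentence B", "best_similarity": 0.73},
    {"usage": "sentence C", "best_similarity": 0.58},
]
predictions.sort(key=lambda rec: rec["best_similarity"], reverse=True)
print(predictions)
```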
- `ws_best_analysis.ipynb` analyses the results of the human annotations.