
Master's thesis on "Analyzing Slot and Intent Detection for Upper German Dialects via Zero-Shot Transfer Learning" and paper on "Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study"

Primary language: TeX. License: MIT.

BaySIDshot

This repository contains the code and submodules used for my Master's thesis, entitled Analyzing Slot and Intent Detection for Upper German Dialects via Zero-Shot Transfer Learning.

The portmanteau BaySIDshot derives from the creation of a new Bavarian test and validation set for the dialect spoken in the Munich region, built to further analyze zero-shot transfer learning performance on slot and intent detection (SID) for this and other Upper German dialects. Both de-by files (test and validation, in .conll format) are located at the root of this repository. This work thus presents a parallel extension to the Upper Bavarian dataset translated and annotated by Winkler et al. (2024), and likewise builds on and extends the xSID approach and data format introduced by van der Goot et al. (2021b). Running the baseline and extended experiments requires cloning all submodules recursively, in particular MaChAmp by van der Goot et al. (2021a). The following command loads all submodules, including those containing auxiliary task data.

git clone https://github.com/XaverKrueckl/BaySIDshot.git --recurse-submodules
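Once the repository is cloned, the de-by .conll files can be inspected with a few lines of Python. The following is only a minimal sketch: it assumes a CoNLL-style layout in which comment lines start with `#`, token rows are tab-separated and a blank line ends a sentence, and it assumes the slot label sits in the last column. The function names `read_sid_conll` and `slot_label_counts` are hypothetical helpers and not part of this repository.

```python
import sys
from collections import Counter
from pathlib import Path


def read_sid_conll(path):
    """Read sentences from a CoNLL-style SID file.

    Assumed layout (not confirmed by this README): comment lines start
    with '#', token rows are tab-separated, a blank line ends a sentence.
    """
    sentences, comments, tokens = [], [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens or comments:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            tokens.append(line.split("\t"))
    # Flush the last sentence when the file does not end with a blank line.
    if tokens or comments:
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


def slot_label_counts(sentences, slot_column=-1):
    """Count the labels in one column of the token rows (assumed: last column)."""
    return Counter(row[slot_column] for s in sentences for row in s["tokens"])


if __name__ == "__main__":
    # Pass the path(s) of the .conll file(s) to inspect on the command line.
    for conll_path in sys.argv[1:]:
        parsed = read_sid_conll(conll_path)
        print(conll_path, len(parsed), "sentences")
        print(slot_label_counts(parsed).most_common(10))
```

If the slot labels turn out to live in a different column, `slot_column` can be changed accordingly.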

As the main approach to analyzing and enhancing zero-shot SID transfer learning to Upper German dialects, auxiliary task training is performed on four task types drawn from three Bavarian target-language datasets. Concretely, these are a Bavarian Universal Dependencies (UD) set by Blaschke et al. (2024), a Bavarian Named Entity Recognition (NER) set by Peng et al. (2024) and Masked Language Modeling (MLM) data taken as preprocessed sentences from Artemova and Plank (2023).

To recreate the baseline and extended experiment results, the respective notebooks must be run in Google Colaboratory. A Pro subscription to the cloud-based service is recommended for the baseline and required for the extended experiments, which need larger and more stable GPU resources. A Google account with access to Google Drive is likewise suggested, so that models and outputs can be saved outside the runtime environment.

On start, the notebooks mount Google Drive and then clone this repository recursively. If required, data preparation scripts from the respective scripts directory are run. The created datasets, whose paths are set accordingly in the MaChAmp configurations, exist only during runtime. If their creation fails, data that needs to be pre-processed but is openly available is provided in a manual data directory. To use this data, the data paths in the configuration files need to be adjusted. Please also cite the respective data source in this case!
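The path adjustment for the manual data can also be scripted instead of edited by hand. The following is a rough sketch under stated assumptions: the dataset configuration is taken to be a JSON object that maps dataset names to settings, with data locations stored under keys ending in `_data_path`. The helper `repoint_data_paths` is hypothetical, and the prefixes passed to it are placeholders that have to match the actual directory names.

```python
import json
from pathlib import Path


def repoint_data_paths(config_path, old_prefix, new_prefix, output_path=None):
    """Swap a path prefix in every '*_data_path' entry of a JSON config.

    Assumption: the config maps dataset names to dicts whose data
    locations are stored under keys ending in '_data_path'.
    Returns a list of (dataset, key, new_value) for each changed entry.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    changed = []
    for dataset_name, settings in config.items():
        if not isinstance(settings, dict):
            continue
        for key, value in settings.items():
            is_path_key = key.endswith("_data_path") and isinstance(value, str)
            if is_path_key and value.startswith(old_prefix):
                settings[key] = new_prefix + value[len(old_prefix):]
                changed.append((dataset_name, key, settings[key]))
    # Overwrite in place unless a separate output file is requested.
    target = Path(output_path) if output_path else config_path
    target.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return changed
```

Writing to a separate `output_path` keeps the original configuration untouched, which makes it easy to switch back once the data preparation scripts work again.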

After installing the modules required by MaChAmp, the notebook checks for GPU access and general operability. For each experiment, the respective configuration and parameter files are inspected before the train command is started. Once fine-tuning has finished, each resulting model in the log files is saved to Google Drive under the experiment name. Before prediction with the final model starts, the respective evaluation data is prepared via scripts; it is also available as gold files in the manual data. A rather complex evaluation cell prepares a script that loads the final model only once and evaluates it on each evaluation file from the prepared set. The predicted output files are saved alongside the respective model on Google Drive. A separate evaluation script then collects the results in a clear JSON file containing three objects, depending on the extent of the evaluation set. After an experiment has been run on multiple random seeds, further scripts can be used to average over these runs, to convert the averages into a .csv document for use in a LaTeX table generator and to produce confusion matrices for the intent classification results. In their current state, the notebooks run all experiments with random_seed=1234. The other two seeds used are 6543 and 8446. These need to be set for each experiment and in the respective model names!
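The averaging step over the three seeds can be illustrated as follows. This is not the script shipped in this repository, only a sketch of the idea: it assumes that the scores of each run have already been extracted into a flat mapping from metric name to value. The metric names in the usage note and the helpers `average_over_seeds` and `write_summary_csv` are hypothetical.

```python
import csv
from statistics import mean

SEEDS = [1234, 6543, 8446]  # the three seeds named in this README


def average_over_seeds(results_by_seed):
    """Average flat {metric: value} dicts, one dict per seed.

    Only metrics that are present in every run are averaged.
    """
    runs = list(results_by_seed.values())
    if not runs:
        return {}
    shared_metrics = set(runs[0]).intersection(*runs[1:])
    return {
        metric: mean(run[metric] for run in runs)
        for metric in sorted(shared_metrics)
    }


def write_summary_csv(summary, csv_path):
    """Write one 'metric,mean' row per metric, rounded to four decimals."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "mean"])
        for metric, value in summary.items():
            writer.writerow([metric, f"{value:.4f}"])
```

A call could look like `write_summary_csv(average_over_seeds({seed: scores[seed] for seed in SEEDS}), "summary.csv")`, where `scores` is a placeholder for whatever structure holds the per-seed results, for example `{"intent_accuracy": ..., "slot_f1": ...}` per seed.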

Finally, the TeX and Bib files, the style guide, the figures and the final thesis PDF, produced with the cloud-based LaTeX editor Overleaf, can be found here.

When using this work or data utilized in it, please cite the respective papers! For questions and access to unpublished NER data, please contact me!

Cheers, Xaver