practicum-fall-2023

Up-to-date repository: https://github.com/voorhs/dialogue-augmentation



Research topic: Generating Synthetic Data via Intent-Preserving and Intent-Corrupting Augmentations for Training Dialogue Embeddings
Type of research work: M1P
Author: Ilya Alekseev
Scientific advisor: Denis Kuznetsov
Scientific consultant (if any): degree, full name

Abstract

Text embeddings from pre-trained language models have been proven to be extraordinarily useful for various sentence-level tasks, such as pair classification, similarity estimation, and retrieval. Corresponding models are usually trained on large amounts of clean and diverse data using contrastive loss. Unfortunately, there are no such datasets for the domain of dialogue data. In this work, we describe the process of mining a synthetic dataset of dialogues for contrastive learning with hard negatives. We investigate various augmentation strategies for constructing dialogues with preserved or corrupted intents (positive and negative samples, respectively). To demonstrate the stated cleanliness and diversity, we train a dialogue encoder model and analyze its properties.
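The contrastive setup described above (positives from intent-preserving augmentations, hard negatives from intent-corrupting ones) can be sketched as an InfoNCE-style loss. The function below is a minimal pure-Python illustration of that objective, not the repository's actual implementation; the function name, the temperature value, and the use of cosine similarity are assumptions for the sketch.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.05):
    """Illustrative InfoNCE-style contrastive loss with hard negatives.

    anchor    -- embedding of the original dialogue (list of floats)
    positive  -- embedding of an intent-preserving augmentation
    negatives -- embeddings of intent-corrupting augmentations
    (names and temperature are hypothetical, for illustration only)
    """
    def cos(a, b):
        # cosine similarity between two vectors
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # the positive pair should score higher than every hard negative
    pos = math.exp(cos(anchor, positive) / temperature)
    negs = sum(math.exp(cos(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))
```

Minimizing this loss pulls intent-preserving variants toward the anchor while pushing intent-corrupting variants away, which is the property the mined dataset is meant to support.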

Research publications

Presentations at conferences on the topic of research

Software modules developed as part of the study

  1. A Python package mylib with the full implementation, available here.
  2. Code with all experiment visualisations, available here; it can be run in Colab.