Research task title: | Generating Synthetic Data via Intent-Preserving and Intent-Corrupting Augmentations for Training Dialogue Embeddings |
---|---|
Type of research work: | M1P |
Author: | Алексеев Илья Алексеевич |
Scientific advisor: | Кузнецов Денис |
Scientific consultant (if any): | degree, Last name First name Patronymic |
Text embeddings from pre-trained language models have proven highly useful for sentence-level tasks such as pair classification, similarity estimation, and retrieval. The underlying models are usually trained on large amounts of clean and diverse data with a contrastive loss. Unfortunately, no such datasets exist for the dialogue domain. In this work, we describe the process of mining a synthetic dataset of dialogues for contrastive learning with hard negatives. We investigate augmentation strategies that construct dialogues with preserved or corrupted intents (positive and negative samples, respectively). To demonstrate that the resulting data is clean and diverse, we train a dialogue encoder on it and analyze its properties.
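
To make the training signal concrete, here is a minimal sketch of a contrastive objective with hard negatives. It is an illustration only, not necessarily the loss used in this work: it assumes an InfoNCE-style formulation over cosine similarities, and the function names `cosine` and `contrastive_loss`, the temperature value, and the toy two-dimensional "embeddings" are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss (assumed form): negative log-softmax of the
    positive's similarity among the positive and all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical values): the intent-preserving augmentation
# stays close to the anchor, the intent-corrupting one drifts away from it.
anchor = [1.0, 0.0]
intent_preserved = [0.9, 0.1]
intent_corrupted = [0.8, 0.6]

loss_correct = contrastive_loss(anchor, intent_preserved, [intent_corrupted])
loss_swapped = contrastive_loss(anchor, intent_corrupted, [intent_preserved])
print(loss_correct < loss_swapped)
```

The loss is small when the anchor dialogue is closer to its intent-preserving augmentation than to the intent-corrupting one, and grows when the roles are swapped. Hard negatives matter because they keep the denominator from being dominated by trivially dissimilar samples.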