To Do
- additional experimental conditions:
    - discourse markers / discourse + laughter
    - freezing the utterance encoder
    - in-domain pre-training for BERT
    - GloVe aggregation for utterances (see the pooling sketch after this list)
        - BiLSTM
        - CNN / average pool?
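
A minimal sketch of the GloVe aggregation condition, assuming simple average pooling over pre-trained vectors; the file path, dimensionality, and function names are placeholders, not part of the current code.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Read GloVe vectors into a dict mapping token -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

def encode_utterance(tokens, vectors, dim=300):
    """Average-pool the vectors of in-vocabulary tokens; zeros if none are found."""
    hits = [vectors[t] for t in tokens if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)
```

The BiLSTM / CNN variants would replace the average pool with a learned aggregator over the same token vectors.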
- methodological improvements:
    - use the customized BERT vocab / WordPiece tokenization for the baseline models as well as for BERT (see the sketch below)
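
A minimal sketch of the idea, assuming the HuggingFace BertTokenizer is available; custom-vocab.txt is a placeholder for the customized vocab file. The point is only that the baselines should consume the same word pieces as the BERT encoder.

```python
from transformers import BertTokenizer

# Load the same WordPiece vocab that the BERT encoder uses.
tokenizer = BertTokenizer(vocab_file="custom-vocab.txt", do_lower_case=True)

def tokenize_for_baseline(utterance):
    """Tokenize an utterance into word pieces for the baseline models."""
    return tokenizer.tokenize(utterance)
```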
- additional corpora:
- improve reporting and analysis:
    - macro F1 / macro precision? See Guillou et al., 2016 (thanks, Sharid!); a metrics sketch follows this list
    - majority class baseline / tag distribution
    - time to train
    - number of parameters / task-trained parameters
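
A minimal sketch of the extra reporting, assuming scikit-learn is acceptable; the gold/pred arguments are placeholder lists of DA tags.

```python
from collections import Counter
from sklearn.metrics import f1_score, precision_score

def report_macro(gold, pred):
    """Macro-averaged F1 and precision over DA tags (cf. Guillou et al., 2016)."""
    print("macro F1:       ", f1_score(gold, pred, average="macro"))
    print("macro precision:", precision_score(gold, pred, average="macro"))

def majority_baseline(train_tags, n_test):
    """Predict the most frequent training tag for every test utterance."""
    majority_tag, _ = Counter(train_tags).most_common(1)[0]
    return [majority_tag] * n_test

def tag_distribution(tags):
    """Relative frequency of each DA tag, for the tag-distribution table."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}
```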
- not super exciting but maybe we should try:
    - DAR model hyperparameter tuning (hidden_size, n_layers, dropout, use_lstm)
    - play with the learning rate
    - use the BERT Adam optimiser (implements a warm-up); see the sketch after this list
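
A minimal sketch of swapping in the BERT Adam optimiser, assuming the pytorch-pretrained-bert package; the model, learning rate, warm-up proportion, step count, and search grid below are placeholder guesses to tune, not the current settings.

```python
import torch
from pytorch_pretrained_bert.optimization import BertAdam

model = torch.nn.Linear(768, 10)   # stand-in for the DAR model + encoder
num_train_steps = 10000            # total optimisation steps (placeholder)

# BertAdam warms the learning rate up linearly for the first `warmup` fraction
# of steps, then decays it linearly over the remaining steps.
optimizer = BertAdam(
    model.parameters(),
    lr=2e-5,                       # starting learning rate to tune
    warmup=0.1,                    # fraction of steps spent warming up
    t_total=num_train_steps,
)

# Candidate grid for the DAR model hyperparameters mentioned above (values are guesses).
search_space = {
    "hidden_size": [100, 256, 512],
    "n_layers": [1, 2],
    "dropout": [0.0, 0.2, 0.5],
    "use_lstm": [True, False],
    "lr": [1e-3, 1e-4, 2e-5],
}
```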
- probably future work:
    - probing tasks of the hidden layer
        - predict dialogue end (or turns to end)
        - predict turn change
    - dialogue model pre-training
        - instead of training the dialogue model to predict DAs directly, predict the encoder representation of the next utterance (unsupervised); see the sketch after this list
        - test/probe by guessing DAs (or other discourse properties) with an additional linear layer
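
A minimal sketch of the pre-training idea, not the project's actual classes: a stand-in dialogue model is trained to regress the (frozen) encoder representation of the following utterance, and a separate linear layer can then probe its hidden states for DA tags; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class DialogueModel(nn.Module):
    """Stand-in dialogue-level model: a GRU over utterance encodings."""
    def __init__(self, enc_dim=768, hidden_size=256):
        super().__init__()
        self.rnn = nn.GRU(enc_dim, hidden_size, batch_first=True)
        self.to_next_enc = nn.Linear(hidden_size, enc_dim)  # predicts the next utterance encoding

    def forward(self, utt_encodings):              # (batch, n_utts, enc_dim)
        hidden, _ = self.rnn(utt_encodings)         # (batch, n_utts, hidden_size)
        return hidden

def pretraining_loss(model, utt_encodings):
    """MSE between the predicted and actual encoding of the following utterance."""
    hidden = model(utt_encodings[:, :-1])           # predict from all but the last utterance
    preds = model.to_next_enc(hidden)
    targets = utt_encodings[:, 1:].detach()         # the (frozen) encoder output is the target
    return nn.functional.mse_loss(preds, targets)

# Probing: freeze the pre-trained dialogue model and fit only a linear layer on its
# hidden states to guess DA tags (or other discourse properties).
n_da_tags = 10                                      # placeholder tag-set size
probe = nn.Linear(256, n_da_tags)
```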