To Do
- additional experimental conditions:
    - discourse markers / discourse + laughter
    - freezing the utterance encoder
    - in-domain pre-training for BERT
    - GloVe aggregation for utterances (see the pooling sketch after this list)
        - BiLSTM
        - CNN / average pool?
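
A minimal sketch of the GloVe aggregation condition, assuming simple average pooling over pre-trained vectors; the file path, dimensionality, and function names are placeholders, not part of the current code.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Read GloVe vectors into a dict mapping token -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

def encode_utterance(tokens, vectors, dim=300):
    """Average-pool the vectors of in-vocabulary tokens; zeros if none are found."""
    hits = [vectors[t] for t in tokens if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)
```

The BiLSTM / CNN variants would replace the average pool with a learned aggregator over the same token vectors.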
- methodological improvements:
    - use the customized BERT vocab / WordPiece tokenization for the baseline models as well as for BERT (see the sketch below)
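
A minimal sketch of the idea, assuming the HuggingFace BertTokenizer is available; custom-vocab.txt is a placeholder for the customized vocab file. The point is only that the baselines should consume the same word pieces as the BERT encoder.

```python
from transformers import BertTokenizer

# Load the same WordPiece vocab that the BERT encoder uses.
tokenizer = BertTokenizer(vocab_file="custom-vocab.txt", do_lower_case=True)

def tokenize_for_baseline(utterance):
    """Tokenize an utterance into word pieces for the baseline models."""
    return tokenizer.tokenize(utterance)
```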
- additional corpora:
- improve reporting and analysis:
    - macro F1 / macro precision? See Guillou et al., 2016 (thanks, Sharid!); a metrics sketch follows this list
    - majority class baseline / tag distribution
    - time to train
    - number of parameters / task-trained parameters
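
A minimal sketch of the extra reporting, assuming scikit-learn is acceptable; the gold/pred arguments are placeholder lists of DA tags.

```python
from collections import Counter
from sklearn.metrics import f1_score, precision_score

def report_macro(gold, pred):
    """Macro-averaged F1 and precision over DA tags (cf. Guillou et al., 2016)."""
    print("macro F1:       ", f1_score(gold, pred, average="macro"))
    print("macro precision:", precision_score(gold, pred, average="macro"))

def majority_baseline(train_tags, n_test):
    """Predict the most frequent training tag for every test utterance."""
    majority_tag, _ = Counter(train_tags).most_common(1)[0]
    return [majority_tag] * n_test

def tag_distribution(tags):
    """Relative frequency of each DA tag, for the tag-distribution table."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}
```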
- not super exciting but maybe we should try:
    - DAR model hyperparameter tuning (hidden_size, n_layers, dropout, use_lstm)
    - play with the learning rate
    - use the BERT Adam optimiser (implements a warm-up); see the sketch after this list
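
A minimal sketch of swapping in the BERT Adam optimiser, assuming the pytorch-pretrained-bert package; the model, learning rate, warm-up proportion, step count, and search grid below are placeholder guesses to tune, not the current settings.

```python
import torch
from pytorch_pretrained_bert.optimization import BertAdam

model = torch.nn.Linear(768, 10)   # stand-in for the DAR model + encoder
num_train_steps = 10000            # total optimisation steps (placeholder)

# BertAdam warms the learning rate up linearly for the first `warmup` fraction
# of steps, then decays it linearly over the remaining steps.
optimizer = BertAdam(
    model.parameters(),
    lr=2e-5,                       # starting learning rate to tune
    warmup=0.1,                    # fraction of steps spent warming up
    t_total=num_train_steps,
)

# Candidate grid for the DAR model hyperparameters mentioned above (values are guesses).
search_space = {
    "hidden_size": [100, 256, 512],
    "n_layers": [1, 2],
    "dropout": [0.0, 0.2, 0.5],
    "use_lstm": [True, False],
    "lr": [1e-3, 1e-4, 2e-5],
}
```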
- probably future work:
    - probing tasks of the hidden layer
        - predict dialogue end (or turns to end)
        - predict turn change
    - dialogue model pre-training
        - instead of training the dialogue model to predict DAs directly, predict the encoder representation of the next utterance (unsupervised); see the sketch after this list
        - test/probe by guessing DAs (or other discourse properties) with an additional linear layer
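
A minimal sketch of the pre-training idea, not the project's actual classes: a stand-in dialogue model is trained to regress the (frozen) encoder representation of the following utterance, and a separate linear layer can then probe its hidden states for DA tags; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class DialogueModel(nn.Module):
    """Stand-in dialogue-level model: a GRU over utterance encodings."""
    def __init__(self, enc_dim=768, hidden_size=256):
        super().__init__()
        self.rnn = nn.GRU(enc_dim, hidden_size, batch_first=True)
        self.to_next_enc = nn.Linear(hidden_size, enc_dim)  # predicts the next utterance encoding

    def forward(self, utt_encodings):              # (batch, n_utts, enc_dim)
        hidden, _ = self.rnn(utt_encodings)         # (batch, n_utts, hidden_size)
        return hidden

def pretraining_loss(model, utt_encodings):
    """MSE between the predicted and actual encoding of the following utterance."""
    hidden = model(utt_encodings[:, :-1])           # predict from all but the last utterance
    preds = model.to_next_enc(hidden)
    targets = utt_encodings[:, 1:].detach()         # the (frozen) encoder output is the target
    return nn.functional.mse_loss(preds, targets)

# Probing: freeze the pre-trained dialogue model and fit only a linear layer on its
# hidden states to guess DA tags (or other discourse properties).
n_da_tags = 10                                      # placeholder tag-set size
probe = nn.Linear(256, n_da_tags)
```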