allenai/sequential_sentence_classification

Benchmark models from the paper

Closed this issue · 2 comments

Hi!

I am reproducing the models implemented in your paper "Pretrained Language Models for Sequential Sentence Classification". It is really great work, and we have managed to implement your novel models successfully in TensorFlow. We are, however, unsure about the implementation details of the benchmark models, i.e. BERT + Transformer and BERT + Transformer + CRF. Could you be so kind as to clarify some questions about these models? I couldn't find them in this repo.

In the paper, you say:
“We compare our approach with two strong BERT-based baselines, finetuned for the task. The first baseline, BERT+Transformer, uses the [CLS] token to encode individual sentences as described in Devlin et al. (2018). We add an additional Transformer layer over the [CLS] vectors to contextualize the sentence representations over the entire sequence. The second baseline, BERT+Transformer+CRF, additionally adds a CRF layer. Both baselines split long lists of sentences into splits of length 30 using the method in §2 to fit into the GPU memory.”

Regarding the input shape for these benchmark models, are the sentences also concatenated together with [SEP] tokens? Or are they fed into BERT separately, with the batch size being the number of sentences?

Regarding BERT+Transformer, could you explain how exactly it was implemented?
My guess is that you feed each sentence into BERT separately, extract the [CLS] token from each item in the batch, and then pass those vectors through an encoder layer followed by dense and softmax layers. Is my reasoning correct? Was a decoder layer used as well?

Regarding BERT+Transformer+CRF, where exactly did you put the CRF layer? Was it right after the encoder layer, after the dense layer, or maybe after the softmax layer?

Kind regards,
Kacper Kubara

Regarding the input shape for these benchmark models, are the sentences also concatenated together with [SEP] tokens? Or are they fed into BERT separately, with the batch size being the number of sentences?

For the input to the benchmark models, each sentence is separately embedded using BERT, and then the representation of the [CLS] token is used.
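
For concreteness, here is a minimal sketch of that step, written with the Hugging Face `transformers` API purely for illustration (the repo itself uses AllenNLP, so the model name and API here are assumptions for the example, not the actual implementation):

```python
# Illustrative sketch: embed each sentence separately with BERT and keep the
# [CLS] vector. The model name and API are assumptions for this example only.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = ["First sentence of the abstract.", "Second sentence of the abstract."]

# Each sentence is its own sequence, so the batch dimension is the number of
# sentences rather than one long [SEP]-joined sequence.
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    output = bert(**batch)

# [CLS] is the first token of every sequence: shape (num_sentences, hidden_size).
cls_embeddings = output.last_hidden_state[:, 0, :]
```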

Regarding BERT+Transformer, could you explain how exactly it was implemented?

This is where it is applied:

`embedded_sentences = self.self_attn(embedded_sentences, sent_mask)`

It uses this module (encoder only): https://github.com/allenai/allennlp/blob/v0.9.0-3971/allennlp/modules/seq2seq_encoders/stacked_self_attention.py
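
If it helps, a rough PyTorch sketch of this baseline (assumptions only, not the repo's exact code; the AllenNLP stacked self-attention module linked above is encoder-only, so no decoder is involved) could look like this:

```python
# Illustrative sketch of BERT+Transformer: a Transformer encoder over the
# per-sentence [CLS] vectors, followed by a linear layer to the label space.
# Hyperparameters below are placeholders, not the paper's values.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 5  # placeholder values

encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True)
sentence_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
classifier = nn.Linear(hidden_size, num_labels)

# cls_embeddings: (num_documents, num_sentences, hidden_size)
cls_embeddings = torch.randn(2, 30, hidden_size)
sent_mask = torch.ones(2, 30, dtype=torch.bool)  # True for real sentences

# src_key_padding_mask expects True at padded positions, hence the negation.
contextualized = sentence_encoder(cls_embeddings, src_key_padding_mask=~sent_mask)
logits = classifier(contextualized)   # (num_documents, num_sentences, num_labels)
probs = logits.softmax(dim=-1)        # per-sentence label distribution
```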

Regarding BERT+Transformer+CRF, where exactly did you put the CRF layer? Was it right after the encoder layer, after the dense layer, or maybe after the softmax layer?

The CRF is applied in the model code after both the transformer encoder layer referenced above and the final feedforward layer that has output size equal to the number of labels. Note that it is applied to the logits, not to the softmax output.
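
For illustration, here is a sketch of that placement using AllenNLP's `ConditionalRandomField` (again an assumption about the exact code, and the API details can differ between AllenNLP versions):

```python
# Illustrative sketch: the CRF consumes the per-sentence logits produced by the
# final feedforward (linear) layer, not softmax probabilities.
import torch
from allennlp.modules import ConditionalRandomField

num_labels = 5
crf = ConditionalRandomField(num_labels)

# logits: (num_documents, num_sentences, num_labels) from the linear layer
logits = torch.randn(2, 30, num_labels)
labels = torch.randint(0, num_labels, (2, 30))
sent_mask = torch.ones(2, 30, dtype=torch.bool)

# Training: the CRF returns a log-likelihood, so the loss is its negation.
loss = -crf(logits, labels, sent_mask)

# Inference: Viterbi decoding over the logits.
best_paths = crf.viterbi_tags(logits, sent_mask)
predicted_labels = [tags for tags, score in best_paths]
```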

Great, that clarifies a lot. Thanks!