keras-team/keras-nlp

Dropout is not called in the training regime in TransformerEncoder and others


Hi all,

Describe the bug
The call method of the TransformerEncoder layer does not take the training argument:

def call(self, inputs, padding_mask=None, attention_mask=None):

and so it does not pass it to the dropout layers:
x = self._self_attention_dropout(x)
x = self._feedforward_dropout(x)

If I understand it correctly, this means the dropout layers never actually drop anything, even during training.
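As a quick sanity check (a hypothetical snippet, assuming keras-nlp's public TransformerEncoder constructor arguments intermediate_dim, num_heads, and dropout), one can compare outputs with and without training mode; if dropout were truly never active, the two outputs would always match:

import numpy as np
import keras
import keras_nlp

# Hypothetical check, not part of the keras-nlp test suite.
inputs = keras.Input(shape=(4, 8))
outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=16, num_heads=2, dropout=0.5
)(inputs)
model = keras.Model(inputs, outputs)

x = np.random.uniform(size=(1, 4, 8)).astype("float32")
y_train = keras.ops.convert_to_numpy(model(x, training=True))
y_infer = keras.ops.convert_to_numpy(model(x, training=False))
# If dropout were never applied, these two outputs would be identical.
print("outputs identical:", np.allclose(y_train, y_infer))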

Moreover, there are many other places which do not pass training -- TransformerDecoder and FNetEncoder in layers, and quite a few models in models -- XLNetEncoder, BloomDecoder, GemmaDecoderBlock, etc.

Note that models built with the functional API should be fine -- there the training argument is passed automatically; however, when the subclassing API is used (i.e., a def call method is defined), it is not passed.
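For reference, the kind of explicit threading I had in mind for subclassed layers (a minimal hypothetical block, not the actual keras-nlp code) looks like this:

import keras

class TinyBlock(keras.layers.Layer):
    """Hypothetical block that forwards training explicitly to its dropout."""

    def __init__(self, units, dropout_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.dense = keras.layers.Dense(units)
        self.dropout = keras.layers.Dropout(dropout_rate)

    def call(self, inputs, training=None):
        x = self.dense(inputs)
        # Hand the flag down explicitly instead of relying on implicit propagation.
        return self.dropout(x, training=training)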

Note that this is one of the important differences between TF-Keras and Keras 3.

@fchollet I have taken the liberty of adding you here to verify this 🙇 (if I am correct, this will need a non-trivial effort to fix, and fine-tuning on Keras 3 will give suboptimal results until then).

Oh, sorry, I just realized how this now works in Keras 3 🤦‍♂️ -- the training flag is resolved and propagated in Layer.__call__ itself, so it reaches nested layers even when the enclosing call never mentions it: https://github.com/keras-team/keras/blob/ce06c6509db91f334168c66db2e7003101dcd749/keras/layers/layer.py#L743-L748
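For anyone finding this later, a minimal sketch of that behavior (a hypothetical layer, core Keras 3 only): a nested Dropout still picks up training from the surrounding call context, even though the outer call signature has no training argument.

import numpy as np
import keras

class NoTrainingArg(keras.layers.Layer):
    """Hypothetical layer mirroring the report: call neither accepts nor forwards training."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.dropout = keras.layers.Dropout(0.5)

    def call(self, inputs):
        # No training= here, mirroring TransformerEncoder.call in the report above.
        return self.dropout(inputs)

inputs = keras.Input(shape=(4,))
model = keras.Model(inputs, NoTrainingArg()(inputs))

x = np.ones((2, 4), dtype="float32")
# Dropout is active: entries are zeroed or scaled by 1 / (1 - rate),
# because the flag is taken from the call context set by model(..., training=True).
print(keras.ops.convert_to_numpy(model(x, training=True)))
# In inference mode the output is just the input.
print(keras.ops.convert_to_numpy(model(x, training=False)))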

Closing.