yingkaisha/keras-unet-collection

Patch Embedding Dimension in TransUNet

parniash opened this issue

I think the patch embedding dimension in your implementation might be larger than the one reported in the original TransUNet paper.

In the original TransUNet, it is mentioned that "For the “base” model, the hidden size D, number of layers, MLP size, and number of heads are set to be 12, 768, 3072, and 12, respectively".
The hidden size D is the embedding dimension of the transformer's output, which according to that sentence is set to 12 (unless the authors made a mistake in the order of the numbers). However, in your implementation it is set to 768. Am I missing something?

I feel that they made a mistake here. Embedding size = hidden size D = 768 is essentially the default hyperparameter choice inherited from BERT-base. Given the tensor size prior to embedding (height times width times channels), 12 is far too low.
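To make the size argument concrete, here is a minimal sketch (not code from this repo) of a ViT-style linear patch embedding. It assumes 16x16 patches on a 224x224 RGB input purely for illustration; TransUNet actually embeds CNN feature-map patches, but the arithmetic is the same: each flattened patch already carries hundreds of values, so projecting to D = 12 would be a severe bottleneck, while D = 768 matches the BERT/ViT-Base hidden size.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

patch_size = 16          # assumed ViT-style patch size (illustrative)
channels = 3             # assumed RGB input; TransUNet embeds CNN feature maps instead
patch_dim = patch_size * patch_size * channels   # 16*16*3 = 768 values per flattened patch

embed_dim = 768          # hidden size D, as in BERT/ViT-Base
# embed_dim = 12         # the value quoted in the paper; would compress 768 values down to 12

# linear patch embedding: a Dense projection applied to each flattened patch
patch_embedding = keras.layers.Dense(embed_dim, name="patch_embedding")

# one dummy image split into non-overlapping patches: (batch, num_patches, patch_dim)
dummy_patches = np.random.rand(1, (224 // patch_size) ** 2, patch_dim).astype("float32")
embedded = patch_embedding(dummy_patches)
print(embedded.shape)    # (1, 196, 768) -- the token sequence fed into the transformer
```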

That makes sense; it's most likely a mistake in the paper. One more question: in the paper, what do they mean by "number of layers", which is listed as 768?

My guess:

"number of layers" = number of transformers = 12
"MLP size" = number of MLP nodes per transformer = 3072
"number of heads" = number of self-attention heads = 12
"hidden size D" = number of embedded dimensions = 768

There are no guarantees, though; to be certain, you can contact the original TransUNet authors.