Albert0147/G-SFDA

About At

xyy-ict opened this issue · 2 comments

Good work! I have two questions:

  1. The generation of the sparse domain attention (SDA) vector in the code differs from the description in the paper. Why? In the paper there is an embedding layer, but in the code the SDA vector appears to be initialized directly and regularized by its norm.

  2. Why can the target domain attention be learned using only the source data? It seems that At is obtained by feeding source data through a different path with a differently initialized mask. Why does this work?

Hi there.

  1. Please read the code and its comments carefully; we do use the embedding layer as in the paper (feat_bootleneck_sdaE). Using randomly generated attention vectors also works, but leads to a slight performance degradation. (See the first sketch after this list.)

  2. As mentioned in Sec. 3.4: "first we train the model on Ds with the cross-entropy loss, with both source and target domain attention As, At, this is to provide a good initialization for target adaptation where only At is engaged". Target adaptation has to start from a good source model, which is why we train both As and At on the source data at the beginning. (See the second sketch below.)
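
For illustration only, here is a minimal sketch of the embedding-based attention idea. The class and parameter names (`BottleneckWithSDA`, `domain_embed`, the dimensions) are assumptions for this example, not the exact contents of `feat_bootleneck_sdaE` in the repo:

```python
import torch
import torch.nn as nn


class BottleneckWithSDA(nn.Module):
    """Hypothetical bottleneck that produces a sparse domain attention (SDA)
    vector from an embedding layer; names and sizes are illustrative."""

    def __init__(self, feature_dim=2048, bottleneck_dim=256, num_domains=2):
        super().__init__()
        self.bottleneck = nn.Linear(feature_dim, bottleneck_dim)
        self.bn = nn.BatchNorm1d(bottleneck_dim)
        # One learnable embedding per domain (source / target); a sigmoid
        # squashes it into a soft channel-wise attention mask.
        self.domain_embed = nn.Embedding(num_domains, bottleneck_dim)

    def forward(self, x, domain_id):
        feat = self.bn(self.bottleneck(x))
        # Look up the attention vector for the requested domain.
        idx = torch.full((x.size(0),), domain_id, dtype=torch.long, device=x.device)
        att = torch.sigmoid(self.domain_embed(idx))
        # Channel-wise masking of the bottleneck features.
        return feat * att, att
```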
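And a minimal sketch of the source-training step described in point 2, again with hypothetical names (`source_training_step`, `sparsity_weight`) rather than the repo's actual training loop: the same source batch is passed through both attention paths, so At already produces a good classifier before adaptation, and a norm penalty on the attention vector keeps it sparse.

```python
import torch.nn.functional as F


def source_training_step(backbone, bottleneck, classifier, optimizer,
                         images, labels, sparsity_weight=1e-3):
    """Train on source data with both attention paths (0: As, 1: At)."""
    feats = backbone(images)
    loss = 0.0
    for domain_id in (0, 1):
        masked, att = bottleneck(feats, domain_id)
        logits = classifier(masked)
        # Cross-entropy on both paths gives At a good initialization.
        loss = loss + F.cross_entropy(logits, labels)
        # Optional norm regularizer keeping the attention vector sparse.
        loss = loss + sparsity_weight * att.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```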

Thank you very much for answering my questions.

For 1, I'm sorry I didn't read the code carefully enough. The generation of the SDA vector in the code is indeed the same as in the paper.

For 2, I think I now partially understand why At works: since it already performs well on the source domain while being distinct from As, it provides a reasonable starting point that can then be adapted to the target domain.