Hello, I think that if you want the additive attention to handle batched inputs, where the inputs are:
Inputs: query, value
- query (batch_size, q_len, hidden_dim): tensor containing the output features from the decoder.
- value (batch_size, v_len, hidden_dim): tensor containing features of the encoded input sequence
then the code in the forward function should be like this:

```python
def forward(self, query: Tensor, key: Tensor, value: Tensor):
    score = self.score_proj(
        torch.tanh(self.key_proj(key.unsqueeze(1)) + self.query_proj(query.unsqueeze(2)) + self.bias)
    ).squeeze()
    attn = F.softmax(score, dim=-1)
    context = torch.bmm(attn, value)
    return context, attn
```
Otherwise, the sizes of self.key_proj(key.unsqueeze(1)) and self.query_proj(query.unsqueeze(2)) will mismatch on the second dimension and the tensors cannot be added. With the unsqueezes, key becomes (batch_size, 1, k_len, hidden_dim) and query becomes (batch_size, q_len, 1, hidden_dim), so the sum broadcasts to (batch_size, q_len, k_len, hidden_dim) and the score ends up as (batch_size, q_len, k_len).
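To make the shape argument concrete, here is a minimal self-contained sketch of the batched additive attention described above. The layer names (query_proj, key_proj, score_proj, bias) follow the snippet, but the layer definitions and sizes are my assumptions, not necessarily the repo's; I also use squeeze(-1) instead of a bare squeeze() so a batch or q_len of size 1 is not accidentally collapsed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class AdditiveAttention(nn.Module):
    """Sketch of batched additive (Bahdanau-style) attention.

    Layer names mirror the snippet in this issue; exact layer
    configurations are illustrative assumptions.
    """

    def __init__(self, hidden_dim: int) -> None:
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.key_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.bias = nn.Parameter(torch.zeros(hidden_dim))
        self.score_proj = nn.Linear(hidden_dim, 1)

    def forward(self, query: Tensor, key: Tensor, value: Tensor):
        # key.unsqueeze(1):   (B, 1, k_len, H)
        # query.unsqueeze(2): (B, q_len, 1, H)
        # broadcast sum:      (B, q_len, k_len, H)
        score = self.score_proj(
            torch.tanh(
                self.key_proj(key.unsqueeze(1))
                + self.query_proj(query.unsqueeze(2))
                + self.bias
            )
        ).squeeze(-1)                    # (B, q_len, k_len)
        attn = F.softmax(score, dim=-1)  # normalize over key positions
        context = torch.bmm(attn, value) # (B, q_len, H)
        return context, attn


batch, q_len, k_len, hidden = 2, 3, 5, 8
attention = AdditiveAttention(hidden)
q = torch.rand(batch, q_len, hidden)
kv = torch.rand(batch, k_len, hidden)
context, attn = attention(q, kv, kv)
print(context.shape)  # torch.Size([2, 3, 8])
print(attn.shape)     # torch.Size([2, 3, 5])
```

Without the two unsqueeze calls, the projections would be (B, k_len, H) and (B, q_len, H), which cannot be added whenever q_len != k_len, which is exactly the mismatch this issue points out.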