L13-Attention
Attention Mechanisms in Neural Networks
Introduction
What happens when seq2seq models have to process very long sequences? Squeezing the whole input into a single fixed-size context vector becomes a bottleneck.
Attention Mechanisms
The core idea is to compute a weighted sum of the input features, where the weights (attention coefficients) are produced by the model itself.
Mathematically, we do not actually care whether the input is a sequence or not.
Given a grid of hidden states $h_{i,j}$ (e.g., CNN features for image captioning) and the previous decoder state $s_{t-1}$, we compute the context vector $c_t$ as follows:
$$
\begin{aligned}
e_{t,i,j} &= f_{att}(s_{t-1}, h_{i,j}) \\
a_{t,:,:} &= softmax(e_{t,:,:}) \\
c_t &= \sum_{i,j} a_{t,i,j}\, h_{i,j}
\end{aligned}
$$
- “Show, Attend and Tell” (ICML 2015), which set off the “X, Attend, and Y” naming trend 😆
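A minimal PyTorch sketch of this step, assuming $f_{att}$ is a plain dot product (the paper itself uses a small MLP); the function and variable names here are mine:

```python
import torch
import torch.nn.functional as F

def grid_attention(s_prev, h):
    """s_prev: (D,) decoder state s_{t-1};  h: (H, W, D) grid of features h_{i,j}."""
    H, W, D = h.shape
    e = torch.einsum('d,ijd->ij', s_prev, h)          # e_{t,i,j}; here f_att = dot product
    a = F.softmax(e.flatten(), dim=0).reshape(H, W)   # a_{t,:,:}: softmax over all positions
    c = torch.einsum('ij,ijd->d', a, h)               # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a

c, a = grid_attention(torch.randn(512), torch.randn(7, 7, 512))  # e.g. 7x7 CNN features
```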
Attention Layers
We then abstract this mechanism into a general attention layer, since so much work has shown that attention is a crucial component of neural networks.
Inputs:
- Query vector $q$, Shape: ($D_Q$,)
- Input vectors $X$, Shape: ($N_X$, $D_Q$)
- Similarity function $f_{att}$; initially a (scaled) dot product
Computation:
- Similarity: $e$, Shape: ($N_X$,), usually $e_i = q \cdot X_i / \sqrt{D_Q}$
- Attention weights: $a = softmax(e)$, Shape: ($N_X$,)
- Output vector: $y = \sum_i a_i X_i$, Shape: ($D_Q$,)
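A sketch of this single-query case in PyTorch (shapes and names are made up):

```python
import torch
import torch.nn.functional as F

def attend(q, X):
    """q: (D_Q,) query;  X: (N_X, D_Q) inputs  ->  y: (D_Q,)."""
    D_Q = q.shape[0]
    e = X @ q / D_Q ** 0.5       # similarities, (N_X,)
    a = F.softmax(e, dim=0)      # attention weights, (N_X,)
    return a @ X                 # weighted sum of inputs, (D_Q,)

y = attend(torch.randn(64), torch.randn(10, 64))
```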
Now we turn this into matrix form:
Inputs:
- Query matrix $Q$, Shape: ($N_Q$, $D_Q$)
- Input matrix $X$, Shape: ($N_X$, $D_Q$)
Computation:
- Similarity: $E = QX^T/\sqrt{D_Q}$, Shape: ($N_Q$, $N_X$)
- Attention weights: $A = softmax(E, dim=1)$, Shape: ($N_Q$, $N_X$), one distribution per query row
- Output matrix: $Y = AX$, Shape: ($N_Q$, $D_Q$)
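The same computation vectorized over $N_Q$ queries, as a sketch (dimensions are made up):

```python
import torch
import torch.nn.functional as F

Q = torch.randn(5, 64)       # (N_Q, D_Q)
X = torch.randn(10, 64)      # (N_X, D_Q)

E = Q @ X.T / 64 ** 0.5      # (N_Q, N_X)
A = F.softmax(E, dim=1)      # each row of A is one attention distribution
Y = A @ X                    # (N_Q, D_Q)
```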
$X$ is used twice (once for similarity, once for the output), so we separate those two roles with learned projection matrices to make the layer clearer and more flexible… KQV! 😉
Inputs:
- Query matrix $Q$, Shape: ($N_Q$, $D_Q$)
- Input matrix $X$, Shape: ($N_X$, $D_X$)
- Key matrix $W_K$, Shape: ($D_X$, $D_Q$)
- Value matrix $W_V$, Shape: ($D_X$, $D_V$)
Computation:
- Key vectors: $K = XW_K$, Shape: ($N_X$, $D_Q$)
- Value vectors: $V = XW_V$, Shape: ($N_X$, $D_V$)
- Similarity: $E = QK^T/\sqrt{D_Q}$, Shape: ($N_Q$, $N_X$)
- Attention weights: $A = softmax(E, dim=1)$, Shape: ($N_Q$, $N_X$)
- Output matrix: $Y = AV$, Shape: ($N_Q$, $D_V$); each output row is a weighted sum of the value vectors
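A sketch of this layer as a PyTorch module (the class name and hyperparameters are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key projection (D_X, D_Q)
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value projection (D_X, D_V)
        self.d_q = d_q

    def forward(self, Q, X):
        K = self.W_K(X)                     # (N_X, D_Q)
        V = self.W_V(X)                     # (N_X, D_V)
        E = Q @ K.T / self.d_q ** 0.5       # (N_Q, N_X)
        A = F.softmax(E, dim=1)             # (N_Q, N_X)
        return A @ V                        # (N_Q, D_V)

layer = AttentionLayer(d_x=128, d_q=64, d_v=64)
Y = layer(torch.randn(5, 64), torch.randn(10, 128))
```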
Self-Attention Layers
One query per input vector: the queries are now also computed from $X$.
Inputs:
- Input matrix $X$, Shape: ($N_X$, $D_X$)
- Key matrix $W_K$, Shape: ($D_X$, $D_Q$)
- Value matrix $W_V$, Shape: ($D_X$, $D_V$)
- Query matrix $W_Q$, Shape: ($D_X$, $D_Q$)
Computation:
- Query vectors: $Q = XW_Q$, Shape: ($N_X$, $D_Q$)
- Key vectors: $K = XW_K$, Shape: ($N_X$, $D_Q$)
- Value vectors: $V = XW_V$, Shape: ($N_X$, $D_V$)
- Similarity: $E = QK^T/\sqrt{D_Q}$, Shape: ($N_X$, $N_X$)
- Attention weights: $A = softmax(E, dim=1)$, Shape: ($N_X$, $N_X$)
- Output matrix: $Y = AV$, Shape: ($N_X$, $D_V$); each output row is a weighted sum of the value vectors
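A sketch of the full layer, plus a quick check of the permutation behavior discussed next (all names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)
        self.W_K = nn.Linear(d_x, d_q, bias=False)
        self.W_V = nn.Linear(d_x, d_v, bias=False)
        self.d_q = d_q

    def forward(self, X):                       # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.T / self.d_q ** 0.5           # (N_X, N_X)
        A = F.softmax(E, dim=1)
        return A @ V                            # (N_X, D_V)

sa = SelfAttention(d_x=128, d_q=64, d_v=64)
X = torch.randn(10, 128)
perm = torch.randperm(10)
# permuting the inputs just permutes the outputs:
assert torch.allclose(sa(X)[perm], sa(X[perm]), atol=1e-5)
```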
What happens if we change the order of the input vectors?
- all the outputs stay the same, just permuted accordingly (self-attention is permutation-equivariant)
- so self-attention really operates on a SET of vectors, not a sequence
If order matters, we add positional encodings so each vector carries information about its position in the sequence, e.g., by concatenating a position embedding onto the input with torch.cat (or adding it element-wise); a sketch follows.
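One simple sketch following the torch.cat hint: concatenate a learned position embedding onto each input vector (names and sizes are mine; fixed sinusoidal encodings added element-wise are another common choice):

```python
import torch
import torch.nn as nn

N, D_X, D_P = 10, 128, 16
X = torch.randn(N, D_X)

pos = nn.Embedding(N, D_P)(torch.arange(N))   # one learned encoding per position
X_pos = torch.cat([X, pos], dim=1)            # (N, D_X + D_P): each row now carries its index
```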
Masked Self-Attention Layers
Force the model to use only past information and ignore the future, e.g., when predicting the next word. This can be done at the hidden-state level or, more commonly, at the similarity level: set $e_{ij} = -\infty$ for all future positions $j > i$ before the softmax, so their attention weights become 0.
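A sketch of masking at the similarity level (shapes are made up):

```python
import torch
import torch.nn.functional as F

N, D = 10, 64
Q = K = V = torch.randn(N, D)

E = Q @ K.T / D ** 0.5
future = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)  # True where j > i
E = E.masked_fill(future, float('-inf'))    # future similarities -> -inf
A = F.softmax(E, dim=1)                     # so a_{ij} = 0 for j > i
Y = A @ V
```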
Multi-Head Self-Attention Layers
Split the input vectors into $h$ equal chunks (heads): run $h$ independent self-attention operations in parallel, each with query dimension $D_Q / h$, then concatenate their outputs.
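A sketch of the split/merge bookkeeping, assuming $D_Q$ is divisible by $h$ (in practice one would reach for nn.MultiheadAttention; all names here are mine):

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(X, W_Q, W_K, W_V, h):
    """X: (N, D_X);  W_*: (D_X, D_Q) projections;  h: number of heads."""
    N = X.shape[0]
    D_Q = W_Q.shape[1]
    d = D_Q // h                                    # per-head dimension
    # project, then split the feature dim into h heads: (h, N, d)
    Q = (X @ W_Q).reshape(N, h, d).transpose(0, 1)
    K = (X @ W_K).reshape(N, h, d).transpose(0, 1)
    V = (X @ W_V).reshape(N, h, d).transpose(0, 1)
    E = Q @ K.transpose(1, 2) / d ** 0.5            # (h, N, N): one attention map per head
    A = F.softmax(E, dim=-1)
    Y = A @ V                                       # (h, N, d)
    return Y.transpose(0, 1).reshape(N, D_Q)        # merge heads back: (N, D_Q)

W_Q, W_K, W_V = (torch.randn(128, 64) for _ in range(3))
Y = multi_head_self_attention(torch.randn(10, 128), W_Q, W_K, W_V, h=4)
```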
Summary of Ways of Processing Sequences
- RNNs:
    - good at long sequences: after one RNN layer, the final hidden state "sees" the whole sequence
    - bad at parallelism: hidden states must be computed sequentially
- 1D CNNs:
    - bad at long sequences: you need to stack many conv layers before an output "sees" the whole sequence
    - good at parallelism: every output can be computed in parallel
- Self-Attention:
    - good at long sequences: after one layer, every output "sees" all inputs
    - good at parallelism: everything can be computed in parallel
    - bad at memory: the $N \times N$ attention matrix is expensive
“Attention Is All You Need” (NeurIPS 2017) 😆
Transformer!
then the Generative Pre-trained Transformer (GPT) model’s story begins…